Wednesday 11 December 2013

Automated Company Alerts from Twitter

Company Alerts from Twitter

Hot on the heels of our article on using movements in Twitter sentiment and volume as a precursor to Bitcoin price movements, we're excited to announce that we've set up automated alerts for any subject, which will be constantly monitored for spikes in sentiment and volume.  Check out QBE Insurance from Australia, which issued a profit warning on the 9th of December.

Here's the sentiment:


And here's the mention volume:

As with the Bitcoin scenario, there's a very nice correlation with the bad news that came out on the 9th of December.

How it all works

  • We configure the search criteria to ensure only the relevant tweets are extracted
  • SocialSamplr will figure out an average sentiment and mention count for a given subject over a default time-frame of a month (can be configured to any value).  
  • The average will then be broken down to a default time-frame of 5 minutes (again this will be fully configurable - for example lower volume subjects would have a longer time-frame than 5 minutes so there's enough time for the tweet volume to be significant).
  • If the tolerance level for tweet volume (default of 100%) or sentiment (default of 30%) is breached within that period, an alert is sent (a sketch of this check appears after the list).  Once again, the tolerance levels will be fully configurable.
  • Once an alert is sent the application will then go into a "silent" period where no more alerts will be sent for a configurable period of time (like 24 hours).
    • Once the "silent" period has ended the application will calculate the sentiment/mention average during this period.  
    • If it's significantly different from the last month's average (default of over 25%), the application will then use this as the new "baseline" average.  
    • Otherwise it will revert back to using the monthly average as above.
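
To make this concrete, here's a minimal sketch of the tolerance check in Python.  It's illustrative only - the defaults mirror the description above, and the function and variable names are placeholders rather than the actual SocialSamplr implementation (which also handles the baseline recalculation described in the nested bullets).

import time

# Illustrative defaults matching the description above - all configurable in practice.
VOLUME_TOLERANCE = 1.0       # alert when volume is 100% above the baseline
SENTIMENT_TOLERANCE = 0.3    # alert when sentiment moves 30% away from the baseline
SILENT_PERIOD_SECS = 24 * 60 * 60

def check_for_spike(baseline_volume, baseline_sentiment,
                    window_volume, window_sentiment, silent_until=0):
    """Return True if the current 5-minute window breaches either tolerance.

    baseline_* are the per-window averages derived from the monthly figures;
    window_* are the figures for the current window.  Assumes non-zero baselines.
    """
    if time.time() < silent_until:
        return False  # still inside the "silent" period after a previous alert

    volume_jump = (window_volume - baseline_volume) / float(baseline_volume)
    sentiment_shift = abs(window_sentiment - baseline_sentiment) / abs(baseline_sentiment)
    return volume_jump >= VOLUME_TOLERANCE or sentiment_shift >= SENTIMENT_TOLERANCE

# Example: a baseline of 40 mentions per window against a current window of 95.
if check_for_spike(40, 0.2, 95, 0.15):
    print("Spike detected - send the alert and start the silent period")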

If anyone wants to be an early adopter for this exciting new functionality (and hence get to use it for free), just get in touch.  Remember, we have access to every tweet, so you miss nothing!

Tuesday 3 December 2013

Well this is exciting...

Can't wait to see what a company like Apple plans to do now.... http://memeburn.com/2013/12/um-what-now-apple-buys-twitter-analytics-service-topsy-for-200m/

Sunday 1 December 2013

Trading Bitcoin Using Twitter Data

Finding Uses for Twitter Data

One of the more sobering lessons in getting this venture up and running has been the recurring pattern where we demonstrate the software, get loads of great feedback, and people use an evaluation account, but then at the end of the evaluation they struggle to see how they can actually use the data in their job.  This post covers a very interesting (and potentially very lucrative) use for social media data, accessed through www.socialsamplr.com.

The usual reasons people track and mine social media data tend to revolve around brand and marketing managers looking to measure how much social media chatter there is about their brand, how positive (or negative) it is, and how they compare to competitors.  We've also had interest from academic types who like to mine data, particularly around political news, to see if there's any correlation between social media activity and election results.  But today I'm going to look at ways social media data can actually be applied to making money directly - which is a much more attractive option.

Trading using Social Media Data

Traders of highly liquid equities (in particular ones prone to volatile changes in share price) have for some time now been using social media data to help drive decision making - in some cases fully automated - when buying or selling shares.  Just look at the momentary plunge the Dow took when the White House Twitter account was hacked and told the world the White House had been bombed.  By the time the world had worked out it was a load of bollocks, there would have been plenty of time for the first-in traders to short loads of stock and buy back in when the market had come to its senses.

Now, for the average punter, setting up a continuous feed to Gnip or Datasift and having software that can pick out the sentiment of a continuous stream of financial data would be onerous to say the least - game, set and match to the banks, you'd think.  Well, not quite.   We're currently developing a fully automated alert system for any subject on Twitter where the sentiment and/or volume "spikes" suddenly beyond a certain tolerance level.  This should all be ready to use on a plug-and-play basis from next year.  However, you can already use SocialSamplr as a basis to get precursors for market data, and to prove it let's take a look at something that's been in the news in a big way over the last month.

Bitcoin

In case you've been living in a mountain cavern over the last month with no access to electricity, you can read up about bitcoin here.  What I'm going to show is how SocialSamplr can actually reveal precursors to movements in the value of bitcoin.  It's a good example as it's topical (hence lots of stuff on Twitter about it), volatile as buggery, and since it's so open there's loads of data floating around about it which is accessible for free.  Needless to say, if you do actually use this for trading it's entirely at your own risk and I accept liability for nothing (and you'll need balls of steel - this stuff is very, very volatile).  Anyway, let's look at the price history of bitcoin over the last month:



It doesn't take Einstein to see the price climbed in a big way, specifically on the 19th of November, and pretty much went north from there.  So now let's take a look at SocialSamplr and some data around this time.  First off, let's check out mention volume and sentiment across November (you can query this yourself on the site, so you can run these for "real" yourself).  Volume, when graphed daily, largely matches the spike, except you'll see it starts to head north on the 18th of November - a precursor maybe?  

Volume of Twitter Activity:




Then check out the sentiment chart.  

Sentiment Score of Twitter Activity:




Now add in the fact that if you sign up to our enterprise account you can get minute-level reporting, and you can really measure when and how any "buzz" starts occurring on Twitter.  This is where things really do get interesting.  See the breakdown of hourly stats below (read through a JSON viewer) - at around 11 on the 17th of November (number 11 in the sequence; the date/times are Unix timestamps so need to be converted) you see a big jump in mention volume.  This is followed by an even bigger jump the next hour, to 1053.  Sentiment looks more up and down, so what we're seeing here is really a combination of volume increase and sentiment volatility.
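
If you're working through that JSON yourself, converting the Unix timestamps and spotting the jump takes only a few lines.  Here's a rough sketch - the field names and the first hour's numbers are assumptions for illustration, not the actual SocialSamplr response format:

import json
from datetime import datetime

# Hypothetical hourly stats in the style described above (field names are made up).
sample = ('[{"ts": 1384729200, "mentions": 412, "sentiment": 0.18},'
          ' {"ts": 1384732800, "mentions": 1053, "sentiment": 0.05}]')

previous = None
for hour in json.loads(sample):
    when = datetime.utcfromtimestamp(hour["ts"])  # Unix timestamp -> readable UTC time
    if previous and hour["mentions"] > 2 * previous["mentions"]:
        print("Volume more than doubled at %s UTC: %d mentions" % (when, hour["mentions"]))
    previous = hour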



So then we take a closer look at bitcoin prices around this time.  The really big climbs in price don't happen until well into the 18th and, in fact, even the 19th of November.


So, in summary, it looks like there is some pretty strong correlation between activity on Twitter and market movements - and in the example above it gives you a full 24 hours to get trading before the price starts to climb.  I think the key thing is that social media is a way to try and get the inside running before a big run up or down in price - once the run has started it's pretty much as useful as tits to a bull, since every Tom, Dick and Harry is then tweeting about it.  Personally I don't really see loads of value in paying a PhD in statistics to crunch the data for these purposes - the trend is obvious and it's more a case of moving quickly to maximise the opportunity.  There are well-documented cases of other stocks (like Netflix) showing a similar correlation.  An example of a stock which wouldn't be so great is the NZ tech darling Xero - the sentiment is consistently 90% positive and shows pretty much no correlation to the share price movements, so it's really a case of horses for courses.  We're predicting that over time, for highly liquid, highly "visible" stocks, this will become a fairly standard way of gleaning real-time market data.  We'll keep refining our trading algorithm and look at doing some equities investment off the back of it around the middle of next year - rest assured I will update on how that progresses.

Hope this was of interest (it certainly is to me) - get in touch if you want to look at using socialsamplr yourself to gain insight into market activity.  14 day free trial, best prices around for full access to the Twitter firehose. 

Whatever you do, don't wind up like this guy.






Wednesday 10 July 2013

Twitter Streaming Widget

Just a very quick note on this post - we're headlong into completing development of our initial product and have had some great commitments from our beta-test companies, so it's obviously a very exciting time for us.  We'll have a load of updates once we've done the release of the product.  

In the meantime, check out the prototype Twitter streaming widget - we'll be making this available as part of our product offering for people to plug into their site.  At the moment the prototype is hard-coded to "obama", but the final product will be configurable to any number of subjects in real time.  Check it out and if you're interested give us a shout about using it (just click "activate" to kick off the stream).

www.socialsamplr.com/twitterstream

Saturday 18 May 2013

Using BigQuery and Memcache To Improve Performance

Update on SocialSamplr

It's been a while since my last post, but I've been very busy developing our social media scoring platform, so it's been hard to find the time to write much.  Things have been progressing very well: SocialSamplr is now on-boarded to the Creative HQ business incubator here in Wellington, NZ, and we're looking forward to accelerating things a lot from here on in.
We've been following the "lean startup" process and think we now have the information to build a pretty cool and unique minimum viable product, so stay tuned on how it progresses.  If anyone is interested in discussing what the product is going to offer, or even in being a beta tester, please contact me at danshannon@socialsamplr.com.  Also, being a cash-strapped startup, if anyone's interested in the source code for the Google Apps Script stuff I was working on last year, I'd be happy to provide it for a small fee along with some training data to create your own sentiment engine (kind of a "build your own sentiment engine starter pack").  Again, just drop me a line and we can work something out (here's the link to see that in action https://script.google.com/macros/s/AKfycbxr5a4NUFBTVzxX3GJtW2tbiW3A6K5d06pVo4DqZD3fUT0_pITy/exec).

Otherwise, on to today's post.  As always, to see this stuff in action just visit www.socialsamplr.com.

Speeding up Response Times Using Caching

On the plus side, I've been developing with a lot of the really cool features available in App Engine, so I thought it'd be a good chance to share some of the "quirks" I've noticed and how to get things running well.  The focus of this post will be on using memcache, BigQuery and deferred tasks to enable your website to process high data loads while maintaining very fast responses - instant results have almost become the norm these days, and our experience so far is that visitors to the site expect nothing less.


In the last post I showed how it's possible to use deferred tasks to multi-thread and achieve improved performance (in our case for processing a large quantity of data quickly from multiple sources).

While the real-time sentiment scoring engine we've built provides a great way to search any subject from social media and get a score on how it's performing, for popular subjects it's far more efficient and user friendly to score them in the back end and then cache the data for the user.  Also, given the way App Engine is charged, using deferred tasks to run a large number of asynchronous processes quickly chews up compute time.  The screenshot below shows what happens when you fire off a large number of deferred tasks - you wind up with App Engine spinning up numerous instances to service all the requests.  Pretty cool technology, and also very powerful, but like I said it can get a bit expensive on your computing resources.  The other thing to be aware of is making sure you handle your exceptions gracefully in code, or else the deferred tasks can attempt multiple retries - again chewing into your compute time usage.



  

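On that exception-handling point, a minimal pattern (a sketch only - do_the_work is a made-up placeholder) is to catch anything you know won't succeed on a retry and raise deferred.PermanentTaskFailure, which tells App Engine not to re-queue the task:

import logging
from google.appengine.ext import deferred

def score_subject(subject):
    """Deferred task body - an unhandled exception here triggers automatic retries."""
    try:
        do_the_work(subject)  # hypothetical worker function
    except ValueError as e:
        # Bad input will never succeed on a retry - fail permanently instead.
        logging.error("Giving up on %s: %s", subject, e)
        raise deferred.PermanentTaskFailure(str(e))
    except Exception as e:
        # Transient problems (timeouts etc.) - log and let App Engine retry.
        logging.warning("Retrying %s after error: %s", subject, e)
        raise

deferred.defer(score_subject, "bitcoin")
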
Using caching techniques you can effectively provide a close-to-instant result which is then enriched further in the background.  The next time another user searches for the same subject, it then includes the extra data.  The way I think of it is as a "virtuous circle" of data being updated in the background so it can be presented instantly to users on the front end.


Big Query

So the first question is where you're going to store the data you need to query - and this is where Google BigQuery is perfect.  For the instance I'm describing, we're querying up to 7 days of data at any time, but even that, with the ambitions we have for the platform, means we're potentially going to be dealing with some fairly large volumes of data, and BigQuery gives us the perfect platform to do this on.  The other option is to use the App Engine datastore, but I prefer BigQuery as it has a built-in query interface and is easily accessible in other ways if you want to do any further data mash-ups - for example with Google Apps Script.

So the steps I take with BigQuery to load data for caching are as follows.  


  • First ensure the administration account for your application hosted on App Engine has the necessary rights in your Google Apps console to read and write to your instance of BigQuery
  • Test that your App Engine account has access to BigQuery.  Using the App Engine API has the benefit that you won't need to OAuth your account in - it can effectively access BigQuery directly.  Here's some sample code you can use to test with.  Otherwise, there's plenty of App Engine documentation that covers how to do this.
import httplib2
from google.appengine.api import memcache
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

def logResult(message, source, subject):
    # Use the App Engine service account - no interactive OAuth flow needed.
    credentials = AppAssertionCredentials(
        scope='https://www.googleapis.com/auth/bigquery')
    # memcache is passed in as the httplib2 cache to avoid repeated token fetches.
    http = credentials.authorize(httplib2.Http(memcache))
    service = build("bigquery", "v2", http=http)

    # Append a single row to the logging table by running a query job whose
    # result is written (WRITE_APPEND) into the destination table.
    # In production you'd want to escape/sanitise these string values first.
    jobData = {
        'configuration': {
            'query': {
                'query': "SELECT '" + message + "' as message, '" + source +
                         "' as source, '" + subject + "' as subject, NOW() as messagetime",
                'destinationTable': {'projectId': 'xxxxxxxxx',
                                     'datasetId': 'projectName',
                                     'tableId': 'tableName'},
                'createDisposition': 'CREATE_IF_NEEDED',
                'writeDisposition': 'WRITE_APPEND',
                'defaultDataset': {'projectId': 'xxxxxxxxx',
                                   'datasetId': 'projectName'}}}}

    job_runner = service.jobs()
    results = job_runner.insert(projectId="xxxxxxxx", body=jobData).execute()
    return results
  • There are a couple of "gotchas" to be aware of here.  One thing I've noticed is that when there are a lot of writes to a table in BigQuery - as above - over time it does seem to lead to issues when querying the same table.  In my experience these take the form of "odd" errors in your App Engine logs where it seems to just be unable to read from the table.  To resolve this I log all the responses and then have a back-end process running hourly to load that data into a secondary table which is used for querying - and once we have the data extracted from the logging table we can then go ahead and drop it (a sketch of this hourly roll-over appears after the list).  It leads to some slightly convoluted processing, but from what I can see BigQuery is not really designed for transaction-based processing, and this approach is one that I've tested pretty thoroughly and works well.
  • Once you've established you can write to and read from BigQuery from your application, you're ready for the next step in the process - using memcache to create a near-instant response for users.
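
Here's a rough sketch of that hourly roll-over, reusing the service object from the logging sample above.  The dataset and table names are placeholders, and in a real job you'd poll the copy job for completion before dropping the logging table:

def roll_logging_table(service):
    """Hourly back-end job: copy the logging table into the query table, then drop it."""
    copy_job = {
        'configuration': {
            'copy': {
                'sourceTable': {'projectId': 'xxxxxxxxx',
                                'datasetId': 'projectName',
                                'tableId': 'tableName'},
                'destinationTable': {'projectId': 'xxxxxxxxx',
                                     'datasetId': 'projectName',
                                     'tableId': 'queryTable'},
                'createDisposition': 'CREATE_IF_NEEDED',
                'writeDisposition': 'WRITE_APPEND'}}}
    service.jobs().insert(projectId='xxxxxxxxx', body=copy_job).execute()

    # In practice, wait for the copy job to finish (service.jobs().get(...)) before this.
    service.tables().delete(projectId='xxxxxxxxx',
                            datasetId='projectName',
                            tableId='tableName').execute()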

Memcache

Using memcache is a great way to enhance the performance of your application.  Memcache is essentially an in-memory dictionary for the temporary storage of data in a key/value format.  By its nature it's not stable for ongoing storage, so it should be regularly cleared and refreshed for consistent application performance.  For the SocialSamplr real-time score we use it for two main purposes.

  1. Provide auto-complete functionality on the site for popular search topics (in combination with the Twitter Bootstrap "typeahead" function).
  2. Allow pre-cached (refreshed hourly) results to be displayed for popular subjects.  This has the double benefit of greatly enhancing the user experience and also enabling me to control the amount of compute time and resources being used by my application.
So the way I've designed it for use in the real-time score element of SocialSamplr is as follows (a simplified sketch follows the list):

  • I have an hourly back-end process which refreshes the cache by clearing all the data in memcache using the "memcache.flush_all()" command.
  • I then load the data needed to provide pre-cached results for the real-time score.  This is by no means all of the data held in BigQuery, just the data I need to provide the functionality.  Remember, BigQuery allows aggregate queries, so you can get it to do a lot of the "heavy lifting" and minimise the amount of data you need to store in memcache.  
  • To add data to memcache it's simply then a case of calling "memcache.add(key,value,timeout)".
  • The timeout is in seconds, but shouldn't really be relied on.  I set the value nice and high and it seems to hold the data in a pretty stable way.
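
Pieced together, the hourly refresh looks roughly like this.  It's a simplified sketch - the key names and the popular_subjects structure are illustrative, not the production code:

from google.appengine.api import memcache

ONE_DAY = 24 * 60 * 60  # expiry in seconds; set high since the cache is flushed hourly anyway

def refresh_cache(popular_subjects):
    """Hourly back-end job.  popular_subjects is assumed to be a dict of
    subject -> pre-aggregated score pulled from BigQuery."""
    memcache.flush_all()

    # The list of subjects drives the typeahead auto-complete...
    memcache.add('subjects', sorted(popular_subjects.keys()), ONE_DAY)

    # ...and one entry per subject gives instant score lookups.
    for subject, score in popular_subjects.items():
        memcache.add('score:' + subject, score, ONE_DAY)

def cached_score(subject):
    """Return the pre-cached score for a subject, or None on a cache miss."""
    return memcache.get('score:' + subject)
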
The final piece of the puzzle is then displaying the cached data on the site.  First, for the auto-complete I have the following code in the base HTML for the page (note I'm using Django for my site, supported on App Engine).


<input id="subject" style="height:80px;width:60%" align="center" type="text"  data-provide="typeahead" data-items="4" data-source='{{getsubjects}}'/>

The {{getsubjects}} part is replaced with the list of subjects held in memcache during the page load.  Subsequently, when a user posts their search request back, the application knows which key to look for in memcache and returns the results instantly.
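
For completeness, the view side of this is only a couple of lines.  A sketch, assuming the 'subjects' memcache key from the refresh job above and a standard Django template render (the view and template names are made up):

import json
from django.shortcuts import render_to_response
from google.appengine.api import memcache

def home(request):
    # Pull the cached subject list for the typeahead; fall back to an empty list on a miss.
    subjects = memcache.get('subjects') or []
    # json.dumps produces the JavaScript-style array the data-source attribute expects.
    return render_to_response('home.html', {'getsubjects': json.dumps(subjects)})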

So, in summary, it's all fairly straightforward when it's pieced together, and it provides a powerful means of pre-caching large and dynamic datasets on your site.  And the best thing of all, of course: being in the cloud, there's no infrastructure set-up to worry about at all.


All the best,
Daniel

Thursday 10 January 2013

Ajax and deferred tasks on App Engine

Quick post this week.  Development of the new website has been progressing well so I thought I'd share some of the details on developing nice client-side functionality using Ajax style calls and deferred tasks.

Just to recap, the new version of the socialsamplr site is being developed in Python using the Django framework.  My previous posts have outlined how easily and securely it can be integrated with Google Apps Script.  This post will mainly be concerned with developing Ajax-style functionality in Python hosted on App Engine, and using deferred tasks in the server-side processes to ensure parallel processes can be run for performance.  

Coming from a .NET development background, I see this as similar to running multi-threaded server-side processes.  I don't want to give too much else away about what I'm specifically doing here as it touches on some of the IP I'm developing - suffice to say, running several server-side parallel processes from client-side JavaScript calls to produce a result is what's going on here.

Note I won't be covering how to secure your Ajax calls within your application in this post - I'm still finalising this in my app, but I will make a mention of the approach I'm using.

Basics

So the basics of creating Ajax calls in your Python web application are covered in the following excellent article: https://developers.google.com/appengine/articles/rpc.  It walks you through adding a basic adding-and-subtracting function to your web page and should be fairly easy to implement if you want to try it.  What you'll need to do breaks down into two parts.


  1. Create a server-side RPC process to take the client side call and return a result.
  2. Implement the relevant javascript code on the client to make the call to the RPC process and accept the result.

I'll cover the client side code first as it's fairly simple to implement, then I'll cover the server-side stuff where I do some funky functionality with the deferred tasks to enable parallel processing of the requests.

Client side code for enabling Ajax calls

To start with you'll need to include json2.js from http://www.json.org/js.html for this implementation (JSON is the most common data format for processing Ajax requests).  Then you'll need to add the JavaScript code required and wire up the relevant events to HTML controls on your page.  All pretty simple stuff if you've done any web development before, and not even that difficult if you're new to it.  

As my project uses the Django implementation, where I have a base.html file containing all core page code (for example the header, footer and common JavaScript code), I've split the JavaScript into two parts.  Part one, as follows, goes into the base.html file:

<script type="text/javascript" src="js/json2.js"></script>
    <script type="text/javascript">

    //
    // As mentioned at http://en.wikipedia.org/wiki/XMLHttpRequest
    //
    if( !window.XMLHttpRequest ) XMLHttpRequest = function()
    {
      try{ return new ActiveXObject("Msxml2.XMLHTTP.6.0") }catch(e){}
      try{ return new ActiveXObject("Msxml2.XMLHTTP.3.0") }catch(e){}
      try{ return new ActiveXObject("Msxml2.XMLHTTP") }catch(e){}
      try{ return new ActiveXObject("Microsoft.XMLHTTP") }catch(e){}
      throw new Error("Could not find an XMLHttpRequest alternative.")
    };

    //
    // Makes an AJAX request to a local server function w/ optional arguments
    //
    // functionName: the name of the server's AJAX function to call
    // opt_argv: an Array of arguments for the AJAX function
    //
    function Request(function_name, opt_argv) {

      if (!opt_argv)
        opt_argv = new Array();

      // Find if the last arg is a callback function; save it
      var callback = null;
      var len = opt_argv.length;
      if (len > 0 && typeof opt_argv[len-1] == 'function') {
        callback = opt_argv[len-1];
        opt_argv.length--;
      }
      var async = (callback != null);

      // Encode the arguments in to a URI
      var query = 'action=' + encodeURIComponent(function_name);
      for (var i = 0; i < opt_argv.length; i++) {
        var key = 'arg' + i;
        var val = JSON.stringify(opt_argv[i]);
        query += '&' + key + '=' + encodeURIComponent(val);
      }
      query += '&time=' + new Date().getTime(); // IE cache workaround

      // Create an XMLHttpRequest 'GET' request w/ an optional callback handler
      var req = new XMLHttpRequest();
      req.open('GET', '/rpc?' + query, async); // relative path; was hard-coded to localhost during development

      if (async) {
        req.onreadystatechange = function() {
          if(req.readyState == 4 && req.status == 200) {
            var response = null;
            try {
             response = JSON.parse(req.responseText);
            } catch (e) {
             response = req.responseText;
            }
            callback(response);
          }
        }
      }

      // Make the actual request
      req.send(null);
      
    }

    // Adds a stub function that will pass the arguments to the AJAX call
    function InstallFunction(obj, functionName) {
      obj[functionName] = function() { Request(functionName, arguments); }
    }

    </script>

Then add the code that calls the RPC function to the page where it's needed and wire it up to the controls.  In the case of my app this is the home.html file, which is rendered within the body of base.html.  It contains a button (doscore) which makes the call to the RPC function and a label (result) which takes the result.  So you'll see this is pretty straightforward - the button calls doScore(), which then calls server.LiveScore, passing in the subject being searched and the onScoreSuccess function as the callback.  server.LiveScore corresponds to the server-side RPC function implemented in the solution.


<script type="text/javascript">

    // Server object that will contain the callable methods
    var server = {};

    // Install 'LiveScore' as the name of a callable method
    InstallFunction(server, 'LiveScore');


    // Handy "macro"
    function $(id){
      return document.getElementById(id);
    }

    // Client function that calls a server rpc and provides a callback
    function doScore() {
      $('progress').style.display= 'block';
      $('placeholder').style.display= 'none';
      $('resultgroup').style.display= 'none';
      $('doscore').disabled = true;
      server.LiveScore($('subject').value, onScoreSuccess);
    }
 
    // Callback for after a successful doScore
    function onScoreSuccess(response) {
      $('progress').style.display= 'none';
      $('result').innerHTML = response;
       $('doscore').disabled = false;
      $('resultgroup').style.display= 'block';
    }

    </script>
    
<div id="realtimeScore" align="center" class="form-search">    
<input id="subject" type="text" placeholder="e.g. Barack Obama" />
<input id="doscore" type="button" value="Get Real-time Score" class="btn" onclick="doScore()" />
</div>
<div style="height:150px">
<div style="height:30px"></div>
<div  align="center">
<div  class="progress progress-striped active" id="progress" style="width: 10%;height:20px;display:none">
  <div  class="bar" style="width: 100%">Calculating score...</div>
</div></div>
<div style="height:30px" id="placeholder"></div>
<div id="resultgroup"  align="center" style="display:none;height:30px"> 
<label id="result" class="lead"></label>
<br>
<button class="btn btn-info">How was this calculated?</button>
</div>


An important point to note, which should be obvious, is that there's not really a security implementation here.  The one I will be implementing is to dynamically create a unique identifier key server-side, which is passed into the web page and subsequently passed back to the RPC function.  The RPC function then verifies the unique identifier and expires the key after a few minutes or once it's been used.  In short, you should always ensure you have a robust, and well tested, solution for security.  On the server side you should also sanitise any requests to ensure malicious requests that could find their way in are filtered out.
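
As a rough illustration of that token approach (not the final implementation), you could hand the page a random key, hold it in memcache with a short expiry, and check and delete it when the RPC call comes back:

import os
from google.appengine.api import memcache

TOKEN_LIFETIME = 5 * 60  # seconds

def issue_token():
    """Called while rendering the page; the returned token is embedded in the HTML."""
    token = os.urandom(16).encode('hex')  # Python 2.7, as used on App Engine
    memcache.add('rpctoken:' + token, True, TOKEN_LIFETIME)
    return token

def check_token(token):
    """Called at the top of the rpc handler; each token is valid exactly once."""
    key = 'rpctoken:' + token
    if memcache.get(key):
        memcache.delete(key)
        return True
    return False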

Server side code for handling client-side calls and using deferred tasks for parallel processing

So now that we've made the call from the JavaScript code, we need to handle it via a generic RPC function on the server.  Hence, implement the following code (again, note there's no security implementation at this point) in the views.py module.

import json
from django.http import HttpResponse

def rpc(request):
    # Generic dispatcher: the 'action' parameter names the server-side function to call.
    action = request.GET.get('action')
    if action == 'LiveScore':
        a = request.GET.get('arg0')   # the subject string passed from the client
        result = livescore.runSearch(a)
        return HttpResponse(json.dumps(result), content_type="text/plain")

So you can see you could implement multiple functions from here.  For example, if you wanted to add "CloseAccount", just call server.CloseAccount from the JavaScript and handle the "CloseAccount" action in the RPC function. 

So far, so simple.  To make things more interesting we'll now look at deferred tasks, which are a very powerful capability App Engine provides for you - and very easy to implement in your code.  The way I like to think of it is as similar to multi-threading, where you can easily control the volume and speed of request processing.  The first thing you'll need to do is create a queue.yaml file to define the speed and size of the queue processing (full explanation here: https://developers.google.com/appengine/docs/python/config/queue).  Mine looks like this at the moment:

# Set the total storage limit for all queues to 120MB
total_storage_limit: 120M
queue:
- name: default
  rate: 100/s
  bucket_size: 40

With this in place you can then use the deferred queuing functionality in your application.  This means you can spin up multiple calls and App Engine will do all the work of processing them in parallel, or queue them if the volume of queued processes is too high (based on your settings).  All your app has to do is kick back and collect the results.  Utilising this means your application can perform a lot of work server-side in parallel and return the results very quickly to the user.  To implement it, just start by importing the libraries as follows:


from google.appengine.api import taskqueue
from google.appengine.ext import deferred

Then, in the code below, we perform a URL get by calling processResult as a deferred task, passing the URL in as a parameter.  We can make multiple deferred calls like this and App Engine will handle them all in parallel.


baseurl='https://script.google.com/macros/s/xxxxxxxxxxxxxxx/exec'
    
url=baseurl + '?sm=' + str(sm) + '&sj=' + search + '&sn=' + sessionid  +  '&bc=12&bs=' + str(bs) + '&appkey=' + key
    
deferred.defer(processResult,url)
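
The processResult function itself isn't shown here, but a minimal version is just a URL fetch whose result gets handled server-side - something along these lines (purely illustrative; the real version does the scoring work):

import logging
from google.appengine.api import urlfetch

def processResult(url):
    """Runs as a deferred task: fetch the Apps Script URL and handle the response."""
    result = urlfetch.fetch(url, deadline=60)
    if result.status_code == 200:
        # In the real app the response would be scored and cached here.
        logging.info("Fetched %d bytes from %s", len(result.content), url)
    else:
        logging.warning("Fetch failed with status %d for %s", result.status_code, url)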


Social Samplr Survey  - $100 Amazon gift voucher up for grabs

As a side note, the new site is coming along nicely.  In the meantime, I'm offering a US$100 Amazon gift voucher to one person who fills out the short survey on what I've done so far on SocialSamplr - click the link here.  Obviously I will need your email address to send the voucher to if you're the lucky winner, but otherwise I'd really appreciate anyone who fills it out - it should take less than half a minute.

Thanks, until next time.