Saturday 18 May 2013

Using BigQuery and Memcache To Improve Performance

Update on SocialSamplr

Been a while since my last post, but I've been very busy developing our social media scoring platform so it's been hard to find the time to write much.  Things have been progressing very well and socialsamplr is now on-boarded to the Creative HQ business incubator here in Wellington NZ and we're looking forward to accelerating things a lot from here on in.
We've been following the "lean startup" process and think we've got the information now to build a pretty cool and unique minimum viable product so stay tuned on how it progresses.  If anyone is interested in discussing what the product is going to offer or to even be beta testers please contact me at danshannon@socialsamplr.com.  Also, being a cash-strapped startup, if anyone's interested in the source code for the Google apps script stuff I was working on last year I'd be happy to provide this for a small fee along with some training data to create your own sentiment engine (kind of a "build your own sentiment engine starter pack").  Again, just drop me line and we can work something out (here's the link to see that in action https://script.google.com/macros/s/AKfycbxr5a4NUFBTVzxX3GJtW2tbiW3A6K5d06pVo4DqZD3fUT0_pITy/exec).

Otherwise, onto today's post.  As always, to see these stuff in action just visit www.socialsamplr.com.

Speeding up Response Times Using Caching

On the plus side I've been developing with a lot of the really cool features available in App Engine so I thought it'd be a good chance to share some of the "quirks" I've noticed on how to get things running well.  The focus of this blog will be on using memcache, bigquery and deferred tasks to enable your website to process high data loads while maintaining very high speed processing - instant processing of results has almost become the norm these days and so far from our site the experience we've had is that people expect nothing less.


So far in the last post I showed how it's possible to use deferred tasks to multi-thread and achieve improved performance (in our case for processing a large quantity of data quickly from multiple sources).

While the real-time sentiment scoring engine we've built provides a great way to search any subject from social media and get a score on how it is performing, for popular subjects it's far more efficient and user friendly to score them back-end and then cache the data for the user.  Also, the way App Engine is charged using the deferred tasks to run a large number of asynchronous processes does quickly lead to compute time being chewed up.  The screen shot below shows what happens when you fire off a large number of deferred tasks - you wind up with app engine spinning up numerous instances to service all the requests.  Pretty cool technology and also very powerful but like I said it can get a bit expensive on your computing resource.  The other thing to be aware of is being sure you handle your exceptions gracefully in code or else the deferred tasks can attempt multiple retries - again chewing into your compute time usage.



  

Using caching techniques you can effectively provide a close-to instant result which is then enriched further in the background.  The next time another searches for the same subject it then includes the extra data.  The way I think of it is a "virtuous circle" of data being updated in the background so it can be presented instantly to users on the front end.


Big Query

So the first question is where are you going to store the data you need to query - and this is where Google Big Query is perfect.  For the instance I'm describing we're querying up to 7 days of data at any time but even that, with the ambitions we have for the platform, means that we're potentially going to be dealing with some fairly large volumes of data and big query gives us the perfect platform to do this on.  The other option is to use the data store in App Engine but I prefer BigQuery as it has a built in query interface and is easily accessible in other ways if you want to do any further data mash-ups - for example with Google Apps Script.

So the steps I take with big query to load data for caching are as follows.  


  • First ensure the administration account for your application hosted on App Engine has the necessary rights in your Google Apps console to read and write to your instance of BigQuery
  • Test your app engine account has access to big query.  Using the Google App Engine API has the benefit where you won't need to OAuth your account in - it can effectively access Big Query directly.  Here's some sample code you can use to test with.  Otherwise, there's plenty of app engine documentation that covers how to do this.
import httplib2
from google.appengine.api import memcache
from apiclient.discovery import build
from oauth2client.appengine import AppAssertionCredentials

def logResult(message,source,subject):
    credentials = AppAssertionCredentials(
    scope='https://www.googleapis.com/auth/bigquery')
    http = credentials.authorize(httplib2.Http(memcache))
    service = build("bigquery", "v2", http=http)
    jobData={'projectId': 'xxxxxxxxxxx',
             'configuration': { 
                               'query':
                               {'query':  'SELECT  \''+ message + '\' as message,\'' + source + '\' as source,\'' + subject + '\' as subject,NOW() as messagetime', 'destinationTable': 
                                { 'projectId': 'xxxxxxxxx', 'datasetId': 'projectName','tableId': 'tableName'},
                                'createDisposition': 'CREATE_IF_NEEDED','writeDisposition': 'WRITE_APPEND','defaultDataset': 
                                {'datasetId': 'projectName', 'projectId': 'xxxxxxxxx'}}}}
    
    job_runner = service.jobs()
    results = job_runner.insert(projectId="xxxxxxxx",body=jobData).execute()
  • There's a couple of "gotchas" to be aware of here.  One thing I've noticed is when there is a lot of writes to a table in big query - as above - over time it does seem to lead to issues when querying the same table.  In my experience these take the form of "odd" errors that occur in your App Engine logs where it seems to just be unable to read from the table in Big Query.  To resolve this I log all the responses and then have a back-end process running hourly to load that data into a secondary table which will be used for querying - and once we have the data extracted from the logging table we can then go ahead and drop the table.  It leads to some slightly convoluted processing but from what I can see is Big Query is not really designed for transaction based processing and this approach is one that I've tested pretty thoroughly and works well.
  • Once you've established you can write and read to big query from your application you're ready for the next step in the process - using memcache to create a near-instant response for users.

Memcache

Using memcache is a great way to enhance the performance of your application.  Memcache is essentially an in-memory dictionary for the temporary storage of data in a key/value format.  By it's nature it's not stable for ongoing memory storage so it should be regularly cleared and refreshed for consistent application performance.  For the SocialSamplr real-time score we use it for two main purposes.

  1. Provide auto-complete functionality on the site of popular search topics (this is in combination with Twitter Bootstrap "typeahead" function).
  2. Allow pre-cached (refreshed hourly) results to be displayed for popular subjects.  This has the double benefit of greatly enhancing the user experience and also enabling me to control the amount of compute time and resources being used by my application.
So the way I've designed it for use in the real-time score element of SocialSamplr is as follows:

  • I have an hourly back-end process which refreshes the cache by clearing all the data in memcache using the "memcache.flush_all()" command.
  • I then load the necessary data to enable me to provide pre-cached results for the real-time score.  This is by no means all of the data held in BigQuery but just the data I need to provide the functionality.  Remember, bigquery allows aggregate queries to be run so enables you to get that to do a lot of the "heavy lifting" and can minimise the amount of data you need to store in memcache.  
  • To add data to memcache it's simply then a case of calling "memcache.add(key,value,timeout)".
  • The timeout is in milliseconds, but shouldn't really be relied on.  I set the value nice and high and it seems to hold the data in a pretty stable format.
The final piece of the puzzle is then displaying the cached data on the site.  First, for the auto-complete I have the following code in the base html for the page (note I'm using Django for my site, supported on App Engine).


<input id="subject" style="height:80px;width:60%" align="center" type="text"  data-provide="typeahead" data-items="4" data-source='{{getsubjects}}'/>

The {{getsubjects}} part is then replaced with the list of subjects which exists in memcache during the page load.  Subsequently, when a user posts their search request back it knows which key to look for in memcache and return in the results instantly.

So in summary, all fairly straightforward when it's pieced together and provides a powerful means of allowing pre-caching of large and dynamic datasets on your site.  And the best thing of all of course, being in the cloud, no infrastructure set-up to worry about at all.


All the best,
Daniel

No comments:

Post a Comment