At VendAsta we develop, among other things, social networking applications deployed to Google AppEngine. One of the aspects of our suite is a feature where users (e.g., Facebook users) can trust experts from our StepRep software.

A desirable, and obvious, feature is to identify the experts that my friends trust, or conversely, which of my friends trust a particular set of experts. From a use case perspective, a lot of this information is jammed on to the screen at one time; that is, we cannot simply query for a single friend or a single expert.

This is a classic graph walk style problem that is relatively easy to scale in a parallel processing environment, but nasty where you can only do serialized queries.

Initially, we felt a bit limited by the AppEngine query model. It is very fast, but only allows for serialized queries.

In reaction, we began investigations into using a MapReduce mechanism to compute these “trust networks” offline. Specifically, we were planning on using MapReduce deployed into Amazon’s Elastic MapReduce engine. The results of these computations were to be stored on Amazon’s S3 (as JSON docs) and simply accessed statically from our AppEngine applications.

This solution, while sound, suffered some drawbacks. We had a difficult time with dealing with the recomputation of the trust networks. We could choose to recompute them daily, but this was wasteful of resources: many of the networks would be completely unchanged. Alternatively, we could queue jobs when new trusts are formed and process just those, but then we have to develop another job to determine which trust networks cascade from the new trust and need recomputation. The absolute nail in the coffin of this approach is that we (typically) use Facebook to define friend networks and their T&C do not allow the storage of friend relationships for any great amount of time, making offline processing difficult and incomplete at best.

We needed another solution that allowed for the real time computation of trust networks.

An architect at our company, Kevin Pierce, also happens to be one of the best reverse engineers I know. He started poking around the gory depths of the AppEngine source and discovered that all of the API calls stub out in MakeSyncCall. This inevitably led to the discovery of the partner MakeCall which yields an RPC object that you can wait on. Of course, if I can wait on one RPC object, I can also wait on many of them. A subsequent release of AppEngine capitalized on this fact allowing the developer to easily make parallel UrlFetch requests. There were some earlier code libraries that also facilitated parallel UrlFetch requests leveraging the same mechanism.

At Google I/O, Ryan Baldwin and I had the opportunity to corner Ryan Barrett (AppEngine’s supersmart Data Store guru) and chat with him about using this approach for other API interaction. Ryan seemed pretty excited about our exploration of this area and indicated that this was something they had intended to expose to developers but, like any successful software project, competing priorities kept getting in the way.

Encouraged, we set about developing code to allow developers to queue up a set of AppEngine API calls and then kick them off in parallel. In particular, for the use case that we were looking to solve was the ability to perform multiple Data Store Queries in parallel; this would allow us to query for each friend’s trusts in parallel and collect the results together.

An example invocation might look something like this:


# set up the async queries
runner = MultiTask()
for uid in user_ids:
    query = db.GqlQuery("SELECT __key__ FROM Account WHERE facebook_id = :1", uid)
    runner.append(QueryTask(query, limit=1, client_state=uid))

# kick off the work
runner.run()

# peel out the results
for task in runner:
    task_result = task.get_result() # will raise any exception that occurred for the given query
    print '%s: %s' % (task.client_state, task_result[0])

Here, we’re making multiple parallel queries, though the mechanism allows for mixing around different API calls.

As an aside, this can yield some pretty scary/awesome looking values in the logs files:

Three seconds can chew up a lot of API time!

Three seconds can chew up a lot of API time!

I’m excited to announce that we’ve bundled up this code into a package and released it to the wild on Google Code as asynctools. So far, we’ve wrapped Query (including GqlQuery) and UrlFetch. Others will come as we have time/motivation. Note that you can combine these together and execute arbitrary combinations of them in parallel.

Send us any feedback, comments, encouragement you have.



8 Responses to “asynctools – A Tale of Two Queries”  

  1. awesome work! we’re all really excited to see this open sourced and available for everyone to use. congratulations on the launch!

  2. 2 Jason Collins

    Thanks Ryan! It’s already been a really important tool in the toolkit, though admittedly, if you get lazy, you’re tempted to use it in places where an alternate Entity Model would yield a better result. It’s sort of like a big cannon that you should aim and shoot carefully and sparingly.

  3. 3 Tim

    This is awesome ! Really smart approach to one of the issues we face too often. Thumbs up for making it opensource. Thanks !

  4. Keep in mind the api_cpu_ms posted is a very simple calculation for Google:

    Retrieve given id/key name: 12 ms * results
    Query: 17.5 ms + 13.5 ms * results
    Inserts/Updates/Deletes: ~3000 ms + ~350 ms * entities

    In other words, do requests in parallel, sure, but be careful to issue inserts/updates/deletes in batches.

  5. Thanks for making this public!! Really looking forward to trying it out today.

  6. Wow, this is great. I can’t wait to try it out in http://code.google.com/p/geomodel . This seems like it can drastically improve performance of IN queries, which I rely on heavily.


  1. 1 VendAsta wows the Google App Engine world with asynctools. « The VendAsta Blog
  2. 2 Google: App Engine on Parallel Bars Goes for Gold : Beyond Search

Leave a Reply