Firetweets: the making of
Is there a Twitter aux. service that takes any search term like “node.js” and creates list of hottest & recent links tweeted for that term?
This tweet is where the idea for Firetweets originates. Twenty minutes and a dinner later, the first version was online. While the look of the site hasn’t changed much, the code that powers it has gone through several iterations. I thought it’d be of some interest to share not the actual code, but rather the thinking behind it. It’s no rocket science, but identifying shortcomings and solving them is a favorite activity of mine :)
1st try
The first iteration was extremely simple. It consisted of a single call to the Twitter Search API asking for all tweets containing both a link and our query (node.js).
For each result (see the sketch after this list):
- extract link from text
- query datastore for an object with an id equal to the link we’ve extracted
- if there is a match, increment its count by 1 and save
- if there isn’t, create a Link object, set its id to the URL found, set its count to 1 and save
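A minimal sketch of that loop on App Engine might look like this, assuming a bare-bones Link model keyed on the URL and a naive regexp for extracting links (both my assumptions, not the actual Firetweets code):

import re
from google.appengine.ext import db

class Link(db.Model):
    # the URL itself serves as the entity's key_name, i.e. its id
    count = db.IntegerProperty(default=0)

URL_RE = re.compile(r'https?://\S+')

def index_tweets(tweets):
    for tweet in tweets:
        for url in URL_RE.findall(tweet['text']):
            link = Link.get_by_key_name(url)
            if link is None:
                link = Link(key_name=url)
            link.count += 1
            link.put()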
Even though it worked as expected, this is a rather naive approach. Here are a few reasons why:
- URLs are shortened by different services, which means that two different URLs can point to the same resource. Duplicates would ruin the whole rank-by-popularity aspect.
- tracking parameters or preferences in URLs (?utm=twitter&campaign… or ?view=print) generate duplicates
- both previous points combined
- short URLs aren’t easy to remember
2nd try
Learning from the shortcomings of the first iteration, some things needed to change:
- short URLs must be resolved to avoid duplicates
- expanded URLs must be stripped of tracking and other polluting parameters.
- links need a context: title or tweet from which they have been extracted
My first thought was to use the APIs that URL shortening services provide. But even though a handful of them dominate the market, we’re dealing with a much larger number of services, from bit.ly to ff.im to tinyurl… Not all of them have an API, and those that do, unfortunately, didn’t think it’d be a good idea to agree on a standard format. Atom or RSS, anyone?
The good news is that, despite the lack of a standard JSON/XML stream, those services aren’t really disparate: they all do the same thing, namely provide a short link that redirects to the final URL. And this is perfect for our use case: let’s just follow the URL, which will ultimately redirect us to the page we’re looking for.
Since our app is running on top of Google App Engine, we can use the urlfetch module that is bundled with the SDK. The response returned by urlfetch has a final_url property, which is exactly what we’re looking for: its value is the actual URL whose request returned the response, i.e. the destination reached after following all redirects. Sweet! While we’re at it, let’s grab the content of the page and, with a very basic regexp, extract its title. We’re almost there: we just need to clean the URL by removing tracking and other parameters, and we’ll have our unique identifier.
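Put together, the resolution step could look like this rough sketch; the helper names and the list of parameters to strip are my assumptions, not the actual Firetweets code:

import re
import urlparse
from google.appengine.api import urlfetch

TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.I | re.S)
TRACKING_PREFIXES = ('utm', 'campaign', 'view')

def clean(url):
    # strip tracking and display parameters to get a canonical identifier
    parts = urlparse.urlsplit(url)
    kept = [p for p in parts.query.split('&')
            if p and not p.startswith(TRACKING_PREFIXES)]
    return urlparse.urlunsplit(
        (parts.scheme, parts.netloc, parts.path, '&'.join(kept), ''))

def resolve(short_url):
    response = urlfetch.fetch(short_url, follow_redirects=True, deadline=10)
    # final_url is only set when redirects were actually followed
    final_url = getattr(response, 'final_url', None) or short_url
    match = TITLE_RE.search(response.content)
    title = match.group(1).strip() if match else final_url
    return clean(final_url), title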
To keep things simple, when fetching data from the Twitter API, we only ask for the latest 50 results that contain both a link and our query. And even though we keep track of the last fetched tweet_id for subsequent calls, in the worst-case scenario we’ll have 50 short links to convert.
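For reference, here’s what that periodic fetch could look like against the Search API of the time; this is a sketch under my own assumptions, and persisting last_id between runs is left out:

import urllib
from google.appengine.api import urlfetch
try:
    import json                                    # Python 2.6+
except ImportError:
    from django.utils import simplejson as json   # older App Engine runtimes

def fetch_tweets(query, last_id=None):
    params = {'q': '%s filter:links' % query, 'rpp': 50}
    if last_id:
        params['since_id'] = last_id
    url = 'http://search.twitter.com/search.json?' + urllib.urlencode(params)
    response = urlfetch.fetch(url)
    return json.loads(response.content)['results']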
Since we want to regularly retrieve new links from the Twitter API, we’ll use scheduled tasks, also called crons. On App Engine, crons invoke URLs, which means that the steps mentioned above would all happen within the lifespan of a single request. And that’s far from ideal; it’s even a terrible idea. Why, you ask? Because we’re dealing with third parties here, up to 50 of them. If just one of them is terribly slow or down, the request could time out before we’ve finished processing all our input. And that, we don’t want.
But that’s easy, you say: let’s make asynchronous requests, that’ll be much faster, and we won’t time out! Or not. Since we’d process requests in parallel, we’d definitely get a speed bump in indexing all our URLs. But we’re still bound to the slowest request, which means we have no guarantee that this single call won’t make the cron request time out. Damn! We need a way to move these operations out of the initial request and execute them in the background. Fear not, my friend: the App Engine team has released a Task Queue API, which makes offline processing ridiculously easy. I’ve opted for the deferred library. Here’s an example:
from google.appengine.ext import deferred

for link in links:
    # run tools.get_title(link, tag) in the background, 30 seconds from now
    deferred.defer(tools.get_title, link, tag, _countdown=30)
It can hardly get simpler. We let App Engine take care of executing our tasks in the background, and our Links automagically appear in the Datastore once the final URL and the title of the page have been retrieved and the count incremented.
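The deferred task itself could then be as small as this hypothetical version of tools.get_title, reusing the resolve() helper and the Link model from the sketches above (extended with title and tag StringProperties, again my additions):

def get_title(short_url, tag):
    url, title = resolve(short_url)   # follow redirects, clean the URL
    link = Link.get_by_key_name(url)
    if link is None:
        link = Link(key_name=url, title=title, tag=tag)
    link.count += 1
    link.put()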
3rd try (in progress)
You’ve probably noticed that keeping track of hot links by just incrementing a counter on an object won’t get us very far. We’re not logging enough information to be able to extract stats, patterns and such. That’s what I’m working on right now, when not procrastinating on Twitter / Delicious ;)
Conclusion
It’s only a matter of minutes to get a prototype running on App Engine. But there’s a gap between a prototype and a useful, perennial service, and it sure helps to think things through first. Unless you love data migrations, of course ;)
References:
- Firetweets
- Background work with the deferred library by Nick Johnson
- The URL Fetch Python API
- Twitter Search API
- Bit.ly API
- Tweet from which the service was born by John Wright