Firetweets : the making of.

February 23, 2010

Is there a Twitter aux. service that takes any search term like “node.js” and creates list of hottest & recent links tweeted for that term?

This tweet is where the idea for Firetweets originates. Twenty minutes (and a dinner) later, the first version was online. While the look of the site hasn’t changed much, the code that powers it went through several iterations. I thought it’d be of some interest to share not the actual code, but rather the thinking behind it. It’s no rocket science, but identifying shortcomings and being able to solve them is a favorite activity of mine :)

1st try

The first iteration was extremely simple. It consisted of making a call to the Twitter API asking for all tweets containing a link and our query (node.js).

For each result:

Even though it worked as expected, this is a rather naive approach. Here are a few reasons why:

2nd try

Learning from the shortcomings of the first iteration, some things needed to change:

My first thought was to use the APIs that the URL shortener services provide. Even though a handful of them dominate the market, we’re dealing with a much larger number of services, from bit.ly to ff.im and tinyurl. Not all of them have an API, and those that do, unfortunately, didn’t think it’d be a good idea to use a standard format. Atom or RSS, anyone?

The good news is that, despite the lack of a standard JSON/XML stream, those services aren’t really disparate. They all do the same thing: provide a short link that redirects to the final URL. And this is perfect for our use case: let’s just follow the short URL, which will ultimately redirect us to the page we’re looking for.

Since our app is running on top of Google App Engine, we can use the urlfetch module that is bundled with the SDK. The result of fetching a URL with urlfetch has a final_url property, which is exactly what we’re looking for: its value is the actual URL whose request returned this response. Sweet! While we’re at it, let’s grab the content of the page and extract its title with a very basic regexp. We’re almost there: we just need to clean the URL by removing tracking and other parameters, and we’ll have our unique identifier.
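The title extraction and URL cleaning steps can be sketched in a few lines. This is a minimal illustration written with the Python 3 standard library, not the code that runs on the site. The function names, the regexp, and the list of tracking parameters (only `utm_*` here) are assumptions; the post doesn’t say which parameters were actually stripped.

```python
import re
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of tracking-parameter prefixes to drop.
TRACKING_PREFIXES = ("utm_",)

# "Very basic" on purpose: grabs whatever sits inside the first <title> element.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace so multi-line titles become a single line.
    return " ".join(match.group(1).split())


def clean_url(url):
    """Drop tracking parameters and the fragment so the same page yields one key."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )
```

In the app, the input to `extract_title` would be the content of the urlfetch response and the input to `clean_url` would be its final_url. A regexp this simple will miss edge cases (HTML entities, odd encodings), which is fine for a prototype.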

To keep things simple, when fetching the data from the Twitter API, we’re only asking for the latest 50 results that contain both a link and our query. Even though we keep track of the last fetched tweet_id for subsequent calls, in the worst case we’ll still have 50 short links to resolve.
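Here is a rough sketch of how such a request could be built and how the last tweet_id could be carried over between calls. The endpoint and the parameter names (`q`, `rpp`, `since_id`, and the `filter:links` operator) are written from memory of the Twitter Search API of that era and should be treated as assumptions; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the (since retired) Twitter Search API.
SEARCH_ENDPOINT = "http://search.twitter.com/search.json"


def build_search_url(query, since_id=None, per_page=50):
    """Build a search URL for tweets matching the query that also contain a link."""
    params = [("q", query + " filter:links"), ("rpp", per_page)]
    if since_id is not None:
        # Only ask for tweets newer than the last one we have already seen.
        params.append(("since_id", since_id))
    return SEARCH_ENDPOINT + "?" + urlencode(params)


def newest_tweet_id(results, previous=None):
    """Return the highest tweet id seen so far, to use as since_id next time."""
    ids = [result["id"] for result in results]
    if previous is not None:
        ids.append(previous)
    return max(ids) if ids else None
```

Passing the value returned by `newest_tweet_id` as `since_id` on the next run keeps each call from re-fetching tweets that were already processed.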

Since we want to regularly retrieve new links from the Twitter API, we’ll use scheduled tasks, also called crons. On App Engine, crons invoke URLs, which means that the steps mentioned above happen within the lifespan of a request. That’s far from ideal; it’s even a terrible idea. Why, you ask? Because we’re dealing with third parties here, up to 50 of them. If just one of them is terribly slow or down, the request could time out before we’ve finished processing all our input. And this, we don’t want. But that’s easy, you say: let’s make asynchronous requests, that’ll be much faster, and we won’t time out! Or not. Since we’ll process requests in parallel, we’ll definitely get a speed boost in indexing all our URLs. But we’re still bound to the slowest request, which means we have no guarantee that this single call won’t make the cron request time out. Damn! We need to find a way to move these operations out of the initial request processing and execute them in the background. Fear not, my friend: the App Engine team has released a task queue API, which makes offline processing ridiculously easy. I’ve opted for the deferred library. Here’s an example:

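from google.appengine.ext import deferred

# Queue one background task per link, scheduled to run about 30 seconds from now.
for link in links:
  deferred.defer(tools.get_title, link, tag, _countdown=30)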

It can hardly be simpler. We’ll let App Engine take care of executing our tasks in the background. Our Links will automagically appear in the Datastore once the final URL and the title of the page have been retrieved, and the count incremented.
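To make the "count incremented" part concrete, here is a small in-memory stand-in for what each background task ends up doing once it has a cleaned URL and a title. It is only a sketch: the `LinkStore` class and its fields are hypothetical, and a plain dict replaces the Datastore so the example runs anywhere.

```python
class LinkStore:
    """In-memory stand-in for the Datastore, keyed by (tag, cleaned URL)."""

    def __init__(self):
        self.links = {}

    def record(self, tag, url, title):
        """Create the link on first sight, then bump its counter."""
        key = (tag, url)
        link = self.links.get(key)
        if link is None:
            link = {"url": url, "title": title, "count": 0}
            self.links[key] = link
        link["count"] += 1
        return link

    def hottest(self, tag, limit=10):
        """Return the most tweeted links for a tag, highest count first."""
        matches = [link for (t, _), link in self.links.items() if t == tag]
        return sorted(matches, key=lambda link: link["count"], reverse=True)[:limit]


store = LinkStore()
store.record("node.js", "http://example.com/a", "Post A")
store.record("node.js", "http://example.com/a", "Post A")
store.record("node.js", "http://example.com/b", "Post B")
top = store.hottest("node.js")
```

Because many tasks may run at the same time, on the real Datastore the read-increment-write step would most likely need to happen inside a transaction; the dict version above ignores concurrency entirely.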

3rd try (in progress)

You’ve probably noticed that keeping track of hot links by just incrementing a counter of an object won’t get us very far. We’re not logging enough information to be able to extract stats, patterns and such. That’s what I’m working on right now, when not procrastinating on Twitter / Delicious ;)

Conclusion

It’s only a matter of minutes to have a prototype running on App Engine. But there’s a gap between a prototype and a useful, lasting service, and it sure helps to think it through first. Unless you love data migration, of course ;)

About

Hi, I’m Tim. I’m a Software Engineer at Formspring.me. You can read more about me or follow @pims on Twitter or ask me almost anything on Formspring.me