Sunburnt: a python-solr interface
Over the last few months, we’ve been hard at work behind the scenes at Timetric, and a few of the results are now to be seen on the website. If you’ve been paying close attention, you might have noticed the appearance of machine tags, and of the ability to search series by value.
These are both reflections of one of the biggest changes we’ve made – we’ve entirely replaced the search infrastructure the site runs on. We’re now backed by Apache Solr, and we’ve written a new Python-Solr interface, called sunburnt.
We let users search using both free-text search and drill-down tagging. We used to run these on a combination of Postgres-backed full-text search and django-tagging, but this combination wasn’t particularly satisfactory. Unsurprisingly, when you’re trying to add search infrastructure to a site, what you really want is a proper search-engine backend.
For a mature, full-featured, well-supported open-source search engine, the choice boils down to Solr or Xapian. We were strongly tempted by the latter (there’s no shortage of Xapian expertise around Cambridge), but we were swayed by the Apache licensing of Solr over Xapian’s GPL.
And although there are a number of existing Python-Solr interfaces, none of them did what we wanted, which was to provide an intelligent and robust Pythonic API, which lets you pass arbitrary objects in and out of Solr. So we built our own, and called it sunburnt.
Sunburnt is most directly comparable to Haystack, but with a couple of major differences. Firstly, it’s not restricted to Django model data, and secondly, it’s schema-driven rather than schema-generating — it lets you construct your own Solr schema, and automatically derives all the type-checking/conversion/coercion code necessary to map your objects to and from the Solr index, when constructing queries and exchanging data.
The only documentation at the moment is the examples below, but the code is all up on Github. Patches and contributions are more than welcome!
Sunburnt in use
To start indexing & querying, you initialize a SolrInterface with your schema. At the moment, you need to do this by passing in the schema xml — sunburnt won’t query the Solr server for its schema.
solr_interface = sunburnt.SolrInterface("http://localhost:8983", "schema.xml")
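To give a flavour of what "schema-driven" means here, the sketch below reads a small Solr-style schema and derives a converter for each field. It is only an illustration, not sunburnt's actual code: the sample schema, the CONVERTERS table and the field_converters helper are all assumptions made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def format_date(value):
    """Render datetimes in the ISO form Solr date fields expect."""
    if isinstance(value, datetime):
        return value.strftime("%Y-%m-%dT%H:%M:%SZ")
    return str(value)


# Hypothetical mapping from Solr field classes to Python converters.
CONVERTERS = {
    "solr.StrField": str,
    "solr.TextField": str,
    "solr.IntField": int,
    "solr.FloatField": float,
    "solr.DateField": format_date,
}

# A made-up schema, just big enough for the example.
SAMPLE_SCHEMA = """
<schema name="example" version="1.1">
  <types>
    <fieldType name="text" class="solr.TextField"/>
    <fieldType name="date" class="solr.DateField"/>
  </types>
  <fields>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="contents" type="text" indexed="true" stored="true"/>
    <field name="last_modified" type="date" indexed="true" stored="true"/>
  </fields>
</schema>
"""


def field_converters(schema_xml):
    """Map each field name in the schema to a Python conversion function."""
    root = ET.fromstring(schema_xml)
    # First pass: which Solr class backs each named field type?
    type_classes = {ft.get("name"): ft.get("class") for ft in root.iter("fieldType")}
    converters = {}
    for field in root.iter("field"):
        solr_class = type_classes[field.get("type")]
        # Unknown classes fall back to plain string conversion.
        converters[field.get("name")] = CONVERTERS.get(solr_class, str)
    return converters


converters = field_converters(SAMPLE_SCHEMA)
print(converters["last_modified"](datetime(2009, 1, 1)))
```

The point of the design is that the schema stays the single source of truth: add a field to the XML and the conversion code follows without further changes.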
To index objects, add() them to the interface. sunburnt doesn’t care what form the data comes in, so long as
- if it looks like an object, it has attributes named according to fields defined in the schema
- if it looks like a dictionary, it has keys named according to fields defined in the schema
class Document(object):
    def __init__(self, title, contents):
        self.title = title
        self.contents = contents

documents = [
    {"title": "This is a dictionary", "contents": "Lorem ipsum"},
    Document("This is an object", "dolor"),
]
solr_interface.add(documents)
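The "looks like an object" versus "looks like a dictionary" rule can be sketched in a few lines. This is a guess at the general idea rather than sunburnt's real implementation; extract_fields is a hypothetical helper, and the field list stands in for whatever the schema defines.

```python
def extract_fields(doc, field_names):
    """Return a {field: value} dict from either a mapping or a plain object."""
    if hasattr(doc, "items"):
        # Looks like a dictionary: read keys.
        return {name: doc[name] for name in field_names if name in doc}
    # Otherwise treat it as an object: read attributes.
    return {name: getattr(doc, name) for name in field_names if hasattr(doc, name)}


class Document(object):
    def __init__(self, title, contents):
        self.title = title
        self.contents = contents


schema_fields = ["title", "contents"]  # assumed to come from the schema
documents = [
    {"title": "This is a dictionary", "contents": "Lorem ipsum"},
    Document("This is an object", "dolor"),
]
rows = [extract_fields(doc, schema_fields) for doc in documents]
print(rows)
```

Either way, the result is the same flat mapping of schema fields to values, which is what ends up being sent to the index.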
If you haven’t set your Solr instance up to do autocommit, then you might want to commit your documents to the index:
solr_interface.commit()
after which the documents are searchable. The API is fairly close to that offered by Haystack (and indeed Django’s QuerySet) – unsurprisingly, since they’re solving similar problems.
solr_interface.query("This")
does what you might expect, searching on the default field.
solr_interface.query("This").filter("dictionary")
while chaining with filter() allows you to choose which parts of your queries are cached by Solr.
For fields representing numbers or dates, searching by range is useful, for example
solr_interface.query("This", last_modified__gt="2009-01-01")
if you have a last_modified field in your schema. Queries can be faceted – if you had tags on your objects, you might do this:
solr_interface.query("This").facet_by("tags", limit=20, mincount=1)
and if you wanted to search for similar documents, you can do a more-like-this query (in this case, looking for similarity in the tags field)
solr_interface.query("This").mlt("tags")
sunburnt doesn’t support all of the Solr API, but it gives you access to a goodly portion, and all of these operations are chainable.
solr_interface.query("This").filter("dictionary").\
    facet_by("tags", limit=20, mincount=1).\
    query(last_modified__gt="2009-01-01").paginate(rows=10)
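For the curious, here is a minimal sketch of how a chainable query object of this kind might be built: each call returns a modified copy, and the final object is flattened into Solr request parameters (q, fq, facet.field, rows and so on). The Query class, the clause helper and the choice to join terms with AND are assumptions for illustration, not sunburnt's internals. The date is written as a full ISO timestamp because that is the form Solr date ranges take.

```python
import copy

# Django-style suffixes mapped onto Solr range syntax
# ({} is exclusive, [] is inclusive, * is open-ended).
RANGE_OPS = {
    "gt": "{%s TO *}",
    "gte": "[%s TO *]",
    "lt": "{* TO %s}",
    "lte": "[* TO %s]",
}


def clause(lookup, value):
    """Turn last_modified__gt='x' into 'last_modified:{x TO *}'."""
    field, _, op = lookup.partition("__")
    if not op:
        return "%s:%s" % (field, value)
    return "%s:%s" % (field, RANGE_OPS[op] % value)


class Query(object):
    def __init__(self):
        self.q = []        # main query clauses
        self.fq = []       # filter clauses, cached separately by Solr
        self.params = {}   # faceting, pagination, etc.

    def _extended(self, attr, terms, lookups):
        # Copy-on-write: the original query is never mutated.
        new = copy.deepcopy(self)
        clauses = list(terms) + [clause(k, v) for k, v in sorted(lookups.items())]
        getattr(new, attr).extend(clauses)
        return new

    def query(self, *terms, **lookups):
        return self._extended("q", terms, lookups)

    def filter(self, *terms, **lookups):
        return self._extended("fq", terms, lookups)

    def facet_by(self, field, limit=None, mincount=None):
        new = copy.deepcopy(self)
        new.params.update({"facet": "true", "facet.field": field})
        if limit is not None:
            new.params["facet.limit"] = limit
        if mincount is not None:
            new.params["facet.mincount"] = mincount
        return new

    def paginate(self, start=0, rows=10):
        new = copy.deepcopy(self)
        new.params.update({"start": start, "rows": rows})
        return new

    def to_params(self):
        params = dict(self.params)
        params["q"] = " AND ".join(self.q) or "*:*"
        if self.fq:
            params["fq"] = list(self.fq)
        return params


chained = (Query().query("This").filter("dictionary")
           .facet_by("tags", limit=20, mincount=1)
           .query(last_modified__gt="2009-01-01T00:00:00Z")
           .paginate(rows=10))
print(chained.to_params())
```

Returning a copy from every method is what makes it safe to keep a base query around and branch off several variations from it.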
Finally, having set up a query, you can get a result object back by execute()ing the query. This bit of the API is still a bit rough around the edges, but
r = solr_interface.query(...).execute()
r.result            # has the main query results
r.facet_counts      # has the faceting results
r.more_like_these   # has any more_like_these results
and if you poke around in that object, it has all the rest of the information that Solr provides.
Sunburnt in practice
The code is now running live on Timetric, and is problem-free for us. We’ve been able to throw away scads of code working around djangosearch/django-tagging shortcomings, and performance is significantly faster all round, especially for anything regarding tagging. Most usefully though, it’s provided us with a platform to start experimenting with new navigation features much more rapidly.
Postscript: djangosearch/django-tagging shortcomings
djangosearch, though easy to set up (if you’re already using postgres), offers very little in the way of control over various parameters and options you might want to tune, and requires filtering/escaping of some queries. (Searching for a string with “£” in it causes interesting errors!)
django-tagging had slightly more serious issues:
- we had to maintain our own fork of the codebase due to a couple of long-standing issues; corner cases which upstream weren’t interested in fixing (One that kept biting us was its lack of support for models with non-integer primary keys. Easily fixed, but support is still not included in upstream django-tagging).
- extending its functionality turned out to involve writing a lot of hand-tuned and often inherently slow SQL. Writing related-tags functionality was particularly painful – it involves inverting the index, which is very time-consuming – we had to do that offline.
OAuth 1.0a and autodiscovery
OAuth 1.0a
As of last Monday, we’ve upgraded timetric.com to cope with the OAuth 1.0a workflow.
If you follow these things, you can hardly have avoided noticing that there was a big fuss in April this year, when a vulnerability in the OAuth protocol was discovered. When it was made public, it turned out to be less a technical than a social vulnerability. The OAuth workflow involves several transactions, and the exchange of multiple tokens. In version 1.0, there was an opportunity for a malicious third party to step into the exchange, and by tricking the end user (essentially, by phishing them into clicking a link) gain their credentials.
Still, social vulnerabilities are as important as technical ones, and the OAuth team rapidly developed version 1.0a of the workflow which avoids the problem. In the interim, and since upgrading existing servers and clients is hard work, and since the issue can be mitigated by anti-phishing provisions, it’s been standard practice to carry on supporting the 1.0 workflow, while attaching big warnings everywhere, and that’s what we’ve been doing.
However, we’ve now finished implementing a 1.0a-compliant server for timetric, so that 1.0a-capable clients can take advantage of the improved workflow. But because most clients don’t yet support 1.0a, our server currently supports both 1.0 and 1.0a transactions. Doing this has involved borrowing from, and extending, both python-oauth and django-oauth. We’ve fed our changes back to the upstream authors of both projects, we’ve tested our codebase, and it’s now running live on timetric.
(As of writing, our changes haven’t yet made it into the upstream version of django-oauth, but you can get a hold of what we’ve done from our fork on github.)
We’ll continue to support the 1.0 workflow for the immediately-foreseeable future, but obviously at some point we’ll want to retire it in favour of the more secure 1.0a. For those of you who’ve written OAuth clients, I can highly recommend this blog post as a very nice overview of the changes in the workflow, and what you need to do to 1.0a-enable a client.
OAuth autodiscovery
While I was poking around with the OAuth code, I also managed to address a niggle I’ve had for a long time with OAuth. OAuth uses three separate URL endpoints to manage the token request/exchange process. These need to be published somewhere, and then any OAuth clients need to know these service-specific URLs. This is annoying: practically, because it makes it hard to write a generic OAuth client framework, and on principle, because it offends the RESTian purist in me – resources should be machine-discoverable, dammit.
Anyway, it turns out that there is an experimental OAuth auto-discovery spec, which piggybacks off the XRDS resource discovery scheme. It’s not final, and it seems there’s not a lot of active development on it, but I thought I’d try it out anyway. Having implemented it as an experiment, I’m actually quite happy with it. All timetric OAuth resources are now completely auto-discoverable, knowing nothing but the xrds mimetype.
The workflow goes like this: firstly, ask for the location of the XRDS resource description, by using content-negotiation on whatever OAuth-protected resource you’re trying to gain access to:
Request:
GET /resource-of-interest
Accept: application/xrds+xml
Response:
HTTP/1.1 302 Found
X-XRDS-Location: /xrds.xrds
Location: http://timetric.com/xrds.xrds
[...]
then follow the redirect;
Request:
GET /xrds.xrds
Response:
HTTP/1.1 200 OK
Content-Type: application/xrds+xml
[...]
<?xml version="1.0" encoding="UTF-8"?>
<XRDS xmlns="xri://$xrds">
[...]
The XRDS file has a well-defined XML format, and the client can parse it to pull out the location of the OAuth endpoints. This means that you can write a very generic OAuth client library; all the library needs to be told is the location of any interesting oauth-protected resources, and now it can find out everything it needs to negotiate the OAuth workflow. Helpfully, you can also use the XRDS file to advertise which forms of OAuth negotiation you support – parameters in the URI, or as HTTP headers, different signature schemes, and so on.
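As a rough sketch of the client side, the code below resolves the X-XRDS-Location header and then pulls the three OAuth endpoints out of an XRDS document. The sample document and its example.com URLs are invented for illustration, and the Type URIs and XRD namespace are as I recall them from the draft discovery spec, so treat them as assumptions and check them against the spec before relying on them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XRD_NS = "xri://$xrd*($v*2.0)"
SERVICE_TAG = "{" + XRD_NS + "}Service"
TYPE_TAG = "{" + XRD_NS + "}Type"
URI_TAG = "{" + XRD_NS + "}URI"

# Service types assumed from the draft OAuth discovery spec.
ENDPOINT_TYPES = {
    "http://oauth.net/core/1.0/endpoint/request": "request_token",
    "http://oauth.net/core/1.0/endpoint/authorize": "authorize",
    "http://oauth.net/core/1.0/endpoint/access": "access_token",
}

# A hypothetical XRDS document; the URLs are placeholders.
SAMPLE_XRDS = b"""<?xml version="1.0" encoding="UTF-8"?>
<XRDS xmlns="xri://$xrds">
  <XRD xmlns="xri://$xrd*($v*2.0)" version="2.0">
    <Service>
      <Type>http://oauth.net/core/1.0/endpoint/request</Type>
      <Type>http://oauth.net/core/1.0/signature/HMAC-SHA1</Type>
      <URI>https://example.com/oauth/request_token</URI>
    </Service>
    <Service>
      <Type>http://oauth.net/core/1.0/endpoint/authorize</Type>
      <URI>https://example.com/oauth/authorize</URI>
    </Service>
    <Service>
      <Type>http://oauth.net/core/1.0/endpoint/access</Type>
      <URI>https://example.com/oauth/access_token</URI>
    </Service>
  </XRD>
</XRDS>
"""


def xrds_location(resource_url, headers):
    """Resolve the X-XRDS-Location header relative to the resource URL."""
    return urljoin(resource_url, headers["X-XRDS-Location"])


def oauth_endpoints(xrds_bytes):
    """Return {endpoint_name: uri} for the OAuth services in an XRDS document."""
    root = ET.fromstring(xrds_bytes)
    endpoints = {}
    for service in root.iter():
        if service.tag != SERVICE_TAG:
            continue
        types = [c.text.strip() for c in service if c.tag == TYPE_TAG and c.text]
        uris = [c.text.strip() for c in service if c.tag == URI_TAG and c.text]
        for service_type in types:
            if service_type in ENDPOINT_TYPES and uris:
                endpoints[ENDPOINT_TYPES[service_type]] = uris[0]
    return endpoints


location = xrds_location("http://timetric.com/resource-of-interest",
                         {"X-XRDS-Location": "/xrds.xrds"})
endpoints = oauth_endpoints(SAMPLE_XRDS)
print(location)
print(endpoints)
```

A real client would fetch the resource with an Accept: application/xrds+xml header and feed the response body into oauth_endpoints; the network calls are left out here so the sketch runs on its own.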
I’ve had an immediate benefit, because now it’s made my test framework much simpler – I don’t need to store arbitrary strings denoting my OAuth URLs, nor manipulate them every time I run tests against differently-named test servers. Since the spec isn’t final, this scheme is obviously liable to change – but it’s a nice example of how to make a service machine-discoverable.
API improvements
As Dan alluded to yesterday, this week we made a new API release.
Previously, our API basically only let you add and retrieve data. This has been useful to a whole lot of people, but there’s much more that you can do with Timetric.
The new release involves several features. There’s a bit of improvement to existing functionality to make life a bit easier when uploading data; but more excitingly, we’ve opened up access to even more of the capabilities of the timetric platform.
Search endpoints
When building applications on top of Timetric, one of the things we’ve been asked for is the ability to retrieve lists of relevant data. This might simply be to get hold of all of your own series, or it might be a list of tagged series, or it might be a complex search query.
For all of these, we’ve exposed search endpoints that let you do powerful queries across our data. You can search through the full text of our titles and descriptions, over tags, and by user. This means you can build much more useful interactive interfaces on top of Timetric. In fact, these are exactly the same endpoints that timetric.com uses internally when you browse our data.
Calculated series
Through the timetric.com website, you’ve always had the ability to build model calculations, and to filter series. We’ve now exposed this at the API level as well, so you can build these models and filters programmatically.
Cross-domain requests
If you’re a web developer, you’ll be all too familiar with the headaches of restrictions on cross-domain requests. In many cases, there are perfectly good security-related reasons for them, but these restrictions make writing some web applications much harder than it ought to be.
Fortunately, the newest generation of browsers (Firefox 3.5, IE8, and Safari 4) let you make secure cross-domain requests directly — so long as the server supports it (see https://developer.mozilla.org/en/HTTP_access_control). Since this is such a useful feature — for us as much as anyone else – we’ve enabled it so you can use it too, and build much more exciting Timetric mashups in modern browsers.
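On the server side, "supporting it" mostly comes down to sending the right response header. As a minimal sketch (not a description of how timetric.com does it), here is a WSGI wrapper that adds Access-Control-Allow-Origin to every response. The allow_cross_origin and hello_app names are made up for the example, and the wildcard origin is only a demonstration default.

```python
def allow_cross_origin(app, allowed_origin="*"):
    """Wrap a WSGI app so every response carries Access-Control-Allow-Origin."""
    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            # Append the header without mutating the app's own list.
            headers = list(headers) + [("Access-Control-Allow-Origin", allowed_origin)]
            return start_response(status, headers, exc_info)
        return app(environ, cors_start_response)
    return wrapped


def hello_app(environ, start_response):
    """A trivial WSGI app standing in for a real API endpoint."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


# Exercise the wrapper without a real server.
captured = {}


def fake_start_response(status, headers, exc_info=None):
    captured["status"] = status
    captured["headers"] = headers


body = allow_cross_origin(hello_app)({}, fake_start_response)
print(captured["headers"])
```

In practice you would probably restrict allowed_origin, and requests that are not "simple" also need preflight handling, which this sketch leaves out.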
Easier uploading
And finally, we had feedback from several people about ways in which we could make pushing data into the platform through the API a bit easier. The details are probably uninteresting unless you like constructing HTTP messages yourself (which I do, but it’s not everyone’s cup of tea!) so I’ll simply point you at the new documentation. In short, you can POST data directly, rather than having to multipart-encode it.
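To show the difference in general terms, the sketch below builds the same upload twice, once as a direct POST body and once multipart-encoded, without sending either. The URL, the CSV payload, the content type and the form field name are all placeholders invented for the example; the real endpoints and formats are in the API documentation.

```python
from urllib.request import Request

UPLOAD_URL = "https://example.com/hypothetical/series/"  # placeholder, not a real endpoint
CSV_BODY = b"2009-01-01,1.5\n2009-02-01,1.7\n"           # made-up data


def direct_post(url, payload, content_type="text/csv"):
    """The payload itself is the request body."""
    return Request(url, data=payload,
                   headers={"Content-Type": content_type}, method="POST")


def multipart_post(url, payload, field="file", boundary="sketch-boundary"):
    """The same payload wrapped in a multipart/form-data envelope."""
    mark = boundary.encode("ascii")
    lines = [
        b"--" + mark,
        b'Content-Disposition: form-data; name="' + field.encode("ascii")
        + b'"; filename="data.csv"',
        b"Content-Type: text/csv",
        b"",
        payload,
        b"--" + mark + b"--",
        b"",
    ]
    return Request(url, data=b"\r\n".join(lines),
                   headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
                   method="POST")


direct = direct_post(UPLOAD_URL, CSV_BODY)
multi = multipart_post(UPLOAD_URL, CSV_BODY)
print(len(direct.data), len(multi.data))
```

The direct form is simply less to get wrong: no boundary strings, no per-part headers, and the body is exactly the data you meant to send.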
So …
If you’re a developer, get out there and play! We’re always happy to get any feedback – positive or negative!
Timetric’s New Logo
It’s been a busy month in Timetric Towers, so this post is waaay overdue, but I really want to highlight the excellent new logos designed for us by Kate Abbass (@kateabbass).

Upgrading the Timetric backend.
As you might have noticed, this week involved a flurry of activity at Timetric HQ – on Wednesday we pre-announced downtime for a database upgrade on Friday, but we ended up accelerating our schedule, and doing the upgrade on Thursday instead. We thought it might be useful to offer an explanation of what was going on!
The backstory is that Timetric runs on a pair of databases for its backend. One is a traditional RDBMS (Postgres, as it happens), which is used for storing all metadata; the other is a non-relational DB, used for storing all our timeseries data. It’s the latter which was at issue – we needed to make the fairly major change from using HBase to using Tokyo Cabinet/Tyrant.
When we first started building Timetric, we knew we wanted to use a non-relational DB for our timeseries data. There are a number of options out there (for a recent overview, see Bob Ippolito’s talk “Drop ACID and think about data” at PyCon 2009). We did some initial experimentation, and ended up going with HBase. It seemed like a good match – it has timestamped versioning for all its data, which seemed to fit data that’s inherently time-based; it’s a high-profile project, at the time just adopted by Apache; it’s got a lively, helpful developer community; the codebase seemed relatively robust, and was comprehensible (so much so that despite not really being a Java programmer, I was able to offer a few minor patches easily enough). And it has a very nice scaling story, up to billions of rows, which is nice to have in reserve!
Its major downside was that its performance wasn’t yet really ready for use behind an interactive website, but we were still in early alpha at the time. Performance improvements were high up the developer team’s wishlist, and we didn’t seem to be the only people interested in using it as a backing store for a web application, so we had hopes that things would improve. And indeed they did: current HBase performance is significantly better than it was 6 months ago.
Nevertheless, a few weeks ago, we decided that we couldn’t carry on working with HBase indefinitely; we took another look around, and made the choice to migrate to Tyrant.
There were a few reasons for this:
- HBase performance was improving, but it was becoming apparent that the loads we were placing on it were atypical; it was coping, but we were having to do some relatively baroque optimizations within the web application layer to get acceptable performance for large datasets.
- We’ve had people getting in touch about running private Timetric instances. For these, having one single, huge, database for holding all our data is less important – rather, we need multiple, per-instance databases. Scaling at the high-end is less important, and management at the low-end is more important; HBase isn’t trivial to manage – it’s got a lot of moving parts, and automating the management of large-scale deployments is a complex task (see, for example, SmartFrog).
- Most importantly, when you’re managing multiple instances, reliability becomes a far greater concern. Not that it’s more important, but there are more points of failure; managing that reliably is hard. HBase is a large system, with a relatively small user-base – while we’ve been in a position to deal with running it behind Timetric.com, we weren’t confident in our ability to do so behind multiple instances of the platform, for multiple customers.
Fundamentally, we need a much simpler, more easily manageable, and faster solution. Fortunately, Tyrant fulfils all of these criteria for us.
- It’s astonishingly fast – so much so that we’ve actually switched memcached off in a number of situations, because it’s quicker simply to get the data straight from the source. (This is partly due to the fact that Django’s interface to memcached requires you to use pickle, which is orders of magnitude slower than using simplejson for simple data structures).
- Setup is ridiculously simple. Honestly, why can’t all databases be this simple? Compilation is a bog-standard “./configure && make && make install”, there’s literally no configuration necessary to get a database going which is optimised for most common use cases, and there’s one command to start & stop it.
- Despite some reasonably hard pushing at it, we’ve had no hint of data corruption – or even transaction failures – against Tyrant. And the knowledge that it’s been heavily stress-tested elsewhere gives a nice warm fuzzy feeling.
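The serialization point in the first bullet is easy to measure for yourself. The sketch below times a pickle round-trip against a JSON round-trip on a made-up series; it uses the standard library json module in place of simplejson, and the data shape and repeat count are arbitrary assumptions. Relative timings depend heavily on the Python version and the data, so no expected numbers are given.

```python
import json
import pickle
import timeit

# A made-up "simple data structure": a list of [index, value] pairs.
DATA = {"series": [[i, i * 0.5] for i in range(1000)]}


def roundtrip_pickle(value):
    return pickle.loads(pickle.dumps(value))


def roundtrip_json(value):
    return json.loads(json.dumps(value))


def compare(value, repeats=200):
    """Return total seconds spent on each kind of round-trip."""
    return {
        "pickle": timeit.timeit(lambda: roundtrip_pickle(value), number=repeats),
        "json": timeit.timeit(lambda: roundtrip_json(value), number=repeats),
    }


timings = compare(DATA)
for name, seconds in sorted(timings.items()):
    print("%-6s %.4fs" % (name, seconds))
```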
So, over the last few weeks, we’ve been planning this changeover; writing the new backend interface, testing, bugfixing, load-testing, optimizing, etc., with the aim of making the changeover today.
However, yesterday morning, events overtook us. The server logs started showing worrying error messages – the sort you really don’t like to see – about missing data. On further investigation, it turned out that our HBase instance was dropping data on the floor, left, right and centre.
Fortunately, we keep good backups!! I’ll emphasise that before going any further. (I should also note that we still have no idea why the data corruption started happening – although clearly it was HBase dropping the data, I’m not laying all the blame on its head; we haven’t had time to find out what might have prompted the problem.)
We deliberated briefly about what to do – we knew we had backups, and could restore them to HBase – we’ve tested that process before, so we knew how long it would take, a matter of a few hours. On the other hand, we’d been planning to do the migration the next day anyway. The migration would take a bit longer than simply restoring the backups; and because we’d been forced into it, would take slightly longer than was originally planned. On the whole, though, we thought it was best to cut our losses and kickstart the migration immediately.
So, the result was what you saw yesterday! Stress levels were fairly high all day (not helped by one of our number being in self-imposed quarantine – Dan came down with a cold (not swine flu!) yesterday morning, so kept himself at home, though he was working hard online). Nevertheless, the migration went almost entirely smoothly; much better than the worst case scenarios I’d dreamt up, and as you saw we were up again before the day was out.
And the result is fairly impressive, I think. Currently, most of the improvements are either fairly well-hidden, or amount to support for additional features which we’ll be building out in the next short while. But the most noticeable gain is that the whole site is much snappier now – page load times, especially for large series, have dropped dramatically. Upload times are hugely improved as well, which is letting us do some quite exciting things.