Sunburnt: a python-solr interface

Over the last few months, we’ve been hard at work behind the scenes at Timetric, and a few of the results are now to be seen on the website. If you’ve been paying close attention, you might have noticed the appearance of machine tags, and of the ability to search series by value.

These are both reflections of one of the biggest changes we’ve made – we’ve entirely replaced the search infrastructure the site runs on. We’re now backed by Apache Solr, and we’ve written a new Python-Solr interface, called sunburnt.

We let users search using both free-text search and drill-down tagging. We used to run these on a combination of postgres-backed full-text search and django-tagging, but this combination wasn’t particularly satisfactory. Unsurprisingly, when you’re trying to add search infrastructure to a site, what you really want is a proper search-engine backend.

For a mature, full-featured, well-supported open-source search engine, the choice boils down to Solr or Xapian. We were strongly tempted by the latter — there’s no shortage of Xapian expertise around Cambridge, but we were swayed by the Apache licensing of Solr, rather than Xapian’s GPL.

And although there are a number of existing Python-Solr interfaces, none of them did what we wanted: an intelligent and robust Pythonic API that lets you pass arbitrary objects in and out of Solr. So we built our own, and called it sunburnt.

Sunburnt is most directly comparable to Haystack, but with a couple of major differences. Firstly, it’s not restricted to Django model data, and secondly, it’s schema-driven rather than schema-generating — it lets you construct your own Solr schema, and automatically derives all the type-checking/conversion/coercion code necessary to map your objects to and from the Solr index, when constructing queries and exchanging data.

The only documentation at the moment is the examples below, but the code is all up on Github. Patches and contributions are more than welcome!

Sunburnt in use

To start indexing & querying, you initialize a SolrInterface with your schema. At the moment, you need to do this by passing in the schema xml — sunburnt won’t query the Solr server for its schema.

import sunburnt
solr_interface = sunburnt.SolrInterface("http://localhost:8983", "schema.xml")

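To index objects, add() them to the interface. sunburnt doesn’t care what form the data comes in, so long as its keys or attributes match the fields in your schema; dictionaries and plain objects can be mixed freely: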

class Document(object):
    def __init__(self, title, contents):
        self.title = title
        self.contents = contents
 
documents = [
   {"title":"This is a dictionary", "contents":"Lorem ipsum"},
   Document("This is an object", "dolor")
]
 
solr_interface.add(documents)
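To make the "any form" idea concrete, here is a minimal sketch of how a dictionary and a plain object might be reduced to the same field mapping. It is not sunburnt's actual implementation: to_solr_doc and the hard-coded SCHEMA_FIELDS tuple are hypothetical stand-ins for what sunburnt derives from schema.xml.

```python
# Hypothetical field names; sunburnt would read these from schema.xml.
SCHEMA_FIELDS = ("title", "contents")


class Document(object):
    def __init__(self, title, contents):
        self.title = title
        self.contents = contents


def to_solr_doc(obj, fields=SCHEMA_FIELDS):
    """Return a dict of schema fields taken from a mapping or an object."""
    doc = {}
    for name in fields:
        if isinstance(obj, dict):
            # Dictionaries are read by key.
            if name in obj:
                doc[name] = obj[name]
        elif hasattr(obj, name):
            # Anything else is read by attribute.
            doc[name] = getattr(obj, name)
    return doc


docs = [
    to_solr_doc({"title": "This is a dictionary", "contents": "Lorem ipsum"}),
    to_solr_doc(Document("This is an object", "dolor")),
]
```

Both inputs end up as plain dictionaries keyed by field name, which is the shape an indexing request needs regardless of where the data came from.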

If you haven’t set your Solr instance up to do autocommit, then you might want to commit your documents to the index:

solr_interface.commit()

after which the documents are searchable. The API is fairly close to that offered by Haystack (and indeed Django’s QuerySet) – unsurprisingly, since they’re solving similar problems.

solr_interface.query("This")

does what you might expect, searching on the default field.

solr_interface.query("This").filter("dictionary")

while chaining with filter() allows you to choose which parts of your queries are cached by Solr.
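In Solr's HTTP API, this distinction corresponds to the q parameter (the main, scored query) and the fq parameter (filter queries, which Solr caches separately). The sketch below shows the request parameters such a chain might produce, assuming query() maps to q and filter() maps to fq; build_params is a hypothetical helper, not part of sunburnt.

```python
from urllib.parse import urlencode


def build_params(query, filters=()):
    """Build Solr select parameters: one q plus one fq per filter."""
    params = [("q", query)]
    params.extend(("fq", f) for f in filters)
    return urlencode(params)


print(build_params("This", ["dictionary"]))
```

Putting a frequently reused restriction into a filter means Solr can reuse the cached result for it while the main query varies.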

For fields representing numbers or dates, searching by range is useful, for example

solr_interface.query("This", last_modified__gt="2009-01-01")

if you have a last_modified field in your schema. Queries can be faceted – if you had tags on your objects, you might do this:

solr_interface.query("This").facet_by("tags", limit=20, mincount=1)

and if you want to search for similar documents, you can do a more-like-this query (in this case, looking for similarity in the tags field)

solr_interface.query("This").mlt("tags")

sunburnt doesn’t support all of the Solr API, but it gives you access to a goodly portion, and all of these operations are chainable.

solr_interface.query("This").filter("dictionary").\
    facet_by("tags", limit=20, mincount=1).\
    query(last_modified__gt="2009-01-01").paginate(rows=10)
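The last_modified__gt keyword used above is a Django-style lookup; in Solr's own query syntax a range is written with brackets (inclusive) or braces (exclusive). As a rough sketch of the translation involved, and not sunburnt's real code, the hypothetical range_clause helper below turns such a lookup into a Solr range clause. It assumes the value is already formatted the way Solr expects (for dates, a full ISO 8601 timestamp).

```python
# Hypothetical mapping from Django-style suffixes to Solr range templates.
RANGE_TEMPLATES = {
    "gt": "{field}:{{{value} TO *}}",
    "gte": "{field}:[{value} TO *]",
    "lt": "{field}:{{* TO {value}}}",
    "lte": "{field}:[* TO {value}]",
}


def range_clause(lookup, value):
    """Turn e.g. ("last_modified__gt", v) into a Solr range clause."""
    field, _, op = lookup.rpartition("__")
    return RANGE_TEMPLATES[op].format(field=field, value=value)


print(range_clause("last_modified__gt", "2009-01-01T00:00:00Z"))
```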

Finally, having set up a query, you can get a result object back by execute()ing the query. This bit of the API is still a bit rough around the edges, but

r = solr_interface.query(...).execute()
r.result # has the main query results
r.facet_counts # has the faceting results
r.more_like_these # has any more_like_these results

and if you poke around in that object, it has all the rest of the information that Solr provides.

Sunburnt in practice

The code is now running live on Timetric, and is problem-free for us. We’ve been able to throw away scads of code working around djangosearch/django-tagging shortcomings, and performance is significantly faster all round, especially for anything regarding tagging. Most usefully though, it’s provided us with a platform to start experimenting with new navigation features much more rapidly.

Postscript: djangosearch/django-tagging shortcomings

djangosearch, though easy to set up (if you’re already using postgres), offers very little in the way of control over various parameters and options you might want to tune, and requires filtering/escaping of some queries. (Searching for a string with “£” in it causes interesting errors!)

django-tagging had slightly more serious issues;

Government 2010

Good morning! Andrew here.

At Timetric, we’re on a mission to get the world’s statistics to you in a form which you can use. A lot of those numbers start their lives in local and national governments, so part of the job is talking, and working, with people in government.

Today is the Government 2010 conference in London, and I’m going to be there. What’s more, I’m going to be covering it live all day on Twitter at @walkingshaw and over on Davepress. It’ll be much easier with your help, though! Have a look at the agenda, and if you’ve got anything you’d like me to cover in more detail or to relay to the conference, please get in touch.

OAuth 1.0a and autodiscovery

OAuth 1.0a

As of last Monday, we’ve upgraded timetric.com to cope with the OAuth 1.0a workflow.

If you follow these things, you can hardly have avoided noticing that there was a big fuss in April this year, when a vulnerability in the OAuth protocol was discovered. When it was made public, it turned out to be less a technical than a social vulnerability. The OAuth workflow involves several transactions, and the exchange of multiple tokens. In version 1.0, there was an opportunity for a malicious third party to step into the exchange, and by tricking the end user (essentially, by phishing them into clicking a link) gain their credentials.

Still, social vulnerabilities are as important as technical ones, and the OAuth team rapidly developed version 1.0a of the workflow which avoids the problem. In the interim, and since upgrading existing servers and clients is hard work, and since the issue can be mitigated by anti-phishing provisions, it’s been standard practice to carry on supporting the 1.0 workflow, while attaching big warnings everywhere, and that’s what we’ve been doing.

However, we’ve now finished implementing a 1.0a-compliant server for timetric, so that 1.0a-capable clients can take advantage of the improved workflow. But because most clients don’t yet support 1.0a, our server currently supports both 1.0 and 1.0a transactions. Doing this has involved borrowing from, and extending, both python-oauth and django-oauth. We’ve fed our changes back to the upstream authors of both projects, we’ve tested our codebase, and it’s now running live on timetric.

(As of writing, our changes haven’t yet made it into the upstream version of django-oauth, but you can get a hold of what we’ve done from our fork on github.)

We’ll continue to support the 1.0 workflow for the immediately-foreseeable future, but obviously at some point we’ll want to retire it in favour of the more secure 1.0a. For those of you who’ve written OAuth clients, I can highly recommend this blog post as a very nice overview of the changes in the workflow, and what you need to do to 1.0a-enable a client.
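For a flavour of what supporting both workflows means on the server side: in 1.0a the consumer sends oauth_callback when asking for a request token, and must later present the oauth_verifier it received after the user authorized it. The sketch below is a simplified illustration under those assumptions, not the code in our django-oauth fork; the function names are hypothetical.

```python
import hmac


def is_1_0a_request(request_token_params):
    """1.0a consumers send oauth_callback with the request-token call."""
    return "oauth_callback" in request_token_params


def may_exchange_token(issued_verifier, access_token_params):
    """Decide whether a request token may be exchanged for an access token.

    issued_verifier is None for tokens issued under the 1.0 workflow.
    """
    if issued_verifier is None:
        # Legacy 1.0 workflow: no verifier was issued, so none is required.
        return True
    supplied = access_token_params.get("oauth_verifier", "")
    # Constant-time comparison avoids leaking the verifier via timing.
    return hmac.compare_digest(issued_verifier, supplied)
```

A token that was issued with a verifier can only be exchanged by whoever received that verifier through the callback, which is what closes the phishing hole in the 1.0 flow.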

OAuth autodiscovery

While I was poking around with the OAuth code, I also managed to address a niggle I’ve had for a long time with OAuth. OAuth uses three separate URL endpoints to manage the token request/exchange process. These need to be published somewhere, and then any OAuth clients need to know these service-specific URLs. This is annoying: practically, because it makes it hard to write a generic OAuth client framework, and philosophically, because it offends the RESTian purist in me. Resources should be machine-discoverable, dammit.

Anyway, it turns out that there is an experimental OAuth auto-discovery spec, which piggybacks off the XRDS resource discovery scheme. It’s not final, and it seems there’s not a lot of active development on it, but I thought I’d try it out anyway. Having implemented it as an experiment, I’m actually quite happy with it. All timetric OAuth resources are now completely auto-discoverable, knowing nothing but the xrds mimetype.

The workflow goes like this: firstly, ask for the location of the XRDS resource description, by using content-negotiation on whatever OAuth-protected resource you’re trying to gain access to:

Request:
GET /resource-of-interest
Accept: application/xrds+xml

Response:
HTTP/1.1 302 Found
X-XRDS-Location: /xrds.xrds
Location: http://timetric.com/xrds.xrds
[...]

then follow the redirect;

Request:
GET /xrds.xrds

Response:
HTTP/1.1 200 OK
Content-Type: application/xrds+xml
[...]

<?xml version="1.0" encoding="UTF-8"?>
<XRDS xmlns="xri://$xrds">
[...]

The XRDS file has a well-defined XML format, and the client can parse it to pull out the location of the OAuth endpoints. This means that you can write a very generic OAuth client library; all the library needs to be told is the location of any interesting oauth-protected resources, and now it can find out everything it needs to negotiate the OAuth workflow. Helpfully, you can also use the XRDS file to advertise which forms of OAuth negotiation you support – parameters in the URI, or as HTTP headers, different signature schemes, and so on.
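As an illustration of the parsing step, here is a small sketch using only the Python standard library. The sample document and its example.com URLs are made up for the example, and the service type URIs follow my reading of the draft discovery spec, so treat both as assumptions and not as what timetric.com actually serves.

```python
import xml.etree.ElementTree as ET

# XRD namespace in ElementTree's "{namespace}tag" notation.
XRD = "{xri://$xrd*($v*2.0)}"
ENDPOINT_PREFIX = "http://oauth.net/core/1.0/endpoint/"

# Hypothetical XRDS document; the example.com URLs are placeholders.
SAMPLE_XRDS = """<?xml version="1.0" encoding="UTF-8"?>
<XRDS xmlns="xri://$xrds">
  <XRD xmlns="xri://$xrd*($v*2.0)" version="2.0">
    <Service>
      <Type>http://oauth.net/core/1.0/endpoint/request</Type>
      <URI>https://example.com/oauth/request_token</URI>
    </Service>
    <Service>
      <Type>http://oauth.net/core/1.0/endpoint/authorize</Type>
      <URI>https://example.com/oauth/authorize</URI>
    </Service>
    <Service>
      <Type>http://oauth.net/core/1.0/endpoint/access</Type>
      <URI>https://example.com/oauth/access_token</URI>
    </Service>
  </XRD>
</XRDS>
"""


def oauth_endpoints(xrds_text):
    """Map endpoint names ("request", "authorize", "access") to URIs."""
    root = ET.fromstring(xrds_text.encode("utf-8"))
    endpoints = {}
    for service in root.iter(XRD + "Service"):
        types = [el.text.strip() for el in service
                 if el.tag == XRD + "Type" and el.text]
        uris = [el.text.strip() for el in service
                if el.tag == XRD + "URI" and el.text]
        for type_uri in types:
            # Skip non-endpoint types (signature methods, parameter styles).
            if type_uri.startswith(ENDPOINT_PREFIX) and uris:
                endpoints[type_uri[len(ENDPOINT_PREFIX):]] = uris[0]
    return endpoints


print(oauth_endpoints(SAMPLE_XRDS))
```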

I’ve had an immediate benefit, because now it’s made my test framework much simpler – I don’t need to store arbitrary strings denoting my OAuth URLs, nor manipulate them every time I run tests against differently-named test servers. Since the spec isn’t final, this scheme is obviously liable to change – but it’s a nice example of how to make a service machine-discoverable.

API improvements

As Dan alluded to yesterday, this week we made a new API release.

Previously, our API basically only let you add and retrieve data. This has been useful to a whole lot of people, but there’s much more that you can do with Timetric.

The new release involves several features. There’s a bit of improvement to existing functionality to make life a bit easier when uploading data; but more excitingly, we’ve opened up access to even more of the capabilities of the timetric platform.

Search endpoints

When building applications on top of Timetric, one of the things we’ve been asked for is the ability to retrieve lists of relevant data. This might simply be to get hold of all of your own series, or it might be a list of tagged series, or it might be a complex search query.

For all of these, we’ve exposed search endpoints that let you do powerful queries across our data. You can search through the full text of our titles and descriptions, over tags, and by user. This means you can build much more useful interactive interfaces on top of Timetric. In fact, these are exactly the same endpoints that timetric.com uses internally when you browse our data.

Calculated series

Through the timetric.com website, you’ve always had the ability to build model calculations, and to filter series. We’ve now exposed this at the API level as well, so you can build these models and filters programmatically.

Cross-domain requests

If you’re a web developer, you’ll be all too familiar with the headaches of restrictions on cross-domain requests. In many cases, there are perfectly good security-related reasons for them, but these restrictions make writing some web applications much harder than it ought to be.

Fortunately, the newest generation of browsers (Firefox 3.5, IE8, and Safari 4) let you make secure cross-domain requests directly, so long as the server supports it (see https://developer.mozilla.org/en/HTTP_access_control). Since this is such a useful feature, for us as much as anyone else, we’ve enabled it so you can use it too, and build much more exciting Timetric mashups in modern browsers.
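On the server side, the core of this is a response header, Access-Control-Allow-Origin, that tells the browser which origins may read the response. The following is a generic WSGI sketch of the idea and not a description of how timetric.com implements it; allow_cross_origin and hello_app are hypothetical names, and the wildcard origin is chosen only to keep the example short.

```python
def allow_cross_origin(app, origin="*"):
    """Wrap a WSGI app so every response carries a CORS allow-origin header."""
    def wrapped(environ, start_response):
        def cors_start_response(status, headers, exc_info=None):
            # Append the CORS header to whatever the wrapped app set.
            headers = list(headers) + [("Access-Control-Allow-Origin", origin)]
            return start_response(status, headers, exc_info)
        return app(environ, cors_start_response)
    return wrapped


def hello_app(environ, start_response):
    """A trivial WSGI application used to demonstrate the wrapper."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


cors_app = allow_cross_origin(hello_app)
```

In practice you would normally restrict the origin to the sites you trust, since a wildcard lets any page read the response.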

Easier uploading

And finally, we had feedback from several people about ways in which we could make pushing data into the platform through the API a bit easier. The details are probably uninteresting unless you like constructing HTTP messages yourself (which I do, but it’s not everyone’s cup of tea!) so I’ll simply point you at the new documentation. In short, you can POST data directly, rather than having to multipart-encode it.

So …

If you’re a developer, get out there and play! We’re always happy to get any feedback – positive or negative!

New Dashboard

Yesterday we rolled out an exciting update to the Dashboard, which now looks something like this:

[Screenshot: Dashboard v2]

More important than the lick of paint, though, is all the new stuff you can do with it! Though it might sound boring, the Dashboard is all about lists. The old Dashboard had only one: a list of series you’ve ‘starred’. We’ve added two others: “My Series” for series you’ve created yourself, and “Recently Viewed”, which shows you the last few series you’ve looked at. Soon we’ll let you create your own custom lists too.

Now that you have all those series at your fingertips, you’ll want to compare and analyze them, right? Just select the ones you’re interested in, and click ‘Build’, ‘Overlay’ or ‘Versus’ at the top!

Talking about building… If you’re a developer, you can now build models through our API as well — Toby’ll be blogging about that very soon.
