Sunburnt: a python-solr interface

Over the last few months, we’ve been hard at work behind the scenes at Timetric, and a few of the results are now to be seen on the website. If you’ve been paying close attention, you might have noticed the appearance of machine tags, and of the ability to search series by value.

These are both reflections of one of the biggest changes we’ve made – we’ve entirely replaced the search infrastructure the site runs on. We’re now backed by Apache Solr, and we’ve written a new Python-Solr interface, called sunburnt.

We let users search using both free-text search and drill-down tagging. We used to run these on a combination of Postgres-backed full-text search and django-tagging, but this combination wasn’t particularly satisfactory. Unsurprisingly, when you’re trying to add search infrastructure to a site, what you really want is a proper search-engine backend.

For a mature, full-featured, well-supported open-source search engine, the choice boils down to Solr or Xapian. We were strongly tempted by the latter, since there’s no shortage of Xapian expertise around Cambridge, but in the end we were swayed by Solr’s Apache licensing over Xapian’s GPL.

And although there are a number of existing Python-Solr interfaces, none of them did what we wanted: an intelligent and robust Pythonic API that lets you pass arbitrary objects in and out of Solr. So we built our own, and called it sunburnt.

Sunburnt is most directly comparable to Haystack, but with a couple of major differences. Firstly, it’s not restricted to Django model data, and secondly, it’s schema-driven rather than schema-generating — it lets you construct your own Solr schema, and automatically derives all the type-checking/conversion/coercion code necessary to map your objects to and from the Solr index, when constructing queries and exchanging data.

The only documentation at the moment is the examples below, but the code is all up on Github. Patches and contributions are more than welcome!

Sunburnt in use

To start indexing and querying, you initialize a SolrInterface with your schema. At the moment, you need to do this by passing in the schema XML yourself; sunburnt won’t query the Solr server for its schema.

solr_interface = sunburnt.SolrInterface("http://localhost:8983", "schema.xml")
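To make “schema-driven” a little more concrete, here is a minimal sketch of how a field-to-converter table could be derived from a schema. It is not sunburnt’s actual code: the trimmed schema, the CONVERTERS table and the field_converters helper are all hypothetical, and a real schema.xml carries many more types and attributes.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical, heavily trimmed schema.xml for illustration only.
SCHEMA_XML = """
<schema name="example" version="1.1">
  <types>
    <fieldType name="string" class="solr.StrField"/>
    <fieldType name="int" class="solr.IntField"/>
    <fieldType name="date" class="solr.DateField"/>
  </types>
  <fields>
    <field name="title" type="string" indexed="true" stored="true"/>
    <field name="views" type="int" indexed="true" stored="true"/>
    <field name="last_modified" type="date" indexed="true" stored="true"/>
  </fields>
</schema>
"""


def parse_solr_date(value):
    # Solr date fields use UTC timestamps of the form 2009-01-01T00:00:00Z.
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")


# Assumed mapping from Solr field classes to Python converters.
CONVERTERS = {
    "solr.StrField": str,
    "solr.IntField": int,
    "solr.DateField": parse_solr_date,
}


def field_converters(schema_xml):
    """Return {field name: converter} derived from the schema's declarations."""
    root = ET.fromstring(schema_xml.strip())
    # Map each declared type name (e.g. "int") to its Solr class.
    type_classes = {t.get("name"): t.get("class") for t in root.iter("fieldType")}
    return {
        f.get("name"): CONVERTERS[type_classes[f.get("type")]]
        for f in root.iter("field")
    }


convert = field_converters(SCHEMA_XML)
views = convert["views"]("42")  # string from the index becomes an int
```

The point of the design is that the schema is the single source of truth: add a field to schema.xml and the conversion logic follows, with no per-model configuration.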

To index objects, add() them to the interface. sunburnt doesn’t care what form the data comes in, so long as:

  • if it looks like an object, it has attributes named according to fields defined in the schema
  • if it looks like a dictionary, it has keys named according to fields defined in the schema

class Document(object):
    def __init__(self, title, contents):
        self.title = title
        self.contents = contents
 
documents = [
   {"title":"This is a dictionary", "contents":"Lorem ipsum"},
   Document("This is an object", "dolor")
]
 
solr_interface.add(documents)
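The “object or dictionary” rule above can be sketched in a few lines. This is only an illustration of the idea, not how sunburnt implements it; extract_fields and SCHEMA_FIELDS are hypothetical names, and the field list is assumed to match the title/contents example.

```python
from collections.abc import Mapping

# Assumed field names, matching the title/contents example above.
SCHEMA_FIELDS = ("title", "contents")


def extract_fields(doc, field_names=SCHEMA_FIELDS):
    """Return {field: value} from either a dict-like or an attribute-bearing doc."""
    if isinstance(doc, Mapping):
        # Dictionary-like: look fields up by key, ignoring unknown keys.
        return {name: doc[name] for name in field_names if name in doc}
    # Anything else: look fields up as attributes.
    return {name: getattr(doc, name) for name in field_names if hasattr(doc, name)}


class Document(object):
    def __init__(self, title, contents):
        self.title = title
        self.contents = contents


from_dict = extract_fields({"title": "This is a dictionary", "contents": "Lorem ipsum", "extra": 1})
from_object = extract_fields(Document("This is an object", "dolor"))
```

Keys or attributes that aren’t in the schema are simply ignored in this sketch; a stricter implementation might prefer to raise an error instead.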

If you haven’t set your Solr instance up to do autocommit, then you might want to commit your documents to the index:

solr_interface.commit()

after which the documents are searchable. The API is fairly close to that offered by Haystack (and indeed Django’s QuerySet) – unsurprisingly, since they’re solving similar problems.

solr_interface.query("This")

does what you might expect, searching on the default field.

solr_interface.query("This").filter("dictionary")

while chaining with filter() allows you to choose which parts of your queries are cached by Solr.
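For background: Solr’s select handler takes the main query as a q parameter and any number of filter queries as repeated fq parameters, and it caches the result set of each fq separately. The sketch below assumes that query() and filter() end up as q and fq respectively, which we haven’t spelled out above; select_params is a hypothetical helper, not part of sunburnt.

```python
from urllib.parse import urlencode


def select_params(query, filters=()):
    """Build the query string for a Solr select request (illustrative only)."""
    params = [("q", query)]
    # Each filter becomes its own fq parameter, so Solr can cache it independently.
    params.extend(("fq", f) for f in filters)
    return urlencode(params)


query_string = select_params("This", ["dictionary"])
print(query_string)  # q=This&fq=dictionary
```

Putting a frequently reused restriction into a filter means its cached result can be shared across many different main queries.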

For fields representing numbers or dates, searching by range is useful; for example,

solr_interface.query("This", last_modified__gt="2009-01-01")

if you have a last_modified field in your schema. Queries can be faceted – if you had tags on your objects, you might do this:

solr_interface.query("This").facet_by("tags", limit=20, mincount=1)

and if you want to search for similar documents, you can do a more-like-this query (in this case, looking for similarity in the tags field):

solr_interface.query("This").mlt("tags")
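The last_modified__gt keyword used earlier borrows Django’s double-underscore lookup convention. One plausible translation into Lucene range syntax is sketched below. The lookup_to_range helper and the gte/lt/lte suffixes are assumptions modelled on Django rather than taken from sunburnt, and a full ISO 8601 timestamp is used because that is what Solr date fields generally expect.

```python
# Lucene range syntax: [a TO b] is inclusive, {a TO b} is exclusive,
# and * stands for an open end.
RANGE_TEMPLATES = {
    "gt": "{%s TO *}",
    "gte": "[%s TO *]",
    "lt": "{* TO %s}",
    "lte": "[* TO %s]",
}


def lookup_to_range(keyword, value):
    """Turn e.g. ('last_modified__gt', v) into 'last_modified:{v TO *}'."""
    field, _, operator = keyword.rpartition("__")
    if not field or operator not in RANGE_TEMPLATES:
        raise ValueError("not a range lookup: %r" % keyword)
    return "%s:%s" % (field, RANGE_TEMPLATES[operator] % value)


clause = lookup_to_range("last_modified__gt", "2009-01-01T00:00:00Z")
print(clause)  # last_modified:{2009-01-01T00:00:00Z TO *}
```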

sunburnt doesn’t support all of the Solr API, but it gives you access to a goodly portion, and all of these operations are chainable.

solr_interface.query("This").filter("dictionary").\
    facet_by("tags", limit=20, mincount=1).\
    query(last_modified__gt="2009-01-01").paginate(rows=10)
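One common way to get this kind of chaining, familiar from Django’s QuerySet, is for every method to return a modified copy rather than changing the object in place. The class below is a hypothetical sketch of that pattern, not sunburnt’s implementation; ChainableQuery and its params layout are invented for illustration.

```python
import copy


class ChainableQuery(object):
    """Hypothetical query builder; every method returns a modified copy."""

    def __init__(self):
        self.params = {"q": [], "fq": [], "rows": None}

    def _clone(self):
        # Deep copy so the lists inside params are not shared between copies.
        return copy.deepcopy(self)

    def query(self, term):
        clone = self._clone()
        clone.params["q"].append(term)
        return clone

    def filter(self, term):
        clone = self._clone()
        clone.params["fq"].append(term)
        return clone

    def paginate(self, rows):
        clone = self._clone()
        clone.params["rows"] = rows
        return clone


base = ChainableQuery().query("This")
narrowed = base.filter("dictionary").paginate(rows=10)
```

Because base is left untouched, it can be reused as the starting point for several different refinements without them interfering with each other.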

Finally, having set up a query, you can get a result object back by execute()ing the query. This bit of the API is still a bit rough around the edges, but

r = solr_interface.query(...).execute()
r.result # has the main query results
r.facet_counts # has the faceting results
r.more_like_these # has any more_like_these results

and if you poke around in that object, it has all the rest of the information that Solr provides.

Sunburnt in practice

The code is now running live on Timetric, and has been problem-free for us. We’ve been able to throw away scads of code working around djangosearch/django-tagging shortcomings, and performance is significantly better all round, especially for anything involving tagging. Most usefully though, it’s provided us with a platform to start experimenting with new navigation features much more rapidly.

Postscript: djangosearch/django-tagging shortcomings

djangosearch, though easy to set up (if you’re already using postgres), offers very little in the way of control over various parameters and options you might want to tune, and requires filtering/escaping of some queries. (Searching for a string with “£” in it causes interesting errors!)

django-tagging had slightly more serious issues:

  • we had to maintain our own fork of the codebase due to a couple of long-standing issues: corner cases which upstream weren’t interested in fixing. (One that kept biting us was its lack of support for models with non-integer primary keys. Easily fixed, but support is still not included in upstream django-tagging.)
  • extending its functionality turned out to involve writing a lot of hand-tuned and often inherently slow SQL. Writing related-tags functionality was particularly painful – it involves inverting the index, which is very time-consuming – we had to do that offline.
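
To show what “inverting the index” means for related tags, here is a small in-memory sketch. It is purely illustrative: the data, invert and related_tags are hypothetical, and it says nothing about how django-tagging or Solr do the work internally.

```python
from collections import Counter, defaultdict


def invert(doc_tags):
    """Turn {doc_id: tags} into {tag: set of doc_ids}."""
    index = defaultdict(set)
    for doc_id, tags in doc_tags.items():
        for tag in tags:
            index[tag].add(doc_id)
    return index


def related_tags(tag, doc_tags, index):
    """Count how often other tags co-occur with the given tag."""
    counts = Counter()
    for doc_id in index.get(tag, ()):
        counts.update(t for t in doc_tags[doc_id] if t != tag)
    return counts.most_common()


# Made-up example data: three series and their tags.
doc_tags = {
    1: {"gdp", "uk"},
    2: {"gdp", "us"},
    3: {"gdp", "uk", "inflation"},
}
index = invert(doc_tags)
related = related_tags("gdp", doc_tags, index)
```

Doing the equivalent in SQL means joining the tagging table against itself, which is where the cost came from; a search engine maintains the inverted structure as a matter of course.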

9 Responses to Sunburnt: a python-solr interface

  1. Eric Pugh says:

    Love to see a new Python Solr library. I’ve been thinking that having easier access to what is encoded in the schema.xml would be nice for many different types of projects.

  2. John says:

    Interesting! How does this differ from the Python and other clients at http://www.lucidimagination.com/search/document/CDRG_ch11_11.5 ? Maybe you should contrib or post on the apache wiki?

  3. Toby White says:

    I wrote a separate post comparing the existing Python interfaces, at http://eaddrinu.se/blog/2010/sunburnt.html – in short, there’s nothing wrong with the existing clients, but they didn’t expose the interface I wanted.

  4. claudio says:

    great work. that’s exactly what i needed.
    it would indeed make more sense to take the schema directly from the solr http.

  5. Casper says:

    Thanks for sharing, with a little tweaking it seems to work on Google App Engine as well. By the way, may I recommend using the query param wt=json when selecting from Solr, just to avoid the overhead of XML.

    • Toby says:

      Glad to hear it was useful – can you share what you needed to do to make it work on AppEngine?

      I don’t use wt=json, because I have to deal with the overhead of XML when sending messages to Solr anyway, so it’s easier for me to not have to worry about two serialization formats!

      • Gora Khargosh says:

        I believe none of the dependencies of sunburnt
        work with Google App Engine:
        1. lxml — uses Python/C API, so is immediately ruled out.
        2. mxDatetime — uses Python/C API, so is immediately ruled out.
        3. pytz — is available as gaepytz, but I’m not sure whether it would work as well as pytz itself.

        I’m also wondering how Casper got sunburnt to work on App Engine. Do let people know if it is at all possible.

        Thanks.
        Gora Khargosh.

        • Casper says:

          The tweaking involves getting rid of the C dependencies: lxml is replaced by xml.etree.ElementTree, and urllib is replaced by urllib2. So to be more specific, it works on App Engine, but since my usage is limited to specific functionality, mainly “select”, it works for me just fine. I haven’t dealt with mxDatetime yet, but I guess it’s no problem finding or coding a replacement. Sorry if my comment was a bit misleading.

  6. Royce says:

    Is it possible to add python dictionaries that contain sub-dictionaries? For example,

    dict(author=dict(book=dict(name='1984')))

    If possible, how do you represent this type of document in a Solr schema?

    -Royce
