Thanks to everyone who came along last night to DJUGL, to see me (and Nicholas Tollervey, and Mat Clayton) speak.
My topic for the night was “Scaling search to a million pages, with Solr, Python and Django”. I’ve put the slides up at SlideShare (direct PDF link) if anyone wants them.
The tl;dr is summarized on the last-but-one slide. If you want to scale your search across millions of pages, and still return good results to your users, then you need to pay attention to some details at the small scale, and some details at the large scale.
At the small scale, you need to spend time thinking about how to construct your index schema. What queries do you want to be able to run, and what information do you need to present when your search results come back? The shape of your index schema needs to be driven entirely by the answers to these two questions, and those answers depend heavily on the shape of your data, and the way your users want to interact with it.
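As a rough illustration of letting those two questions drive the schema, here is a minimal Python sketch. It relies only on the fact that a Solr field can be marked as indexed (searchable) and/or stored (returnable in results). The `FieldSpec` class, the `derive_schema` helper and the field names are all hypothetical, invented for this example; they are not taken from the talk or from any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field in a hypothetical index schema."""
    name: str
    indexed: bool  # can we search, filter or sort on it?
    stored: bool   # can we get its value back to display in results?


def derive_schema(query_fields, display_fields):
    """Build field specs from the two questions:
    what do we query on, and what do we show in results?"""
    names = sorted(set(query_fields) | set(display_fields))
    return [
        FieldSpec(name, indexed=name in query_fields, stored=name in display_fields)
        for name in names
    ]


# Assumed example: searching web pages by title/body/date,
# and showing title, URL and date in the results list.
QUERY_FIELDS = {"title", "body", "published"}
DISPLAY_FIELDS = {"title", "url", "published"}

schema = derive_schema(QUERY_FIELDS, DISPLAY_FIELDS)
for field in schema:
    print(f"{field.name}: indexed={field.indexed} stored={field.stored}")
```

In this made-up case the page body ends up indexed but not stored, and the URL stored but not indexed, which is the kind of decision that keeps an index small as the corpus grows.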
At the large scale, each installation will have its own problems, but three things you’ll almost certainly need to pay attention to are:
- Decoupling reading from and writing to the index. They have very different performance characteristics (and writing presents special problems if you’re updating documents as well as adding brand new documents).
- Working out the right balance of adding/committing/optimizing data. This will be driven by the frequency with which you add data, and how soon you need to be able to serve results from newly-added data. Must it be immediate, or can you wait seconds/minutes/hours?
- Fine-tuning your tokenizers/analyzers. Although small and fiddly, this is an issue which will bite you more and more as a corpus of data grows. You’ll need to tweak your indexing algorithms away from the defaults; extracting relevant results from a pile of a million documents is much harder than from a few thousand.
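The add/commit balance in the second point can be sketched in a few lines of Python. This is only one possible approach, under stated assumptions: documents are buffered and committed either when enough have piled up or when the oldest one has waited too long. `BatchingIndexer`, `RecordingClient` and their parameters are hypothetical names made up for this example. The client is a stand-in that just records calls; a real one would talk to Solr, and this is not sunburnt's API.

```python
import time


class RecordingClient:
    """Stand-in for a real Solr client (assumption): it only records calls."""

    def __init__(self):
        self.added = []
        self.commits = 0

    def add(self, docs):
        self.added.extend(docs)

    def commit(self):
        self.commits += 1


class BatchingIndexer:
    """Buffer documents; commit when the batch is big enough or old enough."""

    def __init__(self, client, max_pending=1000, max_delay=60.0, clock=time.monotonic):
        self.client = client
        self.max_pending = max_pending  # commit after this many buffered docs
        self.max_delay = max_delay      # ...or once the oldest has waited this many seconds
        self.clock = clock
        self.pending = []
        self.oldest = None  # time at which the oldest buffered doc was queued

    def add(self, doc):
        if not self.pending:
            self.oldest = self.clock()
        self.pending.append(doc)
        too_many = len(self.pending) >= self.max_pending
        too_old = self.clock() - self.oldest >= self.max_delay
        if too_many or too_old:
            self.flush()

    def flush(self):
        """Send everything buffered so far and commit once."""
        if self.pending:
            self.client.add(self.pending)
            self.client.commit()
            self.pending = []
            self.oldest = None


client = RecordingClient()
indexer = BatchingIndexer(client, max_pending=3, max_delay=60.0)
for i in range(7):
    indexer.add({"id": i})
indexer.flush()  # push out the final partial batch
print(client.commits, len(client.added))
```

Tuning `max_pending` and `max_delay` is where the "immediate, or seconds/minutes/hours?" question turns into numbers. Note that the age check only runs when a new document arrives, so a real system would also want a periodic timer calling `flush()`.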
I also took the opportunity to plug my Python/Solr library, sunburnt. It’s a work in progress, but it’s battle-tested here at Timetric. If you’re trying to use Solr in any interesting Python project, I think its API is worth a look.