Upgrading the Timetric backend.
As you might have noticed, this week involved a flurry of activity at Timetric HQ – on Wednesday we pre-announced downtime for a database upgrade on Friday, but we ended up accelerating our schedule, and doing the upgrade on Thursday instead. We thought it might be useful to offer an explanation of what was going on!
The backstory is that Timetric runs on a pair of databases for its backend. One is a traditional RDBMS (Postgres, as it happens), which is used for storing all metadata; the other is a non-relational DB, used for storing all our timeseries data. It’s the latter which was at issue – we needed to make the fairly major change from using HBase to using Tokyo Cabinet/Tyrant.
When we first started building Timetric, we knew we wanted to use a non-relational DB for our timeseries data. There are a number of options out there (for a recent overview, see Bob Ippolito’s talk “Drop ACID and think about data” at PyCon 2009). We did some initial experimentation, and ended up going with HBase. It seemed like a good match – it has timestamped versioning for all its data, which seemed to fit data that’s inherently time-based; it’s a high-profile project, at the time just adopted by Apache; it’s got a lively, helpful developer community; the codebase seemed relatively robust, and was comprehensible (so much so that despite not really being a Java programmer, I was able to offer a few minor patches easily enough). And it has a very nice scaling story, up to billions of rows, which is nice to have in reserve!
Its major downside was that its performance wasn’t really ready yet for use behind an interactive website – but then, we were still in early alpha at the time. Performance improvements were high up the developer team’s wishlist, and we didn’t seem to be the only people interested in using it as a backing store for a web application, so we had hopes that things would improve. And, indeed, they did – current HBase performance is significantly better than it was 6 months ago.
Nevertheless, a few weeks ago, we decided that we couldn’t carry on working with HBase indefinitely; we took another look around, and made the choice to migrate to Tyrant.
There were a few reasons for this:
- HBase performance was improving, but it was becoming apparent that the loads we were placing on it were atypical; it was coping, but we were having to do some relatively baroque optimizations within the web application layer to get acceptable performance for large datasets.
- We’ve had people getting in touch about running private Timetric instances. For these, having one single, huge, database for holding all our data is less important – rather, we need multiple, per-instance databases. Scaling at the high-end is less important, and management at the low-end is more important; HBase isn’t trivial to manage – it’s got a lot of moving parts, and automating the management of large-scale deployments is a complex task (see, for example, SmartFrog).
- Most importantly, when you’re managing multiple instances, reliability becomes a far greater concern. Not that it matters more for any one instance, but there are more points of failure, and managing them all reliably is hard. HBase is a large system, with a relatively small user-base – while we’ve been in a position to deal with running it behind Timetric.com, we weren’t confident in our ability to do so behind multiple instances of the platform, for multiple customers.
Fundamentally, we need a much simpler, more easily manageable, and faster solution. Fortunately, Tyrant fulfils all of these criteria for us.
- It’s astonishingly fast – so much so that we’ve actually switched memcached off in a number of situations, because it’s quicker simply to get the data straight from the source. (This is partly due to the fact that Django’s interface to memcached requires you to use pickle, which is orders of magnitude slower than using simplejson for simple data structures).
- Setup is ridiculously simple. Honestly, why can’t all databases be this simple? Compilation is a bog-standard “./configure && make && make install”, there’s literally no configuration necessary to get a database going which is optimised for most common use cases, and there’s one command to start & stop it.
- Despite some reasonably hard pushing at it, we’ve had no hint of data corruption – or even transaction failures – against Tyrant. And the knowledge that it’s been heavily stress-tested elsewhere gives a nice warm fuzzy feeling.
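To make the serialization point above a little more concrete, here is a rough sketch of how you might compare round-trip costs for pickle and JSON on a simple data structure. It is not our actual benchmark: the sample series, the function names and the repeat count are all made up for illustration, and it uses the standard library’s json module (which is derived from simplejson) rather than simplejson itself. Relative timings depend heavily on the Python version and the data, so measure on your own setup rather than trusting anyone’s numbers.

```python
import json
import pickle
import timeit

# Hypothetical sample data: a small timeseries as [timestamp, value] pairs.
# The shape and size are assumptions, chosen only to have something to measure.
series = [[1240000000 + 86400 * i, float(i) * 1.5] for i in range(1000)]


def roundtrip_pickle(data):
    """Serialize and deserialize with pickle."""
    return pickle.loads(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))


def roundtrip_json(data):
    """Serialize and deserialize with the stdlib json module."""
    return json.loads(json.dumps(data))


def time_roundtrip(func, data, repeats=50):
    """Return the total seconds taken by `repeats` round trips of `data`."""
    return timeit.timeit(lambda: func(data), number=repeats)


if __name__ == "__main__":
    # Print raw timings; which one wins will vary between environments.
    print("pickle:", time_roundtrip(roundtrip_pickle, series))
    print("json:  ", time_roundtrip(roundtrip_json, series))
```

Both round trips hand back an equal structure for plain lists of numbers like this one, so for simple data the choice between them comes down to speed and portability rather than fidelity.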
So, over the last few weeks, we’ve been planning this changeover; writing the new backend interface, testing, bugfixing, load-testing, optimizing, etc., with the aim of making the changeover today.
However, yesterday morning, events overtook us. The server logs started showing worrying error messages – the sort you really don’t like to see – about missing data. On further investigation, it turned out that our HBase instance was dropping data on the floor, left, right and centre.
Fortunately, we keep good backups!! I’ll emphasise that before going any further. (I should also note that we still have no idea why the data corruption started happening – although clearly it was HBase dropping the data, I’m not laying all the blame on its head; we haven’t had time to find out what might have prompted the problem.)
We deliberated briefly about what to do – we knew we had backups, and could restore them to HBase – we’ve tested that process before, so we knew how long it would take, a matter of a few hours. On the other hand, we’d been planning to do the migration the next day anyway. The migration would take a bit longer than simply restoring the backups; and because we’d been forced into it, would take slightly longer than was originally planned. On the whole, though, we thought it was best to cut our losses and kickstart the migration immediately.
So, the result was what you saw yesterday! Stress levels were fairly high all day (not helped by one of our number being in self-imposed quarantine – Dan came down with a cold (not swine flu!) yesterday morning, so kept himself at home, though he was working hard online). Nevertheless, the migration went almost entirely smoothly; much better than the worst case scenarios I’d dreamt up, and as you saw we were up again before the day was out.
And the result is fairly impressive, I think. Currently, most of the improvements are either fairly well-hidden, or amount to support for additional features which we’ll be building out in the next short while. But the most noticeable gain is that the whole site is much snappier now – page load times, especially for large series, have dropped dramatically. Upload times are hugely improved as well, which is letting us do some quite exciting things.