Upgrading the Timetric backend.

As you might have noticed, this week involved a flurry of activity at Timetric HQ – on Wednesday we pre-announced downtime for a database upgrade on Friday, but we ended up accelerating our schedule, and doing the upgrade on Thursday instead. We thought it might be useful to offer an explanation of what was going on!

The backstory is that Timetric runs on a pair of databases for its backend. One is a traditional RDBMS (Postgres, as it happens), which is used for storing all metadata; the other is a non-relational DB, used for storing all our timeseries data. It’s the latter which was at issue – we needed to make the fairly major change from using HBase to using Tokyo Cabinet/Tyrant.

When we first started building Timetric, we knew we wanted to use a non-relational DB for our timeseries data. There are a number of options out there (for a recent overview, see Bob Ippolito’s talk “Drop ACID and think about data” at PyCon 2009). We did some initial experimentation, and ended up going with HBase. It seemed like a good match – it has timestamped versioning for all its data, which seemed to fit data that’s inherently time-based; it’s a high-profile project, at the time just adopted by Apache; it’s got a lively, helpful developer community; the codebase seemed relatively robust, and was comprehensible (so much so that despite not really being a Java programmer, I was able to offer a few minor patches easily enough). And it has a very nice scaling story, up to billions of rows, which is nice to have in reserve!

Its major downside was that its performance wasn’t really yet ready for use behind an interactive website – nevertheless, we were still in early alpha at the time. Performance improvements were high up the developer team’s wishlist, and we didn’t seem to be the only people interested in using it as backing store for a web application, so we had hopes that things would improve. And indeed they did – current HBase performance is significantly better than it was 6 months ago.

Nevertheless, a few weeks ago, we decided that we couldn’t carry on working with HBase indefinitely; we took another look around, and made the choice to migrate to Tyrant.

There were a few reasons for this:

Fundamentally, we need a much simpler, more easily manageable, and faster solution. Fortunately, Tyrant fulfils all of these criteria for us. 

So, over the last few weeks, we’ve been planning this changeover; writing the new backend interface, testing, bugfixing, load-testing, optimizing, etc., with the aim of making the changeover today.
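To give a flavour of what a swappable backend interface can look like, here is a minimal sketch in Python. It is an illustration only, not Timetric's actual code: the names `TimeseriesStore`, `InMemoryStore`, `put_series` and `get_series` are invented for this example, and a plain dict stands in for the real key-value server. The idea is simply that the rest of the application talks to one small interface, so the storage engine behind it can be replaced.

```python
import json
from abc import ABC, abstractmethod


class TimeseriesStore(ABC):
    """Hypothetical interface: the rest of the app talks only to this."""

    @abstractmethod
    def put_series(self, series_id, points):
        """Store a full series given as (timestamp, value) pairs."""

    @abstractmethod
    def get_series(self, series_id):
        """Return the stored (timestamp, value) pairs, or [] if absent."""


class InMemoryStore(TimeseriesStore):
    """Stand-in for a key-value server: one serialized value per series key."""

    def __init__(self):
        self._kv = {}

    def put_series(self, series_id, points):
        # Sort by timestamp, then serialize to a single value under one key.
        ordered = sorted(points, key=lambda p: p[0])
        self._kv[series_id] = json.dumps(ordered)

    def get_series(self, series_id):
        raw = self._kv.get(series_id)
        if raw is None:
            return []
        # JSON turns tuples into lists, so convert each point back to a tuple.
        return [tuple(p) for p in json.loads(raw)]


store = InMemoryStore()
store.put_series("example", [(2, 1.5), (1, 0.5)])
print(store.get_series("example"))
```

A second class implementing the same two methods against a real server could then be dropped in without touching the calling code, which is what makes this kind of changeover testable ahead of time.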

However, yesterday morning, events overtook us. The server logs started showing worrying error messages – the sort you really don’t like to see – about missing data. On further investigation, it turned out that our HBase instance was dropping data on the floor, left, right and centre.

Fortunately, we keep good backups!! I’ll emphasise that before going any further. (I should also note that we still have no idea why the data corruption started happening – although clearly it was HBase dropping the data, I’m not laying all the blame on its head; we haven’t had time to find out what might have prompted the problem.)

We deliberated briefly about what to do – we knew we had backups, and could restore them to HBase – we’ve tested that process before, so we knew how long it would take, a matter of a few hours. On the other hand, we’d been planning to do the migration the next day anyway. The migration would take a bit longer than simply restoring the backups; and because we’d been forced into it, would take slightly longer than was originally planned. On the whole, though, we thought it was best to cut our losses and kickstart the migration immediately.

So, the result was what you saw yesterday! Stress levels were fairly high all day (not helped by one of our number being in self-imposed quarantine – Dan came down with a cold (not swine flu!) yesterday morning, so kept himself at home, though he was working hard online). Nevertheless, the migration went almost entirely smoothly; much better than the worst-case scenarios I’d dreamt up, and as you saw, we were up again before the day was out.

And the result is fairly impressive, I think. Currently, most of the improvements are either fairly well-hidden, or amount to support for additional features which we’ll be building out in the next short while. But the most noticeable gain is that the whole site is much snappier now – page load times, especially for large series, have dropped dramatically. Upload times are hugely improved as well, which is letting us do some quite exciting things.
