An ideal data source
We source data from a number of large national and trans-national statistical bodies, like the Office for National Statistics here in the UK, or Eurostat. Downloading useful data from organizations like these is sometimes a tricky job – although publishing data is usually part of their raison d’être, they’re not usually thinking of people like us – Big Data geeks – when making their data available. And often, their methods of making data available have been essentially unchanged for the past ten or fifteen years, and even then are probably based on processes predating the Internet.
One of the sources of value Timetric adds is simply making this data more widely available and accessible. But it’s also true that there’s so much more we could do if we could put our minds to using this data in new and exciting ways, rather than expending expertise on working out the best way to map old-fashioned data publication workflows to a web-centric way of working. So it’s an interesting question to ask – in an ideal world, how would a large statistical organization publish data for us?
There are three aspects to this question:
- Data transfer and formats
- Metadata formats and reconciliation
- Update frequency and notifications
1. Data transfer and formats
For us, the easiest data to deal with is probably — and perhaps counter-intuitively — either the ONS or Eurostat. That’s despite the fact that both of these present their data in fairly obscure, more-or-less undocumented dumps of 1980s-era databases (at a guess).
However, in both of these cases, we can download the entire database in just a few files, largely one per data release, each containing several thousand to tens of thousands of series. We don’t have to run any queries to express which data we’d like; everything simply lives at a predictable URL. We don’t want to make hundreds of queries to get different subsets of the data – we mostly just want it all (though see below).
The formats and the URL schemes could be documented much better — but we’ve already done the job of reverse engineering them. As long as they don’t change significantly, it’s a trivially-repeatable set of operations to get the files, and extract the data from them. And each source is yielding a huge quantity of valuable data, so for that up-front investment in time, we get a good payoff.
For a new source, we’d be quite happy with anything along those lines. We don’t mind spending a bit of time writing a parser for a new data format, or even reverse-engineering some URL construction. That up-front cost isn’t a huge investment if there’s a lot of high-quality data, repeatably downloadable, waiting for us afterwards. That said, obviously we’d much rather have the data in a well-documented, simple format, and minimize that up-front investment. You can’t go very far wrong with CSV files lying behind well-established URLs.
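To show how simple that workflow can be, here’s a minimal sketch – with a hypothetical provider, URL scheme, and column names, since every real source differs – where downloading a whole release is one GET to a predictable URL followed by a CSV parse:

```python
import csv
import io
import urllib.request

# Hypothetical URL scheme (our assumption): one CSV dump per data
# release, living at a stable, predictable path -- no query construction.
BASE_URL = "https://stats.example.org/dumps/{release}.csv"

def release_url(release):
    """Build the dump URL for one release -- no API queries needed."""
    return BASE_URL.format(release=release)

def parse_dump(text):
    """Parse a CSV dump into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def fetch_release(release):
    """Download an entire release dump in a single GET."""
    with urllib.request.urlopen(release_url(release)) as resp:
        return parse_dump(resp.read().decode("utf-8"))
```

Once the URL scheme and format are reverse-engineered (or, better, documented), the whole ingestion step is this trivially repeatable.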
What we really don’t like are API endpoints built around the idea that you only want a few series at a time, and that you’ll be choosing them by hand. It’s no fun making thousands of HTTP requests to fetch each and every data series – neither for us, having to track successes, failures and retries, nor for the servers, having to deal with us flooding their API. It’s also no fun trying out combinations of query parameters until we get just what we want. That’s especially painful when they’re query parameters for forms originally designed to be driven by human interaction. But even when they are aimed at computer downloads, there are still far too many API developers who haven’t thought about API discoverability. (And we definitely don’t want these forms submitted by POST. Bang – there goes your cache, and our chances of getting data quickly.) All in all, we’d rather just have data dumps to download.
In short, APIs which are good for exposing small quantities of data to individual users aren’t very good for exposing large quantities of data for reuse at scale. Next to that access pattern, the choice of format hardly matters at all.
2. Metadata schemata, formats, and reconciliation
Again, surprisingly, there’s something to be said for the Eurostat approach to this – but this time not the ONS. Eurostat have a fairly cryptic set of metadata codes, encoded in a rather bizarre way within the data, which only directly apply to their own data, and are probably the result of several decades of semi-random accretion. There are no international standards in use here. On the other hand, they are well-documented, and once you’ve worked out how to extract and decode them, you’ve got a nice, consistent set of metadata across tens of thousands of data sets. That’s a far better state of affairs than some data suppliers, who give us little or no metadata, and certainly don’t have a well-documented background to their metadata terms (collection methods, statistical processes, industrial classifications, etc.).
(The ONS, by comparison, are not useful in this regard. They are very precise about their metadata, and have reams upon reams of well-written documentation about statistical standards. However, almost none of this metadata can be linked up with their data in any automatic way. The data themselves come with nothing except very short titles, often with enigmatically and inconsistently abbreviated technical terms.)
If you’re a large well-established national or trans-national body, and you’ve got your own internal metadata — please just expose it! At least that way, we can arrange all your data consistently with respect to itself, and probably start linking the obvious bits of metadata across multiple sources. We’d much rather have that now, than wait on a perfect standard further down the road.
On the other hand, if you’re starting from scratch these days, you could do much better. Our lives would be made much easier if people used metadata drawn from some standardized vocabulary, so we could reconcile metadata between different suppliers. If you’re beginning the process today, the obvious place to start is with SDMX (and see the tutorial from the European Central Bank).
At the moment, we have to do lots of that reconciliation ourselves. You can automate surprisingly large amounts of the work, but by no means all. It definitely still requires human intervention, and often from someone who’s fairly economically literate. An enormous amount of the work we’ve done has gone into building tools that let us leverage that human intervention as effectively as possible, and into developing semi-automated workflows for metadata reconciliation.
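The skeleton of such a semi-automated workflow can be sketched very simply – here with plain string similarity from Python’s difflib, and invented vocabulary terms; real reconciliation needs far richer matching than this – by accepting high-confidence matches automatically and queueing everything else for a human:

```python
import difflib

def reconcile(terms, vocabulary, auto_threshold=0.85):
    """Match supplier metadata terms against a canonical vocabulary.

    High-confidence matches are accepted automatically; the rest are
    queued for review by a human (ideally an economically literate one).
    """
    accepted, review = {}, []
    for term in terms:
        # Score the term against every canonical entry, keep the best.
        scored = [
            (difflib.SequenceMatcher(None, term.lower(), v.lower()).ratio(), v)
            for v in vocabulary
        ]
        score, best = max(scored)
        if score >= auto_threshold:
            accepted[term] = best
        else:
            review.append((term, best, score))
    return accepted, review
```

The payoff is in the split: the machine disposes of the easy bulk, and scarce human attention is spent only on the genuinely ambiguous cases.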
In short, ideally, everyone would use internationally-recognized standards for metadata and reporting. But if they don’t, or can’t yet, the most useful thing they can do now is to document their internal metadata systems, make them available for reuse, and mark up as much of their data with them as possible. Making that available now would be an immediate gain for everyone. Waiting around for people to map their internal metadata systems onto SDMX doesn’t help anyone nearly as much.
3. Update frequency and notifications
For most data providers on this scale, different series are updated at different times, on different release schedules. A naïve approach is simply to download the entire dataset daily and reprocess it to find what’s changed. This approach has several problems:
- it costs us quite a bit of processing time, much of it entirely unnecessary, meaning data isn’t available as quickly as it should be,
- it costs the data provider bandwidth for data that’s downloaded unnecessarily, just so we can check it hasn’t changed,
- it leaves open the question of *when* we should do this downloading. We want the data as soon as it’s released – but we have no way of finding out when that is. All we can do is download frequently enough that we aren’t likely to be too slow in catching new data (while not running afoul of either of the other two problems).
There are various ways around this. The ONS, for example, always makes data releases at 09:30 UK time (or very shortly thereafter), so that’s when we check their site. Unfortunately, they don’t tell you (in a machine-readable way) what has changed, so we still have to process an awful lot of unchanged data.
An easy way for them, or indeed anyone, to do this right is just to use HTTP timestamps on data dump files. We could simply make a HEAD request to the URL, check whether the data has changed, and download the contents only if they’re new.
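One way to act on those timestamps – sketched with Python’s standard library and a made-up dump URL – is a conditional GET: we send back the Last-Modified value we saw previously as If-Modified-Since, and a well-behaved server answers 304 Not Modified with no body when nothing has changed. It’s the HEAD-then-GET check collapsed into a single request:

```python
import urllib.error
import urllib.request

def build_request(url, last_modified=None):
    """Build a conditional GET: only ask for the body if it's changed."""
    req = urllib.request.Request(url)
    if last_modified is not None:
        req.add_header("If-Modified-Since", last_modified)
    return req

def download_if_changed(url, last_modified=None):
    """Return (body, last_modified); body is None when nothing changed."""
    try:
        with urllib.request.urlopen(build_request(url, last_modified)) as resp:
            return resp.read(), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: skip the download entirely
            return None, last_modified
        raise
```

With this in place, polling frequently is cheap for both sides: an unchanged dump costs one tiny request and a 304, not a full download.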
If they wanted to be even more helpful, they could provide notification services for us to subscribe to, letting us know as soon as data was updated. But to be perfectly honest, I wouldn’t be very concerned with them doing that. It’s another moving part to go wrong, and I’d still be very tempted to poll their URLs anyway, to catch any updates the notifications had missed.
Finally, if I were a data provider, I’d make very sure I had a good cache in front of everything. If you are doing what we want in terms of data dumps, then there’s nothing easier to cache than GETs to a set of unchanging URLs, with relatively-infrequently changing contents at each URL. Even if you are using something else, then caching should be an important part of your technical strategy.
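On the provider side, making dumps cache-friendly takes very little: stable URLs plus a couple of response headers that let any HTTP cache or CDN in front of you absorb the repeat downloads. A sketch of just the header values (how they’re attached depends on your serving stack):

```python
import email.utils

def cache_headers(last_modified_ts, max_age=3600):
    """Response headers that make a data dump trivially cacheable.

    Last-Modified lets clients make conditional requests; Cache-Control
    lets intermediate caches serve repeat GETs without touching you.
    """
    return {
        "Last-Modified": email.utils.formatdate(last_modified_ts, usegmt=True),
        "Cache-Control": "public, max-age=%d" % max_age,
    }
```

Unchanging URLs with infrequently-changing contents are about the easiest thing on the web to cache; the headers above are most of what’s needed.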
People make mistakes, and they will end up accidentally downloading too much, too often, or accidentally letting a badly-written client run loose. You can shout at them and deny them access, but nobody’s going to be happy about that, whoever’s fault it is, and that’s still going to leave you open to the risk of people overloading you before you get round to banning them. (See the police.uk fiasco for how not to handle this!)
Thinking about your users
Sometimes I wonder if perhaps we’re atypical of the sorts of users that statistical organizations need to deal with — but on reflection, I don’t think that’s true. (And to the extent it is, I believe we’re in the vanguard of a much broader spectrum of people who want to reuse data like this.) We’re probably higher volume than most users right now, but we’re probably also more likely to be prepared to deal with poorly designed systems, and persist when many wouldn’t. Organizations that put time and effort into making data useful to users like us will find their data much more widely used than those who don’t.
Ultimately, if an organization like we’re discussing here wants to make its data more useful to everyone, there’s one big lesson to be learnt, and that is: “You’re not going to be able to predict what your users want”. The best job a public data organization can do is make as much of the data (and metadata!) available as it can, in as untrammelled a form as possible. Don’t try to second-guess what people want to do – just let them have everything. As long as you don’t stand in their way, people will use and re-use your data in ways you could never have foreseen.