Unsolicited advice for large governmental data providers

An ideal data source

We source data from a number of large national, and trans-national, statistical bodies, like the Office for National Statistics here in the UK, or Eurostat. Downloading useful data from organizations like this is sometimes a tricky job – although publishing data is usually part of their raison d’être, they’re not usually thinking of people like us – Big Data geeks – when making their data available. And often, their methods of making data available have been essentially unchanged for the past ten or fifteen years, and were probably based on processes predating the Internet to begin with.

One of the sources of value Timetric adds is simply making this data more widely available and accessible. But it’s also true that there’s so much more we could do if we could put our minds to using this data in new and exciting ways, rather than expending expertise on working out the best way to map old-fashioned data publication workflows to a web-centric way of working. So it’s an interesting question to ask – in an ideal world, how would a large statistical organization publish data for us?

There are three aspects to this question:

  1. Data transfer and formats
  2. Metadata formats and reconciliation
  3. Update frequency and notifications

1. Data transfer and formats

For us, the easiest data to deal with probably comes — and perhaps counter-intuitively — from either the ONS or Eurostat. That’s despite the fact that both of these present their data in fairly obscure, more-or-less undocumented dumps of 1980s-era databases (at a guess).

However, in both of these cases, we can download the entire database in just a few files, largely one per data release, each containing several thousand to tens of thousands of series. We don’t have to run any queries to express which data we’d like; everything simply lives at a predictable URL. We don’t want to make hundreds of queries to get different subsets of the data; we mostly just want it all (though see below).

The formats and the URL schemes could be documented much better — but we’ve already done the job of reverse engineering them. As long as they don’t change significantly, getting the files and extracting the data from them is a trivially repeatable set of operations. And each source yields a huge quantity of valuable data, so for that up-front investment in time, we get a good payoff.

For a new source, we’d be quite happy with anything along those lines. We don’t mind spending a bit of time writing a parser for a new data format, or even reverse-engineering some URL construction. That up-front cost isn’t a huge investment if there’s a lot of high-quality, repeatably downloadable data waiting for us afterwards. That said, we’d obviously much rather have the data in a well-documented, simple format, and minimize that up-front investment. You can’t go very far wrong with CSV files sitting behind well-established URLs.
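To make that concrete, here’s a minimal sketch (in Python) of the kind of ingestion a predictable URL scheme allows. The base URL and release name are invented for illustration (this isn’t any real provider’s layout), but it shows how little code the “dumps at stable URLs” model demands:

    import csv
    import io
    import urllib.request

    # Hypothetical URL scheme: one CSV dump per data release, at a stable address.
    BASE_URL = "https://stats.example.org/dumps/{release}.csv"

    def fetch_release(release):
        """Download one release dump and return its rows as dictionaries."""
        url = BASE_URL.format(release=release)
        with urllib.request.urlopen(url) as resp:
            text = resp.read().decode("utf-8")
        return list(csv.DictReader(io.StringIO(text)))

    rows = fetch_release("2011-q1-labour-market")
    print(len(rows), "rows downloaded")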

What we really don’t like is API endpoints built around the idea that you only want a few series at a time, and that you’ll be making the choice by hand. It’s no fun making thousands of HTTP connections to get each and every data series (neither for us, having to track success/failure/retries – nor for the servers, having to deal with us flooding their API). It’s also no fun trying to work out various combinations of query parameters until we get just what we want. That’s especially painful when they’re query parameters for forms originally designed to be driven by human interaction. But even when they are aimed at computer downloads, there are still far too many API developers who haven’t thought about API discoverability. (And we definitely don’t want these forms submitted by POST. Bang, there goes your cache, and our chances of getting data quickly.) All in all, we’d rather just have data dumps to download.

In short, APIs which are good for exposing small quantities of data to individual users aren’t very good for exposing large quantities of data for reuse on a large scale. Next to that, the choice of format hardly matters at all.

2. Metadata schemata, formats, and reconciliation

Again, surprisingly, there’s something to be said for the Eurostat approach to this – but this time not the ONS’s. Eurostat have a fairly cryptic set of metadata codes, encoded in a rather bizarre way within the data, which only directly apply to their own data, and are probably the result of several decades of semi-random accretion. There are no international standards in use here. On the other hand, the codes are well-documented, and once you’ve worked out how to extract and decode them, you’ve got a nice, consistent set of metadata across tens of thousands of data sets. That’s a far better state of affairs than some data suppliers, who give us little or no metadata, and certainly no well-documented background to their metadata terms (collection methods, statistical processes, industrial classifications etc.).

(The ONS, by comparison, are not useful in this regard. They are very precise about their metadata, and have reams upon reams of well-written documentation about statistical standards. However, almost none of this metadata can be linked up with their data in any automatic way. The data themselves come with nothing except very short titles, often with enigmatically and inconsistently abbreviated technical terms.)

If you’re a large well-established national or trans-national body, and you’ve got your own internal metadata — please just expose it! At least that way, we can arrange all your data consistently with respect to itself, and probably start linking the obvious bits of metadata across multiple sources. We’d much rather have that now, than wait on a perfect standard further down the road.

On the other hand, if you’re starting from scratch these days, you could do much better. Our lives would be much easier if people used metadata drawn from some standardized vocabulary, so we could reconcile metadata between different suppliers. If you were beginning the process today, the obvious place to start is SDMX (and see the tutorial from the European Central Bank).

At the moment, we have to do a lot of that reconciliation ourselves. You can automate surprisingly large amounts of the work, but by no means all of it. It still requires human intervention, and often from someone who’s fairly economically literate. An enormous amount of the work we’ve done has gone into building tools that let us leverage that human intervention as much as possible, developing semi-automated workflows for metadata reconciliation (along the lines sketched below).
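The vocabulary, supplier terms and matching threshold below are invented for illustration (this isn’t our actual reconciliation pipeline), but the sketch shows the shape of such a workflow: accept the matches that clear a confidence threshold, and queue everything else for a human to resolve.

    from difflib import SequenceMatcher

    # Toy canonical vocabulary; in practice this would be the internal scheme
    # we reconcile every supplier's metadata against.
    CANONICAL_TERMS = ["Seasonally adjusted", "Current prices", "Chained volume measures"]

    def best_match(term, vocabulary, threshold=0.8):
        """Return (match, score), or (None, score) if no match clears the threshold."""
        scored = [(v, SequenceMatcher(None, term.lower(), v.lower()).ratio())
                  for v in vocabulary]
        match, score = max(scored, key=lambda pair: pair[1])
        return (match if score >= threshold else None), score

    for supplier_term in ["Current prices (£m)", "Seas. adj.", "CVM"]:
        match, score = best_match(supplier_term, CANONICAL_TERMS)
        if match:
            print(f"{supplier_term!r} -> {match!r} ({score:.2f})")
        else:
            print(f"{supplier_term!r}: no confident match ({score:.2f}), queued for review")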

In short, ideally, everyone would use internationally-recognized standards of metadata and reporting. But if they don’t, or can’t yet, the most useful thing they can do now is to document their internal metadata systems, make them available for reuse, and mark up as much of their data with them as possible. Making that available now would be an immediate gain for everyone. Waiting around for people to map their internal metadata systems on to SDMX doesn’t help anyone nearly as much.

3. Update frequency and notifications

For most data providers on this scale, different series are updated at different times, on different release schedules. A naïve approach to dealing with this is simply to download the entire dataset daily and reprocess it to find what’s changed. That has several problems:

  • it costs us quite a bit of processing time, much of which is entirely unnecessary, meaning data isn’t available as quickly as it should be;
  • it costs the data provider bandwidth on data that’s downloaded unnecessarily, just so we can check it hasn’t changed;
  • it leaves open the question of *when* we should do this downloading. We want the data as soon as it’s released – but we have no way of finding out when that is. All we can do is download frequently enough that we aren’t likely to be too slow in catching new data (while not running afoul of either of the other two problems).

There are various ways around this. The ONS, for example, always makes data releases at 09:30 UK time (or very shortly thereafter), so that’s when we check their site. Unfortunately, they don’t tell you (in a machine-readable way) what has changed, so we still have to process an awful lot of unchanged data.

An easy way for them, or indeed anyone, to do this right is just to set accurate HTTP timestamps on the data dump files. We could then simply make a HEAD request against the URL, check the Last-Modified header, and download the contents only if they’re new (see the sketch below).
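As a rough illustration of the client side (the URL is hypothetical, and this assumes the provider sets Last-Modified correctly on its dump files), the check might look something like this in Python, using the requests library:

    import requests

    # Hypothetical dump URL; the pattern works for any file with a stable address.
    URL = "https://stats.example.org/dumps/labour-market.csv"

    def download_if_changed(url, last_seen=None):
        """Return (content, last_modified), or (None, last_seen) if nothing changed."""
        head = requests.head(url, allow_redirects=True)
        last_modified = head.headers.get("Last-Modified")
        if last_seen is not None and last_modified == last_seen:
            return None, last_seen  # nothing new; skip the download entirely

        # Conditional GET: a well-behaved server answers 304 Not Modified
        # if the file hasn't changed since the timestamp we supply.
        headers = {"If-Modified-Since": last_seen} if last_seen else {}
        resp = requests.get(url, headers=headers)
        if resp.status_code == 304:
            return None, last_seen
        return resp.content, resp.headers.get("Last-Modified")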

If they wanted to be even more helpful, they could provide notification services for us to subscribe to, letting us know as soon as data was updated. But to be perfectly honest, I wouldn’t be too concerned about them doing that. It’s another moving part to go wrong, and I’d still be very tempted to poll their URLs anyway, to catch any updates the notifications had missed.

Finally, if I were a data provider, I’d make very sure I had a good cache in front of everything. If you’re doing what we want in terms of data dumps, there’s nothing easier to cache than GETs to a set of unchanging URLs, with relatively infrequently changing contents at each URL. Even if you’re doing something else, caching should be an important part of your technical strategy.
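On the provider side, serving dumps in a cache-friendly way doesn’t take much. The app, directory and URL scheme below are purely illustrative (a sketch using Flask, not anyone’s actual setup), but they show the two things that matter: answering conditional requests, and sending explicit cache headers so any proxy or CDN in front of the application can absorb most of the load.

    from flask import Flask, send_from_directory

    app = Flask(__name__)
    DUMP_DIR = "/srv/data/dumps"  # hypothetical location of the release files

    @app.route("/dumps/<path:filename>")
    def dump(filename):
        # conditional=True makes Flask honour If-Modified-Since / ETag headers,
        # so clients and caches can revalidate cheaply instead of re-downloading.
        resp = send_from_directory(DUMP_DIR, filename, conditional=True)
        # Let intermediate caches hold each file for an hour between revalidations.
        resp.headers["Cache-Control"] = "public, max-age=3600"
        return resp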

People make mistakes, and they will end up accidentally downloading too much, too often, or accidentally letting a badly-written client run loose. You can shout at them and deny them access, but nobody’s going to be happy about that, whoever’s fault it is, and that’s still going to leave you open to the risk of people overloading you before you get round to banning them. (See the police.uk fiasco for how not to handle this!)

Thinking about your users

Sometimes I wonder if perhaps we’re atypical of the sorts of users that statistical organizations need to deal with — but on reflection, I don’t think that’s true. (And to the extent it is, I believe we’re in the vanguard of a much broader spectrum of people who want to reuse data like this.) We’re probably higher volume than most users right now, but we’re probably also more prepared to deal with poorly designed systems, and to persist when many wouldn’t. Organizations that put time and effort into making data useful to users like us will find their data much more widely used than those which don’t.

Ultimately, if an organization like we’re discussing here wants to make its data more useful to everyone, there’s one big lesson to be learnt, and that is: “You’re not going to be able to predict what your users want”. The best job a public data organization can do is make as much of the data (and metadata!) available as it can, in as untrammelled a form as possible. Don’t try and second guess what people want to do – just let them have everything. As long as you don’t stand in their way, people will use and re-use your data in ways you could never have foreseen.


What we mean by “reinventing business research”

You may (or may not) have seen that two weeks ago, together with Horace Dediu of Asymco, we published an interactive report about the state of the global mobile phone industry.  This is our first step towards what we’ve termed the “reinvention of business research”, and this term is what I’d like to discuss in more detail here.

Timetric is one of the many start-up companies in the data-as-a-service space.  I’m sure that multiple research reports(!) will tell you that this sector is forecast to grow substantially over the next few years, which means there’s quite a lot of interest in the things we’re up to and why.

Personally, I joined Timetric because I’m shocked by the time, resource and money being thrown away by many firms on sourcing and managing data in primitive ways.

Some examples of the things I find especially frustrating:

  • the amount of data which is duplicated because it is not shared intelligently, but sits within infinite numbers of spreadsheets that always seem like too much hassle to combine;
  • the amount of extra work which has to go into many projects, because people re-do research which someone else in the firm has already done but which they haven’t been informed about;
  • and, ultimately, the time that intelligent and expensive people waste just updating a couple of data series manually.

Like the supposed fact that we know more about the surface of Mars than the seabed of our own planet, it’s often true that we know more about what’s going on outside our organisation than inside it.

My personal feeling is that the research industry, much like the media industry, needs a rethink. The only difference is that the research industry’s average revenue per view is much higher, so it’ll get away with being inefficient for longer.

However, I do believe the timeframe within which they need to do this rethink is shrinking rapidly. This is driven both by demand-side factors, such as the fact that more and more business is being done in emerging markets, where deal sizes are smaller but data is essential, and by supply-side factors, chiefly that companies in the data-as-a-service space are making the acquisition and visualisation of data much cheaper. This means that good, reliable data needs to be cheaper, and can be cheaper. The best analysts will remain expensive, but their output will be maximised when we’re able to give them better data, quicker.

So, why do I think Timetric can help solve those problems which frustrate me so much?

We’re providing our services to analysts who want to publish their research in a user-friendly format, and to spend more of their time on analysing data rather than on finding, inputting, cleaning, checking, uploading and visualising it. If that’s something you’re interested in, just drop us an email to talk further.

Finally, it’s important to make special mention of Horace Dediu, with whom we launched our first report. He’s an incredibly forward-thinking and practical analyst, able to see the bigger picture whilst appreciating that small details can make a massive difference. We could not have asked for a more appropriate partner to launch with. Many thanks for working with us.


Stacked bar and area charts!

In the last post on this blog, you might have seen a chart of mobile phone handset sales taken from the Asymco Mobile Phone Market Report (£125 now on Timetric). It didn’t look much like the charts you’d have seen before on Timetric – in fact, it was a new chart type, a stacked area chart.

You can make your own now on Timetric! Whenever you create a graph, tick the “Stacked chart” option:

[Screenshot: creating a stacked chart]

and you’ll get something which looks like this:

and when you embed it, like this:

[Embedded chart: Internet browser market share (%) by type, from Timetric]

(It works if you’re building bar charts too, incidentally.)

Let us know what you think! It’s a great tool when you’re building market reports, which is convenient, because we’re looking for analysts who want to write them (and more on that soon, but if you’re one, email us at contact@timetric.com).


Reinventing reports: the Asymco Q4 2010 Mobile Phone Report, powered by Timetric

So we weren’t kidding about being busy. Alongside Chartroom and Benchmark, we’ve been up to a third thing: reinventing business research. Over the last couple of months, we’ve been working closely with Horace Dediu of Asymco, Fortune Magazine’s “King of Apple Analysts” and one of the leading analysts of the smartphone market.

Today, we’re launching an interactive, HTML5 (with Flash fallback), fully data-backed, readable-anywhere research report written by Horace and powered by Timetric. (And we throw in a downloadable PDF — even though, of course, they’re now officially retro).

The Asymco Mobile Phone Market Overview: Q4 2010 (table of contents here). Six months’ interactive online access plus a downloadable PDF is yours for £125 (about $199 or €169).

Here’s what we mean by interactive – total smartphone unit sales, per manufacturer:

[Embedded chart: Asymco: Mobile Phone Industry, from Timetric]

And here’s a demo video Horace Dediu has put together:

We think it’s the best way to read business research, and we’re really fortunate to be working with such a terrific analyst. We hope you like it! (We think you will.)

PS: if you’re a writer or analyst, and you want to break free from the chains of resellers and huge research factories — or just make money for what you already do for your blog or newspaper — we want to hear from you. We’re looking for great people to work with. Email us now, letting us know what sector you write about. We’re looking forward to meeting you.


Customize your charts

When we talk to people about Timetric, we always ask what features they’d like that we don’t have. A lot of people have asked for two in particular:

  • being able to change the colours of lines on our graphs;
  • being able to set custom labels for lines on legends.

Today’s a good day for all of those people! Our new graph-creation wizard — which appears any time you click an embed button to get a graph to share — lets you do both of these things, and quite a lot of other new things besides. It looks like this:

[Screenshot: the new graph-creation wizard (click for full size)]

giving you graphs which look like this — this one’s browser-adoption curves for all the versions of Chrome:

[Embedded chart: Internet browser market share (%) by version, from Timetric]

As well as setting colours and series titles, you can now:

  • choose how each series in the chart is shown – as a line, as a colour-filled area (like above), or as bars
  • set the height and width of the graph
  • toggle markers on or off (or leave it to Timetric to decide)
  • set whether you want the graph to be updated when new data comes in

We’re looking forward to seeing how people use these. The bar charts, in particular, look great, especially when you combine multiple styles on one chart:

[Embedded chart: Digital Natives Youth and Internet, from Timetric]

Drop us a note in the comments if you need a hand. (And if you’d like to lend one instead, we’re hiring, and we’d love to hear from you!)
