Ethics, data and visualization

Last weekend, I represented Timetric at Science Foo Camp 2010, held at Google’s campus in California. While I was there, I gave a lightning talk about something we’ve been thinking about a lot recently: the ethics of services like ours.

The tools we use shape how we solve the problems we face. Timetric’s designed to help you solve problems through data. So what does this mean for services like ours?

Tools Shape Thought

With Timetric, we aim to make it easy for you to ask questions of, and draw conclusions from, the world’s statistics. In that light, here’s the key quote:

Our tools shape the questions we ask; therefore they shape the answers we get, and therefore they shape the conclusions we draw.

This is a hard problem. Several of us at Timetric have a background in scientific research. Our products are used by journalists. Journalists and scientists have a responsibility to be honest and trustworthy. We don’t take the responsibility this places on us lightly.

Google famously coined “don’t be evil”, but it goes further than that. Your credibility, if you rely on us, depends on our credibility, which makes credibility our business. We want you to trust the conclusions you draw using Timetric. That’s why we built it. So here’s how we work, and what we promise to you:

  • We’ll always publish the original source of our data directly alongside it.
  • We’ll give our data the best and most helpful titles we can, and we’ll surround it with as much supplementary information as we can find, so that you can work out if it’s really the data you need.
  • We won’t editorialize or fudge data. The data on Timetric is the data as we received it — all we do is transform it into Timetric’s native format and clean up any mechanical errors we find in it. What you get is the best and most transparent version of the statistics we serve that we can give you.
  • We’ll keep our visualization and analysis tools as simple, as easy to use, and as transparent as we can make them.
  • If you find mistakes in our data, and you tell us about them, we’ll fix them — and what’s more we’ll tell you what we did to fix them.
  • When we make mistakes — as everyone does — we expect you to call us on them. We’ll discuss them openly with you, and again, once we’ve diagnosed the problems, we’ll tell you what we’re doing to fix them.

We take your integrity very seriously. To do that, we have to take our own integrity just as seriously. Without that commitment, no information service deserves your trust, and your trust is the most important thing you can give us. Every day, we come to work knowing we need to earn that trust, and we’re grateful that you choose to use Timetric. We won’t forget what that means.

Posted in about us, benchmark, data, politics

DJUGL talk: Scaling search to a million pages with Solr, Python and Django

Thanks to everyone who came along last night to DJUGL, to see me (and Nicholas Tollervey, and Mat Clayton) speak.

My topic for the night was “Scaling search to a million pages, with Solr, Python and Django”. I’ve put the slides up at SlideShare (direct PDF link) if anyone wants them.

The tl;dr is summarized on the last-but-one slide. If you want to be able to scale your search across millions of pages, and still return good results for your users, then you need to pay attention to some details at the small scale, and some details at the large scale.

At the small scale, you need to spend time thinking about how to construct your index schema. What queries do you want to be able to run, and what information do you need to present when your search results come back? The shape of your index schema needs to be driven entirely by the answers to these two questions, and that depends heavily on the shape of your data, and the way your users want to interact with it.

On the large scale, each installation will have its own problems, but three things you’ll almost certainly need to pay attention to are:

  • Decoupling reading from and writing to the index. They have very different performance characteristics (and writing presents special problems if you’re updating documents as well as adding brand new documents).
  • Working out the right balance of adding/committing/optimizing data. This will be driven by the frequency with which you add data, and how soon you need to be able to serve results from newly-added data. Must it be immediate, or can you wait seconds/minutes/hours?
  • Fine-tuning your tokenizers/analyzers. Although small and fiddly, this is an issue which will bite you more and more as a corpus of data grows. You’ll need to tweak your indexing algorithms away from the defaults; extracting relevant results from a pile of a million documents is much harder than from a few thousand.
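
To make the add/commit balance concrete, here is a minimal sketch of a writer that buffers documents and only commits once a batch is big enough or old enough. It is an illustration under assumptions, not how sunburnt or Timetric does it: the class name, the `add_docs` and `commit` callables and the two thresholds are all hypothetical stand-ins for whatever your Solr client exposes.

```python
import time


class BatchingIndexWriter:
    """Buffer documents and commit them in batches.

    `add_docs` and `commit` are hypothetical callables standing in for
    your Solr client; they are not a real library API.
    """

    def __init__(self, add_docs, commit, max_docs=1000, max_delay=60.0,
                 clock=time.monotonic):
        if max_docs < 1:
            raise ValueError("max_docs must be at least 1")
        self._add_docs = add_docs
        self._commit = commit
        self._max_docs = max_docs      # commit once this many docs are waiting
        self._max_delay = max_delay    # ...or once the oldest has waited this long (seconds)
        self._clock = clock
        self._buffer = []
        self._oldest = None            # arrival time of the oldest buffered doc

    def add(self, doc):
        """Queue one document, committing if a threshold has been reached."""
        if not self._buffer:
            self._oldest = self._clock()
        self._buffer.append(doc)
        self.flush_if_due()

    def flush_if_due(self):
        """Flush when the batch is full or stale; return True if it flushed."""
        if not self._buffer:
            return False
        too_many = len(self._buffer) >= self._max_docs
        too_old = self._clock() - self._oldest >= self._max_delay
        if too_many or too_old:
            self.flush()
            return True
        return False

    def flush(self):
        """Send everything buffered to the index, then commit once."""
        if not self._buffer:
            return
        self._add_docs(list(self._buffer))
        self._commit()
        self._buffer.clear()
        self._oldest = None
```

The two thresholds map onto the questions above: `max_delay` is your answer to "how soon must new data be searchable?", and `max_docs` stops the buffer growing without bound during a bulk load. Nothing here commits on a timer by itself, so a real deployment would also call `flush_if_due()` periodically from the writing process.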

I also took the opportunity to plug my Python/Solr library, sunburnt. It’s a work in progress, but it’s battle-tested here at Timetric. If you’re trying to use Solr in any interesting Python project, I think its API is worth a look.

Posted in infrastructure, search

Timetric.com: the route to one million time-series

Over the last few months we’ve been working hard on building the range of statistics we cover here at Timetric. The other day we surpassed the one-million-series mark. We thought you might want to know how we’ve done it, especially as these series aren’t static; we actively, and automatically, check each one for changes periodically. Thousands are updated daily – check our front page for the most recently updated.

All the data in Timetric is uploaded by a subsystem we call the “Big Dataset Uploader”. This goes away and pulls in data from various organizations’ websites, FTP servers, or wherever else it’s to be found, beats it around until it’s in the right shape, then uploads it. Ideally we’d get all our data from consistent and well-defined web services; the World Bank’s API is a good example for others to follow in this regard.
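
As a rough illustration of the "beat it into shape, then check for changes" step, here is a minimal sketch. It is not the Big Dataset Uploader's actual code: the function names, the CSV column names and the choice of a SHA-256 fingerprint are all assumptions made for the example.

```python
import csv
import hashlib
import io
from datetime import date, datetime


def parse_series(raw_csv, date_column="date", value_column="value",
                 date_format="%Y-%m-%d"):
    """Turn raw CSV text into a date-sorted list of (date, float) points.

    The column names and date format are hypothetical defaults; every
    real source needs its own.
    """
    points = {}
    for row in csv.DictReader(io.StringIO(raw_csv)):
        raw_value = (row.get(value_column) or "").strip()
        if not raw_value:
            continue  # skip missing observations rather than inventing values
        day = datetime.strptime(row[date_column].strip(), date_format).date()
        points[day] = float(raw_value)  # a later row for the same date wins
    return sorted(points.items())


def fingerprint(points):
    """Stable hash of a series, for spotting changes between fetches."""
    canonical = "\n".join(f"{day.isoformat()},{value!r}" for day, value in points)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing the fingerprint alongside each series means a periodic re-fetch only needs to trigger an upload when the hash differs from the stored one.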

In general, though, the biggest help has been using proven, open source software components. We’ve been able to draw on the wealth of knowledge available and contribute back wherever possible: Toby’s sunburnt library, a Python interface to the Solr search engine, is one example.

We’ve been building Timetric almost entirely in Python. You might be surprised by that – there are much faster compiled languages – but it’s worked out well for us. Python has great libraries. In particular, with NumPy, it has very good numeric performance for a scripting language. With a small team, the productivity and maintainability advantages more than compensate for any performance hit.
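
For a flavour of what that looks like, here is a small sketch of a typical time-series calculation written the NumPy way. The function is a hypothetical example rather than anything from Timetric's codebase; the point is that the arithmetic runs over whole arrays in compiled code, with no Python-level loop.

```python
import numpy as np


def percent_change(values, periods=1):
    """Percentage change between each point and the one `periods` earlier.

    The first `periods` entries have nothing to compare against, so they
    are returned as NaN.
    """
    if periods < 1:
        raise ValueError("periods must be at least 1")
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    # One vectorized expression covers the whole series.
    out[periods:] = (values[periods:] / values[:-periods] - 1.0) * 100.0
    return out
```

Calling `percent_change(monthly_values, periods=12)` on a monthly series would give a year-on-year figure for every month with a single call.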

We’re also making use of Ubuntu, Postgres, jQuery, Tokyo Cabinet, Memcached, Git and RabbitMQ. Best-of-breed software throughout the stack, which makes our lives so much easier!

Posted in about us, data, infrastructure

Sharing Timetric with your colleagues

At Timetric, we reckon the most important way you can use data is to use it to understand things and persuade people. So we’ve been busy building things to help you out with that, and here’s a new feature which has come from that: you can now share indexes on Timetric with your friends and colleagues by email!

A lot of our pages now have an email button:

[Image: email button]

If you click that button where you see it, and fill out the form, we’ll handle the rest. Here’s how it works:

And your friend will get an email like this:

[Image: shared index email from timetric.com]

Really simple and, we hope, really useful. Let us know what you think.

Posted in Uncategorized

SVG graphs on timetric.com

When we launched timetric.com a little over a year ago, we needed a visualization solution so people could see our lovely data. We looked around, and decided that for performance and cross-browser compatibility, we’d create our own Flash widget based on the Flex Data Visualization Components. This served us well, and will continue to do so for some time, but the times they are a-changing.

Today’s browser landscape looks quite different. JavaScript performance has improved significantly, and IE6’s market share has halved. More significantly, there are a lot of browsers out there that aren’t even Flash-capable. Hello, Apple!

I’ll go into the techie details below, but the short version is this: if you used to see a big blank space in the middle of the page on Timetric, you’ll probably now see an attractive, interactive graph; if you’re used to seeing a “Loading…” indicator and then a Flash graph, you’ll probably still see that. However, if you’re feeling adventurous, you can try adding ‘?noflash’ to the end of the URL and you’ll get the cool new thing everyone’s raving about.

[Image: Timetric's SVG graphing from an iPad]

Techie Bits

We were looking for a plotting solution which matched the following criteria:

  • Standards compliant
  • Fast & light-weight
  • Cross-browser
  • Interactive (to both mouse events and touch)
  • Attractive

SVG vs. Canvas

The first criterion means that the choice is really between SVG and Canvas. The former is a true vector format, expressed in XML, and with nodes which can be manipulated via the DOM. The latter is essentially a scriptable bitmap image, which is often more efficient, but manipulation and interactivity aren’t what it’s good at. So which should we use?

When assessing the two technologies we need to consider three classes of browsers. The first is the modern, standards-compliant desktop browsers (Firefox, Chrome, Safari, Opera and hopefully soon IE9). The second includes all shipping versions of Internet Explorer. The third is the new-fangled touch-based devices like Apple’s iPhone and iPad.

The first group is the easiest and least interesting: both SVG and canvas just work.

Visitors using IE make up around 38% of our userbase, yet no shipping version of IE supports either of these technologies. Luckily, IE does support Microsoft’s version of SVG, called VML. The excanvas library can be used to emulate the canvas element, with the actual drawing done in VML. Alternatively, the excellent Raphaël library provides a vector drawing API which will output SVG to browsers that support it, and VML to IE. Take-home message: both can be made to work, but since either will result in VML nodes in IE’s DOM, it’s better to use the method that’ll take full advantage of that, namely SVG.

The touch-based devices provided the knock-out blow I was looking for. When you’re using the browser on your iPhone or iPad you’re generally looking at the page at an arbitrary zoom level. Bitmaps look pretty bad at arbitrary zoom levels, while vector graphics look clean and sharp.

[Image: vector vs. bitmap graphics at arbitrary zoom level]

That was rather a long-winded way of saying that SVG is better at drawing vector-y things like graphs.
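
To make the "true vector format, expressed in XML" point concrete, here is a minimal sketch that turns a list of values into an SVG line graph. It is written in Python purely for illustration (in the browser this job is done in JavaScript), and the function name, dimensions and colour are arbitrary assumptions for the example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def series_to_svg(values, width=300, height=100, padding=10):
    """Render a sequence of numbers as an SVG document with one polyline."""
    if len(values) < 2:
        raise ValueError("need at least two points to draw a line")
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat series
    x_step = (width - 2 * padding) / (len(values) - 1)
    coords = []
    for i, value in enumerate(values):
        x = padding + i * x_step
        # SVG's y axis points down, so larger values get smaller y.
        y = height - padding - (value - lo) / span * (height - 2 * padding)
        coords.append(f"{x:.1f},{y:.1f}")
    svg = ET.Element("svg", {"xmlns": SVG_NS,
                             "viewBox": f"0 0 {width} {height}"})
    ET.SubElement(svg, "polyline", {"points": " ".join(coords),
                                    "fill": "none",
                                    "stroke": "steelblue",
                                    "stroke-width": "2"})
    return ET.tostring(svg, encoding="unicode")
```

The line is a single `polyline` node described by coordinates rather than pixels. That is why it stays sharp at any zoom level, and why it can be picked out and manipulated through the DOM once it is in a page.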

Charting Frameworks

Having made the decision to use SVG via Raphaël, I took a look at gRaphaël. Sadly, the project seems rather immature, isn’t that well supported (only one commit in the last 9 months), and has no documentation. Since we’re heavy users of jQuery, what I really wanted was a version of the awesome Flot plotting library which could output SVG instead of Canvas.

Nothing like that seemed to exist, so I set about replacing Flot’s drawing code with Raphaël calls. The result I’m calling Raphlot, and it’s available on GitHub under the same MIT license as Flot. Not all of Flot’s functionality has been translated, but enough for it to be useful on Timetric.

I’ve made a start on the interactivity, which you can try on Timetric, but I’ll write about that in a future blog post.

What do you think? Leave your comments below!

Posted in plotting, user interface | 7 Comments