2010-11-04

Scraping red dots

Over the past few weeks, I’ve been to a couple of events which have had a common theme of data - or more precisely, what effect access to data can have on the way our organisations behave. First was a quick trip to London for the October Facebook Developer Garage. The overarching theme last month was Government and Data, so there were various speakers from the intersection of technology geekery and policy wonkery that has sprung into existence over the last year or so.

One of the speakers was Chris Thorpe - I’m not quite sure how best to describe him, other than he makes and thinks about stuff online - and he was talking about how you make public data personal and pertinent. One of his examples was data from Guildford Borough Council, which he’d visualised by showing the ebbs and flows of spending over time.

When you think about it, it’s perfectly logical that spending on the Lido would be greater during the summer than the winter - but when you make that sort of visualisation, you also realise that there are spikes around the end of financial periods. Is that because this is when the invoices come in, or is it frantic efforts to empty the budget before the end of the financial year?

The point here is that raw statistics are all very well, but they’re only half the story. If your local hospital has had 300 cases of MRSA diagnosed in the last year, that might be a bad thing (particularly if it’s you that’s caught it). But whether it’s a disaster or a triumph depends on more information - if there were 3,000 cases the year before then you can argue that things are improving, whereas if it started from a base of 3, something’s gone wrong.
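To put rough numbers on that, here is a minimal Python sketch of the arithmetic. The function name percent_change is my own invention for illustration, and the figures are the hypothetical ones from the paragraph above, not real hospital data.

```python
def percent_change(previous, current):
    """Return the change from previous to current as a percentage of previous."""
    if previous == 0:
        # A baseline of zero has no meaningful percentage change.
        raise ValueError("percentage change is undefined for a baseline of zero")
    return (current - previous) * 100.0 / previous


# Hypothetical figures from the text: 300 cases this year in both scenarios.
this_year = 300
from_high_base = percent_change(3000, this_year)  # last year was 3,000 cases
from_low_base = percent_change(3, this_year)      # last year was 3 cases

print(f"From 3,000 cases: {from_high_base:+.0f}%")
print(f"From 3 cases: {from_low_base:+.0f}%")
```

The same headline figure of 300 comes out as a 90% fall in one case and a 9,900% rise in the other, which is the whole point: the number on its own tells you very little without the baseline.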

Which is why I’m also a bit wary of sites like Schooloscope. Although it’s built by people with brains the size of planets and the noblest of intentions, it still makes me slightly queasy that on the surface at least, every facet of a school’s existence is boiled down to a smiley face on the side of a cartoon building.

Yes, there IS much more nuanced information behind that, and yes, it’s not THAT difficult to find - but there’s still a snap smiley/frowny judgement plotted on a map. And the data is sourced from Ofsted - talk to virtually any teacher for more than a few minutes, and they’ll tell you how that information is the product of the poster child for the box-ticking-as-a-substitute-for-dealing-with-the-underlying-issues approach.

Making snap judgements and not bothering to question the information is a symptom of what Schuyler Erle has called “red dot fever” - maps plastered in splodges of data points as an alternative to analysis. I’ve got a depressing feeling that there’s the same red dot fever when it comes to open public sector data - give a lazy journalist a Freedom of Information request and they’ll turn out 1,000 words of “why, oh why” about waste and inefficiencies.

And open data can be used as a weapon, too. I’m convinced that the raft of stories in the last few days about apparent junketing at the Audit Commission - and that organisation’s impending abolition - are completely un-coincidental.

Small wonder that some organisations are going to great lengths to ensure that their data is worse than useless - “it was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard’.”

Which is where the second of the two events comes in. Scraperwiki is the product of a startup that does something that on the face of it is relatively simple - it enables you to scrape websites to retrieve data. By exploiting the fact that the web is a structured medium, with a few clever tools and some persistence it’s possible to harvest data into a form which then becomes usable for analysis.
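For a flavour of what that looks like in practice, here is a rough sketch using nothing but Python's standard library. To be clear, this is not Scraperwiki's own code or API: the class TableScraper, the function scrape_table and the sample page are all made up for illustration, and a real scraper would fetch the page over the network and cope with far messier markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # the row currently being read, if any
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# An invented page standing in for a council spending table; not real data.
SAMPLE_PAGE = """
<table>
  <tr><th>Month</th><th>Department</th><th>Spend</th></tr>
  <tr><td>July</td><td>Lido</td><td>1200</td></tr>
  <tr><td>December</td><td>Lido</td><td>300</td></tr>
</table>
"""


def scrape_table(html):
    """Turn the first row into column names and the rest into records."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_PAGE)
for record in records:
    print(record)
```

The structure of the page does the hard work: because rows and cells are marked up consistently, a few dozen lines are enough to turn a web page into records you can sort, sum or plot.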

Scraperwiki does more than this, though - because it glues all the tools that you’ll need into a single cohesive application, it dramatically lowers the cost-of-entry to this kind of work. If you’re a half-competent geek, chances are you’ll be able to rig up a Ruby interpreter, an XML parser and a database so that you can pipe the results into Google Maps. But those are “10,000 hour skills”, to misquote Malcolm Gladwell - what Scraperwiki and tools like it do is to start to put those capabilities into the hands of civilians.

At the moment, there’s a culture change beginning to take place inside the public sector with data becoming seen as something that’s public property. It’s going to take a long time, because there’s a lot of cultural inertia to overcome. I think we’ll find that most of the data that gets released will tell the tale of decent people doing the best they can in trying circumstances. But some of it will shine some bright lights into some murky areas, so it must be tempting for organisations to think that they can limit the potential damage by burying the data away in obscure formats, or freezing it into a PDF.

But that’s a mistake, because it’ll cause an arms race. This is the kind of behaviour that your average geek sees as a challenge. Given enough time, neurons and lines of code, the information is going to get extracted, parsed, analysed and published - hiding it away is only postponing the inevitable.