Sphereless

Thursday, November 25, 2010

Structured data: Accessible magic?

Have you ever dreamt in SQL? Few of us have, and we only ever speak of it in hushed, yet secretly astounded tones. Database development is a weird way of looking at the world. Those who venture into it too deeply may never come back to being "normal".

But at the same time, it's also like any language - reaching a state of fluidity involves a fundamental shift in thinking. To think in French is to adopt a new philosophy, and the same is true of database logic; linking and manipulating rows of data is a world away from editing each row by hand.

To explain the importance of structured data, is it important to first get across this conceptual paradigm shift? Is the ultimate draw of structured data tied inherently to a new way of seeing the world? One in which we, as data/content hackers, don't see the data at all, but merely instruct a computer to do stuff with it.

Maybe this explains the "format divide" between those publishing in "closed" formats, and those publishing in open ones. If you do not know how to automate the data-munging process, then you do stuff by hand, you take a long time to do it, and you have absolutely no need for "structures" other than those in your head.

This happens everywhere, all the time: half the world lives in Excel during office hours. At some point, computers became popular as difference engines, but not necessarily good at being them. Human operators became part of the machines, rather than directors of them. In a way, this mirrored the huge factory production lines, and the endless supermarket checkouts, so most humans simply accepted this as the new way of life. Any sufficiently different technology is a form of magic. Processing as a manual task.

This happens today. Never forget that. Seeing data is more important than defining a structure for it, because structure is *hard*. Datasets have peculiarities, errors, and specifics that resist simple structuring. And changing these structures is effort - effort that involves communicating these changes to others. In short, it's easier to stick with "sloppy" data if you're not using the right tools. It's even easier if the other people using the data don't care about the tools either. Content is King.

So how do we bridge this divide between "manual data labour" and "magic"? On the up side, I believe it must happen, as data - and talk of data - becomes a public matter. Those not structuring their data will need to structure it, or face a new kind of exclusion - call it "un-APIness" perhaps.

But this doesn't help us to move into a culture of automation, of magic, which I think is important because it determines what you *believe* you can do with data. Understanding structured data is essential in coming up with new services, new applications, and new answers.

Working with people to build answers will help too. It's not enough to just want "raw data now" - to build bridges, we need to build real things based on data. We, as geeks, need to find out what people actually want. We need to show that questions can be answered with "magic", but also be open enough to demonstrate that structuring data has a direct impact on what can be done, and how quickly.

Change the tools. Rethink processes. It's time to end the conveyor-belt, factory line approach to data.

Sunday, January 24, 2010

UKGovCamp 2010 - a far-too-lengthy write-up

Yesterday was UK GovCamp 2010, a gathering of people interested in how (roughly defined) government can be taken forwards using the Internet. The day was crafted lovingly by Dave Briggs and hosted excitingly at Google HQ in London. Here's a quick rundown of where I was, what I saw, and what I'd attempt to think about the day after if I had any brain left.

Session 1 was an exercise in getting data geeks talking to data users - hypothetically, at least. The room broke into 5 or so groups, each looking at the problems that members of the public might face around certain topical issues, such as road chaos, sporting events, or sexual health. To get us thinking, we first asked what kind of information the public would want/need for each of these. In the second half, the data geeks migrated to a different group to see how data could help with answering this.

I'm not sure I came to any particular answers about either road chaos or sporting events, but did find it a useful way of breaking down the issue. Without realising it, I'd probably stumbled into the first recurring theme of my day - usability of data. Some notes of interest:

Data may exist in a central database, but that doesn't mean everyone will be accessing it for the same reason and/or/therefore by the same means. Different groups of people have different networks - football supporters might check their club site for news, for example. Local residents might check a council site, or a paper newsletter, or even just handy signs put up on the side of the road for future travel "alerts". A good reminder why data shouldn't be tied to a particular "portal".
It's far too easy to focus on using the latest devices to make getting data out easy. But that doesn't mean it reaches people we want it to reach. (One reason I'm so excited about Newspaper Club.)
We draw data from many, many different places to form a decision or an opinion, e.g. from local authorities, central figures, news, private sources, etc. Linked data is probably hugely important in joining all this up, but it's also a process that we, as humans, do naturally and constantly. I think there's a big question about how we tie these two worlds together. Too big for this post though.

I actually ended up kind-of starting up Session 2 which, not having really done this before, I was slightly nervous about, but in the end I'm pretty pleased with how it went. I'd decided to try to get some discussion going about How to find and filter data, which I've been thinking about a bit after the recent attention around data.gov.uk. I think the debate wandered on to the meta-topic of how to describe data, and how to share those descriptions between organisations and viewpoints, but it's a good debate to have and people seemed genuinely engaged with it.

I started by taking people through what we'd done with data4nr.net in terms of UI, XML and tying it into external services like data.gov.uk. Most excellently, Richard Stirling was on hand to fill in about the latter, which probably helped to raise the issue of how we actually tie all this data together. Notes on all this below:

One thing that came out of the talk around data.gov.uk is that duplicates appear (as everyone is cataloguing data, with a fair bit of overlap), but without any real way of knowing it. Unique IDs are like, really, really important, but even the definition of one is subject to interpretation problems. Simon Field noted that some users, for example, want to treat amended data as a "new" dataset, while others don't. "Unique" is subjective, perhaps. I get the impression this is going to take a while to bash out.
Andrew Walkingshaw of Timetric (also one of the sponsors) noted two extremes of presenting data to people - "either lie to them, or freak them out". I think the extent to which either of these is necessary depends on who you're making the data public to - or, who is your audience? Different people have different training, and therefore different expectations about what the data represents. How do we manage this, or integrate it with our processes and applications?

Maybe not everyone needs to understand data - just those in the argument? e.g. if a journalist uses some data to come to a slightly ~~suspect~~ headline-grabbing conclusion, are there people who can re-run the data and verify that? Coming out of that, do we have forums where such verifications and/or dispute can be raised legitimately?

And to return to the idea of defining metadata, there is still a question about whether definitions should be "standardised" (i.e. everyone shares the same vocabulary), or if we accept that everyone has their own "language" and the challenge is to map between these somehow. If the former, is it practical to define one in advance, or just let people make their own, in a more organic way?

I think there was lunch at this point.

Session 3 was on Using Wordpress in Government, run by Simon Dickson of Puffbox. I've been doing a fair bit of integrating PHP sites with Wordpress this year, so was interested in hearing about what other people had done with it, and how. A lot of the session seemed to be extolling the power of Wordpress rather than focusing on the grittier details of rolling it into a project, process or workplace, but it was interesting to hear where it's being used, and a great chance to finally meet Steph Gray in person.

Good to note that about half of all (central?) government departments are "dipping their toes" into Wordpress, although perhaps under the second theme of the day - covert innovation which I'll pick up at the end.

Good point from Simon - that for all the talk about re-using software, making sites, etc, "Wordpress has done it - we are doing it." Good tools make exploration easy, and make it easier to experiment with little nuggets of progress without too much risk/cost/project management. We have good tools already that mostly just need tweaking, why not use them?

Wordpress is great for swapping content between sites, as everything is available as RSS feeds. I suspect this ties into my session on finding and filtering data more than I realise.

Session 4 saw Richard Stirling talk about his week launching data.gov.uk to way more attention than he and his team expected. The launch apparently saw an average of 6,000 visits over a 3 hour period, split across 4 servers. Amused that Richard was bemused why it was such a big story ("2nd most important bit of news on Working Lunch"). Maybe it's because the British are winning back the Internet, rah.

Finally, session 5 saw Steph Gray (slide here), Anthony Zacharzewski (links to slides) and Paul Clarke talking about persuading politicians and bureaucrats of the value of digital engagement. A great talk all round, with some inspiring, and almost crafty, thoughts being put forward about how to make websites and influence people:

Talk about activities, not tools. Talk about how what you want to do results in outcomes. Decision makers like to see a direct link between what you propose and what gets saved.

Use narratives, storytelling. But be careful about who you include in your stories - different viewpoints and people are perceived in different ways. Sometimes people love the idea of appealing to the "man in the street". Other times the same man is seen as, say, unreliable or anecdotal.

Terms and words are political, as I've noted before. Use terms, especially "buzzwords" carefully, as they may "belong" to particular groups. Technical speak suffers from the same problem, I'd say. WTF do AJAX, Web2.0 and WTF mean anyway?

Themes

The two recurring threads I really picked up on during the day were:

Usability of Data - How can we make data as a whole easier for everyone to find? How do we know what data is out there, what it means, and what it can/can't be used for? How can we access it other than clever websites?

Covert Innovation - A lot of the exciting stuff in government is being done "under the radar". This, in itself, is not necessarily a problem, but there were a couple of tales around the idea that successful efforts would be prevented if they were made more public - for various reasons. I think currently there are a lot of conversations going on, but in almost hushed tones - tones which can only get loud once this success has reached critical mass and gone "mainstream" to the point where it can't be covered up any more. The tales of Gordon Brown giving Tim Berners-Lee free rein were great, but really not enough. Hiring a hugely respected scientist is quite different to trusting your own staff.

Failure is an option, even necessary, but a lot of the time organisations believe that it isn't - perhaps because they're used to thinking in terms of large scale projects (= large scale failure)? Contrariwise, a lot of the efforts seen at GovCamp were small scale innovation which can and even should fail quickly and easily (e.g. "does this Wordpress plugin do what we want?" Click. Install. "No." Learn. Move on.) The move towards opening up data is all about risk management. Bang the rocks together.

OK, this post was a little longer than I thought it was, and now my stomach is rumbling. Cheers to all for a great day, and look forward to seeing the thoughts that take place in its aftermath. Keep the momentum.

Wednesday, July 01, 2009

Transparency should not be for Blame

I was going to write a small blog post, but Peter Kawalek's discourse on what isn't said around MP expenses says what I was going to say, and far, far more elegantly.

I'm encouraged by the flurry of interest and activity surrounding the release of expenses data, but at the same time I can't help but question whether it really matters after all that.

Did people really care when this all hit the headlines? I, for one, got the impression that the weeks of blathering waffle on the radio and in the papers was being drummed up and forced on to stage by either politicians wanting to embarrass other politicians, politicians wanting to un-embarrass themselves, or media outlets looking to embarrass politicians - which, incidentally, is like taking sweets from a baby in a sweet shop.

For everyone else, talk of expenses was dull, dull, dull, and generally a good excuse to flick channel, turn off the radio, or go and do something interesting like make pasta shakers.

The point of this rant is this then: Is it important to spend time and energy releasing the kind of data that, while ideas of transparency might be in the public eye, doesn't actually either a) contribute to our understanding of the state of things, or b) offer a positive solution?

After all, the main reason for releasing expenses data is to find people to point fingers at, rather than to actually applaud MPs for not spending money. (Personally, I'm thinking of sending my MP some better coffee than the Kenco stuff he orders...)

Data can be good. Transparency can be good. But shouldn't we be careful that we're not just opening up an attitude of blame culture? Can we avoid a society where transparency and monitoring are no better than CCTV or a nanny state - a culture of wrist-slapping people for their mistakes, rather than encouraging and rewarding valued behaviour?

Monday, February 16, 2009

A Step-by-step Guide to Visualising Tweets

In response to psychemedia's request, I've put up a commented version of the code used to generate Wordles and Google Timelines.

The very rough and ready process was along the lines of:

Run the Perl script to output Tweets into CSV format. These will be ordered in forwards chronological order. The script also generates a count for the number of tweets in each 10 minute time slot, to give some idea of activity over the course of time.

Import the CSV into a fresh Google docs spreadsheet.

I found it handy to duplicate this sheet, to play with subsets of the data e.g. just tweets for certain days.

For the Wordles, I simply selected the relevant column (i.e. usernames, or tweet texts), and pasted into a decent text editor. I removed half the #ukgc09 tags for tweet texts, to stop it overpowering the rest. Then I just headed over to http://www.wordle.net/create, pasted the text, and played with formatting until I got something I like the look of.

For the Google Timeline, I created a version of the sheet with 4 columns: tweet date/time (to become the X axis), number of tweets for this 10-minute time band (for Y axis = activity), tweet author and tweet content (for the notations). Then it was just a case of opening the "Insert gadget" menu, choosing the "Interactive Time Series Chart" gadget, and setting the Range to include all the data in these columns.

I found it easiest to limit tweets to 3 days (otherwise the amount of notations causes the browser to get very slow), and to move the gadget to a separate sheet.

Then I published the whole document as a web page (see the "Share" menu in the top right of the spreadsheet - you need to publish the data for the chart to work).

And that's it. As yet, I haven't found a way to publish just the gadget - Google have code to embed it in a page, but this doesn't seem to work.

I'm pretty sure there's a whole lot more you could do - I was just intrigued as to how activity varied through the weekend. (Perhaps it's good that the wifi on the day was down - I'm not sure how well that Time Series gadget scales...) The Wordles also seem quite a nice way to remember the day. Perhaps it might be possible to generate a similar, animated version to view word/author proliferation throughout a day as well?
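For anyone who would rather not read Perl, the CSV step above can be sketched roughly as follows. This is a Python illustration and not the script I actually ran: the function names, the column headers and the (date/time, author, text) tuple shape are all assumptions made for the example. It sorts tweets into forwards chronological order, counts how many fall into each 10-minute slot, and writes the four columns the Timeline sheet needs.

```python
import csv
import io
from collections import Counter
from datetime import datetime

SLOT_MINUTES = 10  # the post counts tweets per 10-minute time slot


def slot_start(ts, minutes=SLOT_MINUTES):
    """Round a timestamp down to the start of its time slot."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)


def tweets_to_csv(tweets):
    """Return CSV text with one row per tweet, oldest first.

    `tweets` is a list of (datetime, author, text) tuples. That shape is
    an assumption for this sketch, not the original script's data model.
    """
    ordered = sorted(tweets, key=lambda t: t[0])  # forwards chronological order
    counts = Counter(slot_start(ts) for ts, _, _ in ordered)

    out = io.StringIO()
    writer = csv.writer(out)
    # Hypothetical header names: date/time, activity, author, content.
    writer.writerow(["datetime", "tweets_in_slot", "author", "text"])
    for ts, author, text in ordered:
        writer.writerow([ts.isoformat(sep=" "), counts[slot_start(ts)],
                         author, text])
    return out.getvalue()
```

Calling tweets_to_csv() on a list of tweets gives text that can be saved to a file and imported into a spreadsheet. Every tweet in the same slot carries the same count, which is what lets the count column double as the Y axis.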

Anyway, if you have any questions, leave a comment or get in touch via Twitter.

Thursday, February 12, 2009

Are we competitors? Or collaborators?

This week I've managed to catch up a little with the Power of Information Taskforce "beta" report, geniously (not a real word) put into a Wordpress site allowing anyone to head over and comment on each section.

Today I also had a quick squint at the Digital Britain interim report - or the executive summary at least. Those of you wanting to check this one out in Wordpress will want to run over to the rather less official version at writetoreply.org.

Comments and access aside, what struck me was the division in attitude taken by each. On the one hand, the POIT report seems to be about working out how we can start opening the doors, giving people access to data, encouraging experimentation, and shift data from where it's created to where people want it.

On the other hand, the exec summary for the Digital Britain report seems ensconced in the idea that we need to keep up with other countries - or, preferably, lead them in all the league tables we're able to league in.

From experience and instinct, it seems to me that those who are open to collaboration are more likely to a) produce things and b) collaborate again in future. On the other hand, seeing the world as a race just means we worry more about how we're doing in relation to others, rather than in relation to ourselves.

Perhaps, in other words, we're falling behind digitally precisely because we want to keep up with others, rather than work on what actually needs fixing, and what people actually want. Competition gives us excuses, collaboration gives us energy.

If we're going to change things, we need to start seeing everyone - and by this I mean our neighbours, our councillors, our politicians, and our friends in other countries and other industries - as potential energy, as possible links. We need more collaboration, but as long as we think of collaborators as potential competitors in some made-up chart that really doesn't mean anything, we're never going to seize the full potential of those links.

I'm not saying competition is a lie, or isn't sometimes useful. Just that being "better" than others shouldn't be our motivation - we should instead simply try to be better than ourselves.

Progress doesn't care about league tables.

Monday, February 09, 2009

Playing with UKGC09 tweets and Wordle/Google

As I said before, UKGovCamp '09 was very inspiring - and got me thinking that in this day of mash-ups and widget-gidgets, one probably doesn't need to actually do much coding at all. I put my theory to the test, wrote a quick Perl script to chirp all the #ukgc09 tweets for the day of the event, the day before, and the day after, and shoe-horned the data into a couple of places which could be prettier if I had more time.

Here's a Wordle tag cloud of all tweets over the 3-day period:

(Click to view large, you'll need Java.)

For fun, here's a similar Wordle for the Tweet authors themselves:

Finally, here's a time-series plot of all the tweets over the 3 day period, on a Google graph so you can zoom in and out and click on tweets. The y-axis indicates roughly how much twitter activity there is in a 5-minute slot.

All good, clean fun.

Update Feb 16th: Here's the code I used to grab the #ukgc09 tweets - fairly simple Perl, requires a system with the JSON library and command line Curl installed (see code for links). I used a Mac with curl already installed, and added the JSON library using "perl -MCPAN -e 'install JSON'" from the terminal. YMMV.

Saturday, February 07, 2009

Post-GovCamp Thoughts: What is Trust?

Rather than re-cap my sporadic notes from UK GovWeb Barcamp '09, I thought I'd try to pick out some of the more intriguing thoughts that occurred to me during the day. This is the first, and you should be able to track them through the "ukgc09" label below...

What is Trust?

Everyone agrees that Trust is Good. But can we really leave it at that? Trust in what? Why? And how? There are, I think, different answers to each of these, and those answers depend on - or inform - the type of political system in play.

Trust seems to overlap a lot with Transparency these days. But I'm not sure it's as simple as that. Take a simple analogy - would you trust your friends, even if they didn't tell you what they were doing? In fact, wouldn't you have to trust them if you couldn't see what they were doing? Is that the definition of trust? And if so, then...

a) Can we really talk about "trust" in a political system that encourages transparency? Does transparency come about precisely because we can't trust our politicians?

b) Why do we trust our friends, or others that we deem to be "trustworthy"? If we want to trust our politicians (because, let's face it, we don't just want to watch over them all the time like some kind of nanny - we want them to get on with the job we've entrusted them with), then how do we go about it? What systems do we need in place to build that trust?

Maybe Trust is a judgement, based on experience, character, reputation. Sometimes we get it wrong, and someone pulls the wool over our eyes. Sometimes we need evidence to start trusting someone again. But I'm not convinced that rushing to more "openness" and transparency is necessarily the best answer. We just need to be more careful about who we trust, and ask why our eyes were covered. And we need to force politicians who do betray our trust to prove they can be trusted again, but not through openness.

We have openness. And yet we still have no trust.