23 March 2006
Open Data in the Age of Exponential Science
There's a very interesting article in this week's Nature, as part of its 2020 Computing Special (which miraculously is freely available even to non-subscribers), written by Alexander Szalay and Jim Gray.
I had the pleasure of interviewing Gray a couple of years back. He's a Grand Old Man of the computing world, with a hugely impressive curriculum vitae; he's also a thoroughly charming interviewee with some extremely interesting ideas. For example:

I believe that Alan Turing was right and that eventually machines will be sentient. And I think that's probably going to happen in this century. There's much concern that that might work out badly; I actually am optimistic about it.
The Nature article is entitled "Science in an exponential world", and it considers some of the problems that the vast scaling up of Net-based, collaborative scientific endeavour is likely to bring us in the years to come. Here's one key point:

A collaboration involving hundreds of Internet-connected scientists raises questions about standards for data sharing. Too much effort is wasted on converting from one proprietary data format to another. Standards are essential at several levels: in formatting, so that data written by one group can be easily read and understood by others; in semantics, so that a term used by one group can be translated (often automatically) by another without its meaning being distorted; and in workflows, so that analysis steps can be executed across the Internet and reproduced by others at a later date.
The same considerations apply to all open data in the age of exponential science: without common standards that allow data from different groups, gathered at different times and in varying circumstances, to be brought together meaningfully in all sorts of new ways, the openness is moot.
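To make the point concrete, here is a minimal sketch (with entirely hypothetical formats and field names) of the kind of drudgery the Nature authors are describing: two groups record the same observation differently, and only a shared schema - same fields, same units - lets the data be combined.

```python
import json

# Group A stores temperatures in Fahrenheit, as comma-separated text.
group_a_row = "station-42,2006-03-23,68.0"

# Group B stores Celsius readings as JSON, with different field names.
group_b_record = json.loads(
    '{"site": "station-42", "day": "2006-03-23", "temp_c": 20.0}'
)

def normalise_a(row):
    """Convert a Group A row into the shared schema (Celsius)."""
    station, date, temp_f = row.split(",")
    return {"station": station, "date": date,
            "temp_c": round((float(temp_f) - 32) * 5 / 9, 2)}

def normalise_b(rec):
    """Map Group B's field names onto the shared schema."""
    return {"station": rec["site"], "date": rec["day"],
            "temp_c": rec["temp_c"]}

combined = [normalise_a(group_a_row), normalise_b(group_b_record)]
# Both records now agree: same fields, same units, directly comparable.
```

Multiply this by hundreds of collaborators and dozens of formats, and the appeal of agreeing on the schema up front - rather than writing converters pairwise - is obvious.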
Posted by Glyn Moody at 10:24 pm 0 comments
Labels: alan turing, alexander szalay, exponential science, jim gray, nature, netcraft, open data
13 December 2005
Driving Hard
Hard discs are the real engines of the computer revolution. More than rising processing speeds, it is constantly expanding hard disc capacity that has made most of the exciting recent developments possible.
This is most obvious in the case of Google, which now not only searches most of the Web, and stores its (presumably vast) index on cheap hard discs, but also offers a couple of Gbytes of storage to everyone who uses its Gmail service. Greatly increased storage has also driven the MP3 revolution. The cheap availability of gigabytes of storage means that people can - and do - store thousands of songs, and now routinely expect to have every song they want on tap, instantly.
Yet another milestone was reached recently, when even the Terabyte (=1,000 Gbytes) became a relatively cheap option. For most of us mere mortals, it is hard to grasp what this kind of storage will mean in practice. One person who has spent a lot of time thinking hard about such large-scale storage and what it means is Jim Gray, whom I had the pleasure of interviewing last year.
On his Web site (at Microsoft Research), he links to a fascinating paper by Michael Lesk that asks the question How much information is there in the world? (There is also a more up-to-date version available.) It is clear from the general estimates that we are fast approaching the day when it will be possible to have just about every piece of data (text, audio, video) that relates to us throughout our lives and to our immediate (and maybe not-so-immediate) world, all stored, indexed and cross-referenced on a hard disc somewhere.
Google and the other search engines already give us a glimpse of this "Information At Your Fingertips" (now where did I hear that phrase before?), but such all-encompassing Exabytes (1,000,000 Terabytes) go well beyond this.
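For those who like to see the arithmetic spelled out, a quick sanity check of the decimal storage units used above (disc vendors count in powers of 1,000, not 1,024):

```python
# Decimal (SI) storage units, as used by drive manufacturers.
units = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
         "TB": 10**12, "EB": 10**18}

gbytes_per_terabyte = units["TB"] // units["GB"]   # 1,000
tbytes_per_exabyte = units["EB"] // units["TB"]    # 1,000,000
```

So an Exabyte really is a million of those "relatively cheap" Terabyte set-ups.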
What is interesting is how intimately this scaling process is related to the opening up of data. In fact, this kind of super-scaling, which takes us to realms several orders of magnitude beyond even the largest proprietary holdings of information, only makes sense if data is freely available for cross-referencing (something that cannot happen if there are isolated bastions of information, each with its own gatekeeper).
Once again, technological developments that have been in train for decades are pushing us inexorably towards an open future - whatever the current information monopolists might want or do.
Posted by Glyn Moody at 10:35 am 0 comments
Labels: exabytes, gbytes, google gmail, hard discs, jim gray, michael lesk, microsoft research, open future, scaling, terabytes