13 March 2006

OU on UK ID DBs

Talking of the Open University, here's an interesting research report from them on the UK Government's plans to introduce ID cards. The study looks at things from a slightly novel angle: people's attitudes to the scheme, and how they vary according to the details.

The most interesting result was that even those moderately in favour of the idea became markedly less enthusiastic when the card was compulsory and a centralised rather than distributed database was used to store the information. Since this is precisely what the government is planning to do, the research rather blows a hole in their story that the British population is simply begging them to introduce ID cards. John Lettice has provided more of his usual clear-headed analysis on the subject.

What is also fascinating is how the British public - or at least the sample interviewed - demonstrated an innate sense of how unwise such a centralised database would be. I think this argues a considerable understanding of what is on the face of it quite an abstract technical issue. There's hope yet - for the UK people, if not for the UK Government....

12 March 2006

Mozart the Blogger

To celebrate the 250th anniversary of Mozart's birth, I've been reading some of his letters, described by Einstein (Alfred, not his cousin Albert) as "the most lively, the most unvarnished, the most truthful ever written by a musician". It is extraordinary to think that these consist of the actual words that ran through Mozart's head, probably at the same time when he was composing some masterpiece or other as a background task. To read them is to eavesdrop on genius.

The other striking thing about them is their volume and detail. Mozart was an obsessive letter-writer, frequently knocking out more than one a day to his wide range of regular correspondents. And these are no quick "having a lovely time, wish you were here" scribbles on the back of a postcard: they often run to many pages, and consist of extended, complex sentences full of dazzling wordplay, describing equally rich ideas and complicated situations, or responding in thoughtful detail to points made in the letters he received.

Because they are so long, the letters have a strong sense of internal time: that is, you feel that the end of the letter is situated later than the beginning. As a result, his letters often function as a kind of diary entry, a log of the day's events and impressions - a kind of weblog without the reverse chronology (and without the Web).

Mozart was a blogger.

If this intense letter-writing activity can be considered a proto-blog, the corollary is that blogs are a modern version of an older epistolary art. This is an important point, because it addresses two contemporary concerns in one fell swoop: that the art of the letter is dead, and that there is a dearth of any real substance in blogs.

We are frequently told that modern communications like the telephone and email have made the carefully-weighed arrangement of words on the page, the seductive ebb and flow of argument and counter-argument, redundant in favour of the more immediate, pithier forms. One of the striking things about blogs is that some - not all, certainly - are extremely well written. And even those that are not so honed still represent considerable effort on the part of their authors - effort that 250 years ago was channelled into letters.

This means that far from being the digital equivalent of dandruff - stuff that scurfs off the soul on a daily basis - the growing body of blog posts represents a renaissance of the art of letter-writing. In fact, I would go further: no matter how badly written a blog might be, it has the inarguable virtue of being something that is written, and then - bravely - made public. As such, it is another laudable attempt to initiate or continue a written dialogue of a kind that Mozart would have understood and engaged with immediately. It is another brick - however humble - in the great edifice of literacy.

For this reason, the current fashion to decry blogs as mere navel-gazing, or vacuous chat, is misguided. Blogs are actually proof that more and more people - 30,000,000 of them if you believe Technorati - are rediscovering the joy of words in a way that is unparalleled in recent times. We may not all be Mozarts of the blog, but it's better than silence.

11 March 2006

Open University Meets Open Courseware

Great news (via Open Access News and the Guardian): the Open University is turning a selection of its learning materials into open courseware. To appreciate the importance of this announcement, a little background may be in order.

As its fascinating history shows, the Open University was born out of Britain's optimistic "swinging London" culture of the late 1960s. The idea was to create a university open to all - one on a totally new scale of hundreds of thousands of students (currently there are 210,000 enrolled). It was evident quite early on that this meant using technology as much as possible (indeed, as the history explains, many of the ideas behind the Open University grew out of an earlier "University of the Air" idea, based around radio transmissions.)

One example of this is a close working relationship with the BBC, which broadcasts hundreds of Open University programmes each week. Naturally, these are open to all, and designed to be recorded for later use - an early kind of multimedia open access. The rise of the Web as a mass medium offered further opportunities to make materials available. By contrast, the holdings of the Open University Library require a username and password (although there are some useful resources available to all if you are prepared to dig around).

Against this background of a slight ambivalence to open access, the announcement that the Open University is embracing open content for at least some of its courseware is an extremely important move, especially in terms of setting a precedent within the UK.

In the US, there is already the trail-blazing MIT OpenCourseWare project. Currently, there are materials from around 1250 MIT courses, expected to rise to 1800 by 2007. Another well-known example of open courseware is the Connexions project, which has some 2900 modules. This was instituted by Rice University, but now seems to be spreading ever wider. In this it is helped by an extremely liberal Creative Commons licence that allows anyone to use Connexions material to create new courseware. MIT uses a Creative Commons licence that is similar, except that it forbids commercial use.

At the moment, there's not much to see at the Open University's Open Content Initiative site. There is an interesting link to information from the project's main sponsor, the William and Flora Hewlett Foundation, about its pioneering support for open content. This has some useful links at the foot of the page to related projects and resources.

One thing the Open University announcement shows is that open courseware is starting to pick up steam - maybe a little behind the related area of open access, but coming through fast. As with all open endeavours, the more there are, the more evident the advantages of making materials freely available become, and the more others follow suit. This virtuous circle of openness begetting openness is perhaps one of the biggest advantages that it has over the closed, proprietary alternatives, which by their very nature take an adversarial rather than co-operative approach to those sharing their philosophy.

09 March 2006

RIAA Fights to the Death for DRM - Your Death

The ever-perceptive Ed Felten has an amazing story about the Recording Industry Association of America (RIAA) and its friends-in-copyright fighting to keep DRM on people's systems in all circumstances - even those that might be life-threatening. From his post:

In order to protect their ability to deploy this dangerous DRM, they want the Copyright Office to withhold from users permission to uninstall DRM software that actually does threaten critical infrastructure and endanger lives.

In fact, it's enough to gaze (not too long, mind) at the RIAA's home page: it is a cacophony of "lawsuits", "penalties", "pirates", "theft" and "parental advisories" - a truly sorry example of narrow-minded negativity. Whatever happened to music as one of the loftiest expressions of the human spirit?

Savonarola, St. Francis - or St. IGNUcius?

There's a well-written commentary on C|Net that makes what looks like a neat historical parallel between Savonarola and Richard Stallman; in particular, it wants us to consider the GPL 3 as some modern-day equivalent of a Bonfire of the Vanities, in which precious objects were consigned to the flames at the behest of the dangerous and deranged Savonarola.

It's a clever comparison, but it suffers from a problem common to all clever comparisons: they are just metaphors, not cast-iron mathematical isomorphisms.

For example, I could just as easily set up a parallel between Stallman and St. Francis of Assisi: both renounced worldly goods, both devoted themselves to the poor, both clashed with the authorities on numerous occasions, and both produced several iterations of their basic tenets. And St. Francis never destroyed, as Savonarola did: rather, he is remembered for restoring ruined churches - just as Stallman has restored the ruined churches of software.

In fact, Stallman is neither Savonarola nor St. Francis, but his own, very special kind of holy man: St. IGNUcius of the Church of Emacs.

The Dream of Open Data

Today's Guardian has a fine piece by Charles Arthur and Michael Cross about making data paid for by the UK public freely accessible by them. But it goes beyond merely detailing the problem, and represents the launch of a campaign called "Free Our Data". It's particularly good news that the unnecessary hoarding of data is being addressed by a high-profile title like the Guardian, since a few people in the UK Government might actually read it.

It is rather ironic that at a time when nobody outside Redmond disputes the power of open source, and when open access is almost at the tipping point, open data remains something of a distant dream. Indeed, it is striking how advanced the genomics community is in this respect. As I discovered when I wrote Digital Code of Life, most scientists in this field have been routinely making their data freely available since 1996, when the Bermuda Principles were drawn up. The first of these stated:

It was agreed that all human genomic sequence information, generated by centres funded for large-scale human sequencing, should be freely available and in the public domain in order to encourage research and development and to maximise its benefit to society.

The same should really be true for all kinds of large-scale data that require governmental-scale gathering operations. Since they cannot be feasibly gathered by private companies, such data ends up as a government monopoly. But trying to exploit that monopoly by crudely over-charging for the data is counter-productive, as the Guardian article quantifies. Let's hope the campaign gathers some momentum - I'll certainly be doing my bit.

Update: There is now a Web site devoted to this campaign, including a blog.

Enter the Splogfighter

Talking of splogs, I came across (via SEO Data) the valiant Splogfighter's Blogger-based anti-splog blog. All power to whatever part of the virtual anatomy he/she/it uses in this laudable effort.

08 March 2006

Splog in a Box?

A long time ago, in a galaxy far away - well, in California, about 1994 - O'Reilly came out with something called "Internet in a Box". This wasn't quite the entire global interconnect of all networks in a handy cardboard container, but rather a kind of starter kit for Web newbies - and bear in mind that in those days, the only person who was not a newbie was Tim (not O'Reilly, the other one).

Two components of O'Reilly's Internet in a Box were particularly innovative. One was Spry Mosaic, a commercial version of the early graphical Web browser Mosaic that arguably began the process of turning the Web into a mass medium. Mosaic had two important offspring: Netscape Navigator, created by some of the original Mosaic team, and its nemesis, Internet Explorer. In fact, if you choose the "About Internet Explorer" option on the Help menu of any version of Microsoft's browser, you will see to this day the surprising words:

Based on NCSA Mosaic. NCSA Mosaic(TM) was developed at the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign.
Distributed under a licensing agreement with Spyglass, Inc.

So much for Bill Gates inventing the Internet....

The other novel component of "Internet in a Box" was the Global Network Navigator. This was practically the first commercial Web site, and certainly the first portal: it was actually launched before Mosaic 1.0, in August 1993. Unfortunately, this pioneering site was later sold to AOL, where it sank without trace (as most pioneers do when they are sold to AOL: anybody remember the amazing Internet search company WAIS? No, I thought not.)

Given this weight of history, it seems rather fitting that something called Boxxet should be announced at the O’Reilly Emerging Technology Conference, currently running in San Diego. New Scientist has the details:

A new tool offers to create websites on any subject, allowing web surfers to sit back, relax and watch a virtual space automatically fill up with relevant news stories, blog posts, maps and photos.

The website asks its users to come up with any subject they are interested in, such as a TV show, sports team or news topic, and to submit links to their five favourite news articles, blogs or photos on that subject. Working only from this data, the site then automatically creates a webpage on that topic, known as a Boxxet. The name derives from “box set”, which refers to a complete set of CDs or DVDs from the same band or TV show.

As this indicates, Boxxet is a kind of instant blog - just add favourite links and water. It seems the perfect solution for a world where people are so crushed by ennui that most bloggers can't even be bothered posting for more than a few weeks. Luckily, that's what we have technology for: to spare us all those tiresome activities like posting to blogs, walking to the shops or changing television channels by getting up and doing it manually.

It's certainly a clever idea. But I just can't see myself going for this Blog in a Box approach. Perhaps I over-rate the specialness of my merely human blogging powers; perhaps I just need to wait until the Singularity arrives in a few years' time, and computers are able to produce trans-humanly perfect blogs.

What I can see - alas - are several million spammers rubbing their hands with glee at the thought of a completely automatic way of generating spurious, self-updating blogs. Not so much Blog in a Box as Splog in a Box.

07 March 2006

The Other Grid God: Open Source

As I was browsing through Lxer.com, my eye caught this rather wonderful headline: "Grid god to head up Chicago computing institute". The story explains that Ian Foster, one of the pioneers in the area of grid computing (and the grid god in question), is moving to the Computation Institute (great name - horrible Web site).

Grid computing refers to the seamless linking together across the Internet of physically separate computers to form a huge, virtual computer. It's an idea that I've been following for some time, not least because it's yet another area where free software trounces proprietary solutions.

The most popular toolkit for building grids comes from the Globus Alliance, and this is by far the best place to turn to find out about the subject. For example, there's a particularly good introduction to grid computing's background and the latest developments.

The section dealing with grid architecture notes that there is currently a convergence between grid computing and the whole idea of Web services. This is only logical, since one of the benefits of having a grid is that you can access Web services across it in a completely transparent way to create powerful virtual applications running on massive virtual hardware.

The Globus Alliance site is packed with other resources, including a FAQ, a huge list of research papers on grids and related topics, information about the Globus Toolkit, which lets you create grids, and the software itself.

Open source's leading position in the grid computing world complements a similar success in the related field of supercomputing. As this chart shows, over 50% of the top 500 supercomputers in the world run GNU/Linux; significantly, Microsoft Windows does not even appear on the chart.

This total domination of top-end computing - be it grids or supercomputers - by open source is one of the facts that Microsoft somehow omits to tell us in its "Get The Facts" campaign.

06 March 2006

Blogging Newspapers

One of the interesting questions raised by the ascent of blogs is: What will the newspapers do? Even though traditional printed titles are unlikely to disappear, they are bound to change. This post, from the mysteriously-named "Blue Plate Special" blog (via C|Net's Esoteric blog) may not answer that question, but it does provide some nutritious food for thought.

It offers its views on which of the major US dailies blog best, quantified through a voting system. Although interesting - and rich fodder for those in need of a new displacement activity - the results probably aren't so important as the criteria used for obtaining them. They were as follows:

Ease-of-use and clear navigation
Currency
Quality of writing, thinking and linking
Voice
Comments and reader participation
Range and originality
Explain what blogging is on your blogs page
Show commitment

The blog posting gives more details on each, but what's worth noting is that most of these could be applied to any blog - not just those in newspapers. Having recently put together my own preliminary thoughts on the Art of the Blog, I find that these form a fascinating alternative view, and with several areas of commonality. I strongly recommend all bloggers to read the full article - whether or not you care about blogging newspapers.

05 March 2006

Google Googlied by Spaiku Adages

Today was a black day in the annals of my Gmail account: I received my first piece of spam. You might think I should be rejoicing that I've only ever received one piece of spam, but bear in mind that this is a relatively new account, and one that I've not used much. Moreover, Gmail comes with spam filtering as standard: you might hope that Google's vast computing engines would be able consistently to spot spam.

So far they have: the spam bucket of my account lists some 42 spam messages that Google caught. The question is: why did Google get googlied by this one? It's not particularly cunning: it has the usual obfuscated product names (it's one of those), with some random characters and the usual poetic signoff.

Actually, now that I come to check, this turns out to be slightly special:

Work first and then rest.
Actions speak louder than words.
Old head and young hand.

Maybe this is Gmail's Achilles' heel: it is defenceless in the face of spam haiku (spaiku?) adages.

04 March 2006

The European Digital Library: Dream, but Don't Touch

With all the brouhaha over the Google Book Search Library Project, it is easy to overlook other efforts directed along similar lines. I'm certainly guilty of this sin of omission when it comes to The European Library, about which I knew nothing until very recently.

The European Library is currently most useful for carrying out integrated searches across many European national libraries (I was disappointed to discover that neither Serbia nor Latvia has any of my books in their central libraries). Its holdings seem to be mainly bibliographic, rather than links to the actual text of books (though there are some exceptions).

However, a recent press release from the European Commission seems to indicate that The European Library could well be transmogrified into something altogether grander: The European Digital Library. According to the release:

At least six million books, documents and other cultural works will be made available to anyone with a Web connection through the European Digital Library over the next five years. In order to boost European digitisation efforts, the Commission will co-fund the creation of a Europe-wide network of digitisation centres.

Great, but it adds:

The Commission will also address, in a series of policy documents, the issue of the appropriate framework for intellectual property rights protection in the context of digital libraries.

Even more ominously, the press release concludes:

A High Level Group on the European Digital Library will meet for the first time on 27 March 2006 and will be chaired by Commissioner Reding. It will bring together major stakeholders from industry and cultural institutions. The group will address issues such as public-private collaboration for digitisation and copyrights.

"Stakeholders from industry and cultural institutions": but, as usual, nobody representing the poor mugs who (a) will actually use this stuff and (b) foot the bill. So will our great European Digital Library be open access? I don't think so.

The Amazing Amazon Mechanical Turk

OK, so I may be well behind the times, but I still found this rather amazing when I came across it. Not so much for what it is - a version of Google Answers - but for the fact that Amazon is doing it.

Google I can understand: its Answers service is reaching the parts its other searches cannot - a complement to the main engine (albeit a tacit admission of defeat on Google's part: resorting to wetware, whatever next?). But Amazon? What has a people-generated answer service got to do with selling things? Come on Jeff, focus.

Cool name, though.

Digg This, It's Groovy

Digg.com is a quintessentially Web 2.0 phenomenon: a by-the-people, for-the-people version of Slashdot (itself a key Web 1.0 site). So Digg's evolution is of some interest as an example of part of the Net's future inventing itself.

A case in point is the latest iteration, which adds a souped-up comment system (interestingly, this comes from the official Digg blog, which is on Blogger, rather than self-hosted). Effectively, this lets you digg the comments.

An example is this story: New Digg Comment System Released!, which is the posting by Kevin Rose (Digg's founder) about the new features. Appropriately enough, this has a massive set of comments (nearly 700 at the time of writing).

The new system's not perfect - for example, there doesn't seem to be any quick way to roll up comments which are initially hidden (because they have been moderated away), but that can easily be fixed. What's most interesting is perhaps the Digg sociology - watching which comments get stomped on vigorously, versus those that get the thumbs up.

Tying the Kangaroo Down

If any proof were needed that some people still don't really get the Internet, this article is surely it. Apparently Australia's copyright collection agency wants schools to pay a "browsing fee" every time a teacher tells students to browse a Web site.

Right.

So, don't tell me: the idea is to ensure that students don't use the Web, and that they grow up less skilled in the key enabling technology of the early twenty-first century, that they learn less, etc. etc. etc.?

Of course, the fact that more and more content is freely available under Creative Commons licences, or is simply in the public domain, doesn't enter into the so-called "minds" of those at the copyright collection office. Nor does the fact that by making this call they not only demonstrate their extraordinary obtuseness, but also handily underline why copyright collection agencies are actually rather irrelevant these days. And that rather than waste schools' time and money paying "browsing fees", Australia might perhaps do better to close down said irrelevant, clueless copyright office, and save some money instead?

03 March 2006

Beyond Parallel Universes

One of the themes of this blog is the commonality between the various opens. In a piece I wrote for the excellent online magazine LWN.net, I've tried to make some of the parallels between open source and open access explicit - to the point where I set up something of a mapping between key individuals and key moments (Peter Suber at Open Access News even drew a little diagram to make this clearer).

My article tries to look at the big picture, largely because I was trying to show those in the open source world why they should care about open access. At the end I talk a little about specific open source software that can be used for open access. Another piece, on the Outgoing blog (subtitle: "Library metadata techniques and trends"), takes a closer look at one particular kind of such software: that for repositories (where you can stick your open access materials).

This called forth a typically spirited commentary from Stevan Harnad, which contains a link to yet more interesting words from Richard Poynder, a pioneering journalist in the open access field, with a blog - called "Open and Shut" (could there be a theme, here?) - that is always worth taking a look at. For example, he has a fascinating interview on the subject of the role of open access in the humanities.

Poynder rightly points out that there is something of a contradiction in much journalistic writing about open access, in that it is often not accessible itself (even my LWN.net piece was subscribers-only for a week). And so he's bravely decided to conduct a little experiment by providing the first section of a long essay, and then asking anyone who reads it - it is freely accessible - and finds it useful to make a modest donation. I wish him well, though I fear it may not bring him quite the income he is hoping for.

01 March 2006

There's No INSTEDD without Open Access

An interesting story in eWeek.com. Larry Brilliant, newly-appointed head of the Google.org philanthropic foundation, wants to set up a dedicated search engine that will spot incipient disease outbreaks.

The planned name is INSTEDD: International Networked System for Total Early Disease Detection - a reference to the fact that it represents an alternative option to just waiting for cataclysmic infections - like pandemics - to happen. According to the article:

Brilliant wants to expand an existing web crawler run by the Canadian government. The Global Public Health Intelligence Network monitors about 20,000 Web sites in seven languages, searching for terms that could warn of an outbreak.

What's interesting about this - apart from the novel idea of spotting outbreaks around the physical world by scanning the information shadow they leave in the digital cyberworld - is that to work it depends critically on having free access to as much information and as many scientific and medical reports as possible.

Indeed, this seems a clear case where it could be claimed that not providing open access in relevant areas - and the range of subjects that are relevant is vast - is actually endangering the lives of millions of people. Something for publishers and their lawyers to think about, perhaps.
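For the curious, the basic idea the article describes - scanning pages for terms that could warn of an outbreak - can be sketched in a few lines of Python. This is only a toy under stated assumptions: the watch-list, the threshold and the sample pages are all made up for illustration, and nothing here reflects how the Canadian system or the proposed INSTEDD actually works.

```python
import re
from collections import Counter

# Hypothetical watch-list; the real system's terms are not given in the article.
WATCH_TERMS = ["outbreak", "fever", "quarantine"]


def count_watch_terms(text, terms=WATCH_TERMS):
    """Count whole-word, case-insensitive occurrences of each watch term."""
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[term] = hits
    return counts


def flag_pages(pages, threshold=2):
    """Return the names of pages whose total term hits reach the threshold."""
    return [
        name
        for name, text in pages.items()
        if sum(count_watch_terms(text).values()) >= threshold
    ]


# Made-up sample pages, standing in for fetched, openly readable reports.
pages = {
    "report-a": "Clinic reports fever cluster; officials fear an outbreak.",
    "report-b": "Local team wins the cup.",
}
print(flag_pages(pages))
```

The point of the sketch is simply that a scanner of this kind can only count what it is allowed to read: any report sitting behind a paywall never makes it into `pages` in the first place.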

Higgins: Social Web, Social Commerce

Identity is a slippery thing at the best of times. On the Internet it's even worse (as the New Yorker cartoon famously encapsulated). But identity still matters, and sorting it out is going to be crucial if the Internet is to continue moving into the heart of our lives.

Of course, defining local solutions is easy: that's why you have to remember 33 different passwords for 33 different user accounts (you do change the password for each account, don't you?) at Amazon.com and the rest. The hard part is creating a unitary system.

The obvious way to do this is for somebody to step forward - hello Microsoft Passport - and to offer to handle everything. There are problems with this approach - including the tasty target that the central identity stores represent for ne'er-do-wells (one reason why the UK Government's proposed ID card scheme is utterly idiotic), and the concentration of power it creates (and Microsoft really needs more power, right?).

Ideally, then, you would want a completely modular, decentralised approach, based on open source software. Why open source? Well, if it's closed source, you never really know what it's doing with your identity - in the same way that you never really know what closed software in general is doing with your system (spyware, anyone?).

Enter Higgins, which not only meets those requirements, but is even an Eclipse project to boot. As the goals page explains:

The Higgins Trust Framework intends to address four challenges: the lack of common interfaces to identity/networking systems, the need for interoperability, the need to manage multiple contexts, and the need to respond to regulatory, public or customer pressure to implement solutions based on trusted infrastructure that offers security and privacy.

Perhaps the most interesting of these is the "multiple contexts" one:

The existence of common identity/networking framework also makes possible new kinds of applications. Applications that manage identities, relationships, reputation and trust across multiple contexts. Of particular interest are applications that work on behalf of a user to manage their own profiles, relationships, and reputation across their various personal and professional groups, teams, and other organizational affiliations while preserving their privacy. These applications could provide users with the ability to: discover new groups through shared affinities; find new team members based on reputation and background; sort, filter and visualize their social networks. Applications could be used by organizations to build and manage their networks of networks.

The idea here seems to be a kind of super-identity - a swirling bundle of different cuts of your identity that can operate according to the context. Although this might lead to fragmentation, it would also enable a richer kind of identity to emerge.

As well as cool ideas, Higgins also has going for it the backing of some big names: according to this press release, those involved include IBM, Novell, the startup Parity Communications (Dyson Alert: Esther's in on this one, too) and the Berkman Center for Internet & Society at Harvard Law School.

The latter is also involved in SocialPhysics.org, whose aim is

to help create a new commons, the "social web". The social web is a layer built on top of the Internet to provide a trusted way to link people, organizations, and concepts. It will provide people more control over their digital identities, the ability to more easily find other people and groups, and more control over how they are seen by others across diverse contexts.

There is also a blog, called Social Commerce, defined as "e-commerce + social networking + user-centric identity". There are lots of links here, as well as on the SocialPhysics site. Clearly there's much going on in this area, and I'm sure I'll be returning to it in the future.

28 February 2006

Open Source, Opener Source

Brian Behlendorf is an interesting individual: one of those quietly-spoken but impressive people you meet sometimes. When I talked to him about the birth of Apache - which he informed me was not "a patchy" server, as the folklore would have it, just a cool name he rather liked - he was working at CollabNet, which he had helped to found.

He's still there, even though the company has changed somewhat from those early days. But judging from a characteristically thought-provoking comment reported in one of ZDNet's blogs, he's not changed so much, and is still very much thinking and doing at the leading edge of open source.

In the blog, he was reported as saying that he saw more and more "ordinary people" being enfranchised as coders thanks to the new generation of programming models around - notably Ajax and the mashup - that made the whole process far easier and less intimidating.

If he's right, this is actually a profound shift, since ironically the open source model has been anything but open to the general public. Instead, you had to go through a kind of apprenticeship before you could stalk the hallowed corridors of geek castle.

If open really is becoming opener, it will be no bad thing, because the power of free software depends critically on having lots of programmers, and a good supply of new ones. Anything that increases the pool from which they can be drawn will have powerful knock-on effects.

Blogroll, Drumroll

This fellow Blogger blogger is well worth taking a look at if you're interested in science and technology (well, that's everybody, isn't it?). Not so much for the blog entries - which are interesting enough - but for the astonishing blogroll, which includes several hundred links to a wide variety of interesting-looking sites, all neatly categorised. I have literally never seen anything like it - but maybe I'm just provincial.

I found Al Fin - for such is its suitably gnomic moniker - through another site that is worth investigating. Called Postgenomic (well, with a name like that, I had to take a look), it "collates posts from life science blogs and then does useful and interesting things with that data" according to the site. The "interesting things" seem to amount largely to collation of data from various sources; there is also a good blog list, though not quite as impressive as Al Fin's.

Wanted: a Rosetta for the MegaWikipedia

As I write, Wikipedia has 997,131 articles - close to the magic, if totally arbitrary, one million (if we had eleven fingers, we'd barely be halfway to the equally magic 1,771,561): the MegaWikipedia.
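For anyone who wants to check the eleven-finger arithmetic, here is a minimal sketch. It assumes that "eleven fingers" means counting in base 11, so that the counterpart of one million (10 to the power 6) is 11 to the power 6; the function names are made up for illustration.

```python
# Assumption: "eleven fingers" means counting in base 11, so the
# counterpart of one million (10**6) is 11**6.
ARTICLES = 997_131  # article count quoted in the post


def magic_number(base: int, digits: int = 6) -> int:
    """Return base**digits, the base-N counterpart of 10**6."""
    return base ** digits


def progress(articles: int, base: int) -> float:
    """Return the fraction of the base-N 'magic number' already reached."""
    return articles / magic_number(base)


if __name__ == "__main__":
    print(magic_number(10))                  # the ten-finger milestone
    print(magic_number(11))                  # the eleven-finger milestone
    print(f"{progress(ARTICLES, 11):.1%}")   # share of the base-11 milestone
```

On these figures the count comes out at a little over half of the base-11 milestone, which is roughly what "barely halfway" suggests.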

Except that it's not there, really. The English-language Wikipedia may be approaching that number, but there are just eight other languages with more than 100,000 articles (German, French, Italian, Japanese, Dutch, Polish, Portuguese and Swedish), and 28 with more than 10,000. Most have fewer than 10,000.

Viewed globally, then, Wikipedia is nowhere near a million articles on average, across the languages.

The disparity between the holdings in different languages is striking; it is also understandable, given the way Wikipedia - and the Internet - arose. But the question is not so much Where are we? as Where do we go from here? How do we bring most of the other Wikipedias - not just five or six obvious ones - up to the same level of coverage as the English one?

Because if we don't, Wikipedia will never be that grand, freely-available summation of knowledge that so many hope for: instead, it will be a grand, freely-available summation of knowledge for precisely those who already have access to much of it. And the ones who actually need that knowledge most - and who currently have no means of accessing it short of learning another language (for which they probably have neither the time nor the money) - will be excluded once more.

Clearly, this global Wikipedia cannot be achieved simply by hoping that there will be enough volunteers to write all the million articles in all the languages. In any case, this goes against everything that free software has taught us - that the trick is to build on the work of others, rather than re-invent everything each time (as proprietary software is forced to do). This means that most of the articles in non-English tongues should be based on those in English. Not because English is "better" as a language, or even because its articles are "better": but simply because they are there, and they provide the largest foundation on which to build.

A first step towards this is to use machine translations, and the new Wikipedia search engine Qwika shows the advantages and limitations of taking this approach. Qwika lets you search in several languages through the main Wikipedias and through machine-translations of the English version. In effect, it provides a pseudo-conversion of the English Wikipedia to other tongues.

But what is needed is something more thoroughgoing, something formal - a complete system for expediting the translation of all of the English-language articles into other languages. And not just a few: the system needs to be such that any translator can use it to create new content based on the English items. The company behind Ubuntu, Canonical, already has a system that does something similar for people who are translating open source software into other languages. It's called, appropriately enough, Rosetta.

Now that the MegaWikipedia is in sight - for Anglophones, at least - it would be the perfect time to move beyond the successful but rather ad hoc approach currently taken to creating multilingual Wikipedia content, and to put the Net's great minds to work on the creation of something better - something scalable: a Rosetta for the MegaWikipedia.

What better way to celebrate what is, for all the qualifications above, truly a milestone in the history of open content, than by extending it massively to all the peoples of the world, and not just to those who understand English?

27 February 2006

(B)looking Back

I wondered earlier whether blogified books were bloks or blooks, and the emerging view seems to be the latter, not least because there is now a Blooker Prize, analogous to the (Man)Booker Prize for dead-tree stuff.

I was delighted to find that the Blooker is run by Lulu.com, discussed recently by Vic Keegan in the Guardian. Lulu is essentially micro-publishing, or publishing on demand: you send your digital file, they send the physical book - as many or as few copies as you like. You can also create music CDs, video DVDs and music downloads in the same way; Lulu.com handles the business end of things, and takes a cut for its troubles.

Nonetheless, the prices are extremely reasonable - if you live in the US: as Vic points out, the postage costs for books, for example, tend to nullify the attractiveness of this approach for anyone elsewhere in the world, at least from a financial point of view. But I don't think that this will be a problem for long. For Lulu.com is the brainchild of Bob Young, the marketing brains behind Red Hat, still probably the best-known GNU/Linux distribution for corporates.

I emphasise the marketing side, since the technical brains behind the company was Marc Ewing, who also named the company. As he explained to me when I was writing Rebel Code:

In college I used to wear my grandfather's lacrosse hat, which was red and white striped. It was my favourite hat, and I lost it somewhere in Philadelphia in my last year. I named the company to memorialise the hat. Of course, Red and White Hat Software wasn't very catchy, so I took a little liberty.

Young, a Canadian by birth, was the perfect complement to the hacker Ewing. He is the consummate salesman, constantly on the lookout for opportunities. His method is to get close to his customers, to let them tell him what he should be selling them. The end-result of this hands-on approach was that he found himself in the business of selling a free software product: GNU/Linux. It took him a while to understand this strange, topsy-turvy world he tumbled into, but being a shrewd chap, and a marketeer of genius, he managed it when he finally realised:

that the one unique benefit that was creating this enthusiasm [for GNU/Linux] was not that this stuff was better or faster or cheaper, although many would argue that it is all three of those things. The one unique benefit that the customer gets for the first time is control over the technology he's being asked to invest in.

Arguably it was this early understanding of what exactly he was selling - freedom - that helped him make Red Hat the first big commercial success story of the open source world: on 11 August 1999, the day of its IPO, Red Hat's share price went from $14 to $52, valuing the company that sold something free at $3.5 billion.

It also made Young a billionaire or thereabouts. He later left Red Hat, but has not lost the knack for pursuing interesting ideas. Even if Lulu.com isn't much good for those of us on the wrong side of the Atlantic, it can only be a matter of time before Bob listens to us Brit users (to say nothing of everyone else in the world outside the US) and puts some infrastructure in place to handle international business too.

26 February 2006

The First Blogger - and His Chaos

Wandering around the Net (as one does) I came across this: certainly one of the least-attractive sites that I've seen in a long time. But as soon as I noticed that familiar face in the top left-hand corner, I knew where I was: back in Chaos Manor.

Now, for younger readers, those two words might not mean much, but for anyone privileged enough to have lived through the early years of the microcomputing revolution, as chronicled in Byte magazine (now a rather anonymous Web site), they call forth a strange kind of appalled awe.

For Jerry Pournelle's columns - which still seem to exist in cyber-versions if you are a subscriber - consisted of the most mind-numbingly precise descriptions of his struggles to install software or add an upgrade board to one of his computers, all of which were endowed with names like "Anastasia" and "Alex".

Along the way he'd invariably drop in references to what he was doing while conducting this epic struggle, the latest goings-on in space exploration (one of his enthusiasms) plus the science-fiction book he was writing at the time (he always seemed to be writing new ones each month - impressive).

The net effect was that his articles ran to pages and pages of utterly irrelevant - but compulsively fascinating - detail about his daily and working life. I half-dreaded and half-longed for the monthly delivery of Byte, since I knew that I would soon be swept away on this irresistible and unstoppable torrent of high-tech logorrhea.

Visiting the site, I noticed the line "The Original Blog", linked to the following text:

I can make some claim to this being The Original Blog and Daybook. I certainly started keeping a day book well before most, and long before the term "blog" or Web Log was invented. BIX, the Byte exchange, preceeded the Web by a lot, and I also had a daily journal on GE Genie.

And in a flash, I realised why I had been mesmerised by but ambivalent about Pournelle's outpourings all those years ago. They did indeed form a proto-blog, with all the blog's virtues - a captivating first-person voice weaving a story over time - and all of its vices - a level of information way beyond what any sane person should really want to know, given the subject-matter.

Pournelle is right: he probably was the first blogger, but working on pre-Internet time - one posting a month, not one a day. However, it is hard to tell whether what we now know as blogs took off all those years later because of his pioneering example - or in spite of it.

JISC for Fun

I've written before about what seems to me the huge missed opportunity for free software in education. Of course, this is a two-way thing: as well as coders making more effort to serve the education sector, it would be nice to see education deploying more open source.

And hey presto, along comes the Joint Information Systems Committee (JISC), a government-funded UK body that offers advice and strategic guidance to further and higher education establishments, with a neat briefing paper on the very subject. What makes this doubly interesting is that last year it published a similar paper on open access: clearly things are beginning to click here.

What makes this trebly interesting - well, to me at least - is that I've been asked to speak at the Open Source and Sustainability conference in Oxford this April, organised by the JISC-funded OSS Watch.

"JISC for fun" as Linus almost said.

24 February 2006

Google's Creeping Cultural Imperialism

Another day, another Google launch.

As the official Google blog announced, the company is launching a pilot programme to digitise national archive content "and offer it to everyone in the world for free."

And what national archives might these be? Well, not just any old common-or-garden national archives, but "the National Archives", which as Google's blog says:

was founded with the express purpose of ... serving America by documenting our government and our nation.

Right, so these documents are fundamentally "serving America". A quick look at what's on offer reveals the United Newsreel Motion Pictures, a series which, according to the accompanying text, "was produced by the Office of War Information and financed by the U. S. government", and was "[d]esigned as a counter-propaganda medium."

So there we have it: this is (literally) vintage propaganda. And nothing wrong with that: everybody did it, and it's useful to be able to view how they did it. But as with the Google Print/Books project, there is a slight problem here.

When Google first started, it did not set out to become a search engine for US Web content: it wanted it all - and went a long way to achieving that, which is part of its power. But when it comes to books, and even more where films are concerned, there is just too much to hope to encompass; of necessity, you have to choose where to start, and where to concentrate your efforts.

Google, quite sensibly, has started with those nearest home, the US National Archives. But I doubt somehow that it will be rushing to add other nations' archives. Of course, those nations could digitise and index their own archives - but the results wouldn't be part of the Google collection, which would always have primacy, even if the indexed content were submitted to Google.

It's a bit like Microsoft's applications: however much governments tell the company to erect Chinese walls between the programmers working on Windows and those working on applications, there is bound to be some leakiness. As a result, Windows programs from Microsoft have always had an advantage over those from other companies. The same will happen with Google's content: anything it produces will inevitably be more tightly integrated into its search engine.

And so, wittingly or not, Google becomes an instrument of cultural imperialism, just like that nice Mr Chirac warned. The problem is that there is nothing so terribly wrong with what Google is doing, or even the way that it is doing it; but it is important to recognise that these little projects that it sporadically announces are not neutral contributions to the sum of the world's open knowledge, but come with very particular biases and knock-on effects.