28 February 2006

Wanted: a Rosetta for the MegaWikipedia

As I write, Wikipedia has 997,131 articles - close to the magic, if totally arbitrary, one million (if we had eleven fingers, we'd barely be halfway to the equally magic 1,771,561): the MegaWikipedia.

Except that it's not there, really. The English-language Wikipedia may be approaching that number, but there are just eight other languages with more than 100,000 articles (German, French, Italian, Japanese, Dutch, Polish, Portuguese and Swedish), and 28 with more than 10,000. Most have fewer than 10,000.

Viewed globally, then, Wikipedia is nowhere near a million articles per language, averaged across all its editions.

The disparity between the holdings in different languages is striking; it is also understandable, given the way Wikipedia - and the Internet - arose. But the question is not so much Where are we? as Where do we go from here? How do we bring most of the other Wikipedias - not just five or six obvious ones - up to the same level of coverage as the English one?

Because if we don't, Wikipedia will never be that grand, freely-available summation of knowledge that so many hope for: instead, it will be a grand, freely-available summation of knowledge for precisely those who already have access to much of it. And the ones who actually need that knowledge most - and who currently have no means of accessing it short of learning another language (for which they probably have neither the time nor the money) - will be excluded once more.

Clearly, this global Wikipedia cannot be achieved simply by hoping that there will be enough volunteers to write all the million articles in all the languages. In any case, this goes against everything that free software has taught us - that the trick is to build on the work of others, rather than re-invent everything each time (as proprietary software is forced to do). This means that most of the articles in non-English tongues should be based on those in English. Not because English is "better" as a language, or even because its articles are "better": but simply because they are there, and they provide the largest foundation on which to build.

A first step towards this is to use machine translations, and the new Wikipedia search engine Qwika shows the advantages and limitations of taking this approach. Qwika lets you search in several languages through the main Wikipedias and through machine-translations of the English version. In effect, it provides a pseudo-conversion of the English Wikipedia to other tongues.

But what is needed is something more thoroughgoing, something formal - a complete system for expediting the translation of all of the English-language articles into other languages. And not just a few: the system needs to be such that any translator can use it to create new content based on the English items. The company behind Ubuntu, Canonical, already has a system that does something similar for people who are translating open source software into other languages. It's called, appropriately enough, Rosetta.

Now that the MegaWikipedia is in sight - for Anglophones, at least - it would be the perfect time to move beyond the successful but rather ad hoc approach currently taken to creating multilingual Wikipedia content, and to put the Net's great minds to work on the creation of something better - something scalable: a Rosetta for the MegaWikipedia.

What better way to celebrate what is, for all the qualifications above, truly a milestone in the history of open content, than by extending it massively to all the peoples of the world, and not just to those who understand English?


Anonymous said...

easy peasy- ESPERANTO

Anonymous said...

It is a huge ongoing project that is growing very fast.
True, it won't hit 1mil on average soon, but still it is a very useful resource in any language for everyone.

Glyn Moody said...

I couldn't agree more - I use it most days. I even browse through it just for the sheer pleasure of learning things.

But it seems to me that it needs to be taken to the next level - not by aiming for 2,000,000 English language articles, or whatever, but by moving the 1,000,000 out to other languages.

Doing that efficiently is going to be hard - but now is the time to think about it.

Anonymous said...


Seriously, what is a realistic plan for using esperanto in this context?

I'm not a linguist, but it seems that any constructed language is just bound to fail.

Excellent post, Glyn. You've set me off on a whole afternoon's research about this subject. I've been trying to get a translation project for WordPress going and when I discovered Rosetta I was blown away. I think that translation is an incredibly powerful tool that is deeply connected to the FOSS mentality, as you write.

It's an incredible shame that we, the Wikipedians, have been so slow to implement a real solution to this vast problem. With so many talented FOSS programmers around, I am hopeful that a solution is near ...

Glyn Moody said...

I'm glad it was of interest. I hope you manage to harness some of the collective brainpower out there for your WordPress idea.