06 October 2008

The Marvellous Mr. arXiv

Paul Ginsparg is one of the key players in the world of open access. Indeed, he was practising it online before it even had a name, when he set up the arXiv preprint server (originally known simply by its address "xxx.lanl.gov"), which has just celebrated its half-millionth deposit:

arXiv is the primary daily information source for hundreds of thousands of researchers in many areas of physics and related fields. Its users include the world's most prominent researchers in science, including 53 Physics Nobel Laureates, 31 Fields Medalists and 55 MacArthur Fellows, as well as people in countries with limited access to scientific materials. The famously reclusive Russian mathematician Grigori Perelman posted the proof for the 100-year-old Poincaré Conjecture solely in arXiv.

Journalists also use the repository extensively to prepare articles for the general public about newly released scientific results. It has long stood at the forefront of the open-access movement and served as the model for many other initiatives, including the National Institute of Health's PubMedCentral repository, and the many institutional DSpace repositories. arXiv is currently ranked the No. 1 repository in the world by the Webometrics Ranking of World Universities.

"arXiv began its operations before the World Wide Web, search engines, online commerce and all the rest, but nonetheless anticipated many components of current 'Web 2.0' methodology," said Cornell professor Paul Ginsparg, arXiv's creator. "It continues to play a leading role at the forefront of new models for scientific communication."

Given his pivotal role in open access, it's good that Ginsparg has expanded on that rather compressed history of his work in a fascinating romp through both the creation of arXiv and his own personal experience of the nascent Internet and Web.

Here are a few of the highlights:

I first used e-mail on the original ARPANET — a predecessor of the Internet — during my freshman year at Harvard University in 1973, while my more business-minded classmates Bill Gates and Steve Ballmer, the future Microsoft bosses, were already plotting ahead to ensure that our class would have the largest average net worth of any undergraduate year ever.


At the Aspen Center for Physics, in Colorado, in the summer of 1991, a stray comment from a physicist, concerned about e-mailed articles overrunning his disk allocation while travelling, suggested to me the creation of a centralized automated repository and alerting system, which would send full texts only on demand. That solution would also democratize the exchange of information, levelling the aforementioned research playing field, both internally within institutions and globally for all with network access.

Thus was born xxx.lanl.gov, initially an e-mail/FTP server.


In the autumn of 1992, a colleague at CERN e-mailed me: “Q: do you know the world-wide-web program?” I did not, but quickly installed WorldWideWeb.app, coincidentally written by Tim Berners-Lee for the same NeXT computer that I was using, and with whom I began to exchange e-mails. Later that autumn, I used it to help beta-test the first US Web server, set up by the library at the Stanford Linear Accelerator Center for use by the high-energy physics community.


That sceptical attitude regarding the potential efficacy of full-text searching carried over to my own website’s treatment of crawlers as unwanted nuisances. Seemingly out-of-control and anonymously run crawls sometimes resulted in overly vociferous complaints to network administrators from the offending domain. I was recently reminded of a long-forgotten incident involving test crawls from some unmemorably named stanford.edu-hosted machines in mid-1996, when both Sergey Brin and Larry Page graciously went out of their way to apologize to me in person at Google headquarters for their deeds all those years ago.


no legislation is required to encourage users to post videos to YouTube, whose incentive of instant gratification comes through making personal content publicly available (which parallels with the scholarly benefit of voluntary participation in the incipient version of arXiv in 1991.)

Fascinating tales from a fascinating life.

