
13 December 2007

Building the Zotero Commons

One of the many insights that have come out of open source is what might be called the "pebble on the cairn" effect - the idea that by combining many small, even negligible, individual efforts we can create something large and durable.

Here's a perfect example that builds on the fact that scholars very often scan books in the public domain during the course of their research, but then don't do anything with those scans. What if they were all brought together, and then fed into an OCR system?

If many researchers have had to scan rare documents or books for their own perusal, there’s a potential treasure trove of material that exists among their combined efforts. Rather than let all that scholarship rot, or waste away in data files, the university’s Center for History and New Media sees an opportunity to create an open archive of scholarly resources in the public domain.

...

In partnership with the Internet Archive, and with funding from the Andrew W. Mellon Foundation, the center is creating a way for scholars to upload existing data files to be optically scanned (to make them text-searchable) and stored in a database available to the public.

Even better is the fact that open source software can be used to realise this idea:

The vehicle for the new environment will be the Zotero plug-in for the Firefox browser, also developed by the center. The software stores Web pages, collects citations and lets scholars annotate and organize online documents. A new feature of the plug-in will allow people to collaborate and share materials through a dedicated server. Building on that functionality, according to Cohen, the system will allow scholars to drag and drop documents onto an icon in Zotero that essentially sends them to the Internet Archive for storage and free optical character recognition.

The eventual result of the project, called Zotero Commons, could be a reduced need for research trips, Cohen suggested.

(Via Open Access News.)

04 April 2006

Coughing Genomic Ink

One of the favourite games of scholars working on ancient texts that have come down to us from multiple sources is to create a family tree of manuscripts. The trick is to look for groups of textual divergences - a word added here, a mis-spelling there - to spot the gradual accretions, deletions and errors wrought by incompetent, distracted or bored copyists. Once the tree has been established, it is possible to guess what the original, founding text might have looked like.

You might think that this sort of thing is on the way out; on the contrary, though, it's an extremely important technique in bioinformatics - hardly a dusty old discipline. The idea is to treat genomes deriving from a common ancestor as a kind of manuscript, written using just the four letters - A, C, G and T - found in DNA.

Then, by comparing the commonalities and divergences, it is possible to work out which manuscripts/genomes came from a common intermediary, and hence to build a family tree. As with manuscripts, it is then possible to hazard a guess at what the original text - the ancestral genome - might have looked like.
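The idea of comparing divergences and guessing the ancestral text can be sketched in a few lines of Python. This is purely an illustration, not the method Haussler's group actually uses: the toy sequences, the Hamming-distance comparison and the majority-vote "consensus" are all my own simplifications of what are, in practice, far more sophisticated phylogenetic algorithms.

```python
from itertools import combinations
from collections import Counter

def hamming(a, b):
    """Count positions where two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

# Toy "manuscripts": three descendant genomes with accumulated copying errors.
genomes = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACCT",
    "seq3": "TCGTACCT",
}

# Pairwise divergences: pairs with fewer differences share a more
# recent common ancestor, which is what lets us sketch a family tree.
for (n1, s1), (n2, s2) in combinations(genomes.items(), 2):
    print(n1, n2, hamming(s1, s2))

def consensus(seqs):
    """Majority vote at each position -- a crude guess at the ancestral text."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

print(consensus(genomes.values()))  # -> ACGTACCT
```

Here seq1 and seq2 differ at one position while seq3 differs from both at a second position, so the tree groups seq1 and seq2 together; the majority vote then reconstructs a plausible "founding text" for all three.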

That, broadly, is the idea behind some research that David Haussler at the University of California at Santa Cruz is undertaking, and which is reported on in this month's Wired magazine (freely available thanks to the magazine's enlightened approach to publishing).

As I described in Digital Code of Life, Haussler played an important role in the closing years of the Human Genome Project:

Haussler set to work creating a program to sort through and assemble the 400,000 sequences grouped into 30,000 BACs [large-scale fragments of DNA] that had been produced by the laboratories of the Human Genome Project. But in May 2000, when one of his graduate students, Jim Kent, inquired how the programming was going, Haussler had to admit it was not going well. Kent had been a professional programmer before turning to research. His experience in writing code against deadlines, coupled with a strongly-held belief that the human genome should be freely available, led him to volunteer to create the assembly program in short order.

Kent later explained why he took on the task:

There was not a heck of a lot that the Human Genome Project could say about the genome that was more informative than 'it's got a lot of As, Cs, Gs and Ts' without an assembly. We were afraid that if we couldn't say anything informative, and thereby demonstrate 'prior art', much of the human genome would end up tied up in patents.

Using 100 800-MHz Pentiums - powerful machines in the year 2000 - running GNU/Linux, Kent was able to lash up a program, assemble the fragments and save the human genome for mankind.
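The core task Kent's program performed - stitching overlapping fragments back into one continuous sequence - can be illustrated with a greedy toy assembler. To be clear, this is not Kent's GigAssembler, just a minimal sketch of the general overlap-and-merge idea, using invented fragment data:

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(frags):
    """Repeatedly merge the pair with the largest overlap until one sequence remains."""
    frags = list(frags)
    while len(frags) > 1:
        # Find the ordered pair of distinct fragments with the biggest overlap.
        n, a, b = max(((overlap(a, b), a, b)
                       for a in frags for b in frags if a != b),
                      key=lambda t: t[0])
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[n:])  # merge, keeping the shared region once
    return frags[0]

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
print(greedy_assemble(reads))  # -> ACGTACGGTT
```

The real problem was vastly harder - hundreds of thousands of fragments, sequencing errors, and repeated regions that make overlaps ambiguous - which is why doing it against the clock on a farm of 100 machines was such a feat.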

Haussler's current research depends not just on the availability of the human genome, but also on all the other genomes that have been sequenced - the different manuscripts written in DNA that have come down to us. Using bioinformatics and even more powerful hardware than that available to Kent back in 2000, it is possible to compare and contrast these genomes, looking for tell-tale signs of common ancestors.

But the result is no mere dry academic exercise: if things go well, the DNA text that will drop out at the end will be nothing less than the genome of one of our ancient forebears. Even if Wired's breathless speculations about recreating live animals from the sequence seem rather wide of the mark - imagine trying to run a computer program recreated in a similar way - the genome on its own will be treasure enough. Certainly not bad work for those scholars who "cough in ink" in the world of open genomics.