Coughing Genomic Ink
One of the favourite games of scholars working on ancient texts that have come down to us from multiple sources is to create a family tree of manuscripts. The trick is to look for groups of textual divergences - a word added here, a mis-spelling there - to spot the gradual accretions, deletions and errors wrought by incompetent, distracted or bored copyists. Once the tree has been established, it is possible to guess what the original, founding text might have looked like.
You might think that this sort of thing is on the way out; on the contrary, though, it's an extremely important technique in bioinformatics - hardly a dusty old discipline. The idea is to treat genomes deriving from a common ancestor as a kind of manuscript, written using just the four letters - A, C, G and T - found in DNA.
Then, by comparing the commonalities and divergences, it is possible to work out which manuscripts/genomes came from a common intermediary, and hence to build a family tree. As with manuscripts, it is then possible to hazard a guess at what the original text - the ancestral genome - might have looked like.
That, broadly, is the idea behind some research that David Haussler at the University of California at Santa Cruz is undertaking, and which is reported on in this month's Wired magazine (freely available thanks to the magazine's enlightened approach to publishing).
As I described in Digital Code of Life, Haussler played an important role in the closing years of the Human Genome Project:Haussler set to work creating a program to sort through and assemble the 400,000 sequences grouped into 30,000 BACs [large-scale fragments of DNA] that had been produced by the laboratories of the Human Genome Project. But in May 2000, when one of his graduate students, Jim Kent, inquired how the programming was going, Haussler had to admit it was not going well. Kent had been a professional programmer before turning to research. His experience in writing code against deadlines, coupled with a strongly-held belief that the human genome should be freely available, led him to volunteer to create the assembly program in short order.
Kent later explained why he took on the task:There was not a heck of a lot that the Human Genome Project could say about the genome that was more informative than 'it's got a lot of As, Cs, Gs and Ts' without an assembly. We were afraid that if we couldn't say anything informative, and thereby demonstrate 'prior art', much of the human genome would end up tied up in patents.
Using 100 800 MHz Pentiums - powerful machines in the year 2000 - running GNU/Linux, Kent was able to lash up a program, assemble the fragments and save the human genome for mankind.
Haussler's current research depends not just on the availability of the human genome, but also on all the other genomes that have been sequenced - the different manuscripts written in DNA that have come down to us. Using bioinformatics and even more powerful hardware than that available to Kent back in 2000, it is possible to compare and contrast these genomes, looking for tell-tale signs of common ancestors.
But the result is no mere dry academic exercise: if things go well, the DNA text that will drop out at the end will be nothing less than the genome of one of our ancient forebears. Even if Wired's breathless speculations about recreating live animals from the sequence seem rather wide of the mark - imagine trying to run a computer program recreated in a similar way - the genome on its own will be treasure enough. Certainly not bad work for those scholars who "cough in ink" in the world of open genomics.