open...: digital code of life

05 May 2010

The GNU/Linux Code of Life

After I published Rebel Code in 2001, there was a natural instinct to think about writing another book (a natural masochistic instinct, I suppose, given the work involved). I decided to write about bioinformatics – the use of computers to store, search through, and analyse the billions of DNA letters that started pouring out of the genomics projects of the 1990s, culminating in the sequencing of the human genome in 2001.

One reason I chose this area was how closely the battle between free and closed-source software matched the fight to place genomic data in the public domain, for all to use, rather than having it locked up in proprietary databases and enclosed by gene patents. As I like to say, Digital Code of Life is really the same story as Rebel Code, with just a few words changed.

Another reason for the similarity between the stories is the fact that genomes can be considered as a kind of program – the “digital code” of my title. As I wrote in the book:

In 1953, computers were so new that the idea of DNA as not just a huge digital store but a fully-fledged digital program of instructions was not immediately obvious. But this was one of the many profound implications of Watson and Crick's work. For if DNA was a digital store of genetic information that guided the construction of an entire organism from the fertilised egg, then it followed that it did indeed contain a preprogrammed sequence of events that created that organism – a program that ran in the fertilised cell, albeit one that might be affected by external signals. Moreover, since a copy of DNA existed within practically every cell in the body, this meant that the program was not only running in the original cell but in all cells, determining their unique characteristics.

That characterisation of the genome is something of a cliché these days, but back in 2003, when I wrote Digital Code of Life, it was less common. Of course, the interesting question is: to what extent is the genome *really* like an operating system? What are the similarities and differences? That's what a bunch of researchers wanted to find out by comparing the Linux kernel's control structure to that of the bacterium Escherichia coli:

The genome has often been called the operating system (OS) for a living organism. A computer OS is described by a regulatory control network termed the call graph, which is analogous to the transcriptional regulatory network in a cell. To apply our firsthand knowledge of the architecture of software systems to understand cellular design principles, we present a comparison between the transcriptional regulatory network of a well-studied bacterium (Escherichia coli) and the call graph of a canonical OS (Linux) in terms of topology and evolution.

We show that both networks have a fundamentally hierarchical layout, but there is a key difference: The transcriptional regulatory network possesses a few global regulators at the top and many targets at the bottom; conversely, the call graph has many regulators controlling a small set of generic functions. This top-heavy organization leads to highly overlapping functional modules in the call graph, in contrast to the relatively independent modules in the regulatory network.

We further develop a way to measure evolutionary rates comparably between the two networks and explain this difference in terms of network evolution. The process of biological evolution via random mutation and subsequent selection tightly constrains the evolution of regulatory network hubs. The call graph, however, exhibits rapid evolution of its highly connected generic components, made possible by designers' continual fine-tuning. These findings stem from the design principles of the two systems: robustness for biological systems and cost effectiveness (reuse) for software system.
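The contrast the abstract draws between the two hierarchies can be made concrete with a toy example. What follows is only a sketch in Python, not the paper's method: the node names and edges are invented for illustration, and `layer_profile` is a hypothetical helper that simply counts nodes with outgoing edges only (the top of the hierarchy) and nodes with incoming edges only (the bottom).

```python
def layer_profile(edges):
    """Return (pure regulators, pure targets) for a directed graph.

    A pure regulator has only outgoing edges (top of the hierarchy);
    a pure target has only incoming edges (bottom of the hierarchy).
    """
    out_deg, in_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    nodes = set(out_deg) | set(in_deg)
    top = sum(1 for n in nodes if n in out_deg and n not in in_deg)
    bottom = sum(1 for n in nodes if n in in_deg and n not in out_deg)
    return top, bottom


# Invented toy "regulatory network": two global regulators, five target genes.
regulatory = [("regulator_a", f"gene_{i}") for i in range(4)]
regulatory += [("regulator_b", "gene_3"), ("regulator_b", "gene_4")]

# Invented toy "call graph": five callers sharing two generic functions.
call_graph = [(f"caller_{i}", "generic_copy") for i in range(5)]
call_graph += [("caller_0", "generic_log"), ("caller_3", "generic_log")]

print("regulatory network (top, bottom):", layer_profile(regulatory))
print("call graph (top, bottom):", layer_profile(call_graph))
```

In the toy regulatory network the top layer is small and the bottom is wide; in the toy call graph it is the other way round, which is the "top-heavy" shape the researchers describe for Linux.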

The paper's well worth reading, but if you find it heavy going (it's really designed for bioinformaticians and their ilk), there's an excellent, easy-to-read summary and analysis by Carl Zimmer in Discover magazine. Alternatively, you could just buy a copy of Digital Code of Life...

Follow me @glynmoody on Twitter or identi.ca.

21 July 2009

Has Google Forgotten Celera?

One of the reasons I wrote my book Digital Code of Life was that the battle between the public Human Genome Project and the privately-funded Celera mirrored so closely the battle between free software and Microsoft - with the difference that it was our genome that was at stake, not just a bunch of bits. The fact that Celera ultimately failed in its attempt to sequence and patent vast chunks of our DNA was the happiest of endings.

It seems someone else knows the story:

Celera was the company founded by Craig Venter, and funded by Perkin Elmer, which played a large part in sequencing the human genome and was hoping to make a massively profitable business out of selling subscriptions to genome databases. The business plan unravelled within a year or two of the publication of the first human genome. With hindsight, the opponents of Celera were right. Science is making and will make much greater progress with open data sets.

Here are some rea[s]ons for thinking that Google will be making the same sort of mistake as Celera if it pursues the business model outlined in its pending settlement with the AAP and the Author's Guild....

Thought-provoking stuff, well worth a read.

27 July 2008

The Church of Openness

In Digital Code of Life, I explained at length - some would say at excessive length - how the Human Genome Project was a key early demonstration of the transformative power of openness. Here's one of the key initiators of that project, George Church, who wants to open up genomics even more. Why? Because:

Exponentials don't just happen. In Church's work, they proceed from two axioms. The first is automation, the idea that by automating human tasks, letting a computer or a machine replicate a manual process, technology becomes faster, easier to use, and more popular. The second is openness, the notion that sharing technologies by distributing them as widely as possible with minimal restrictions on use encourages both the adoption and the impact of a technology.

And Church believes in openness so much, he's even applying it to his sequencer:

In the past three years, more companies have joined the marketplace with their own instruments, all of them driving toward the same goal: speeding up the process of sequencing DNA and cutting the cost. Most of the second-generation machines are priced at around $500,000. This spring, Church's lab undercut them all with the Polonator G.007 — offered at the low, low price of $150,000. The instrument, designed and fine-tuned by Church and his team, is manufactured and sold by Danaher, an $11 billion scientific-equipment company. The Polonator is already sequencing DNA from the first 10 PGP volunteers. What's more, both the software and hardware in the Polonator are open source. In other words, any competitor is free to buy a Polonator for $150,000 and copy it. The result, Church hopes, will be akin to how IBM's open-architecture approach in the early '80s fueled the PC revolution.

22 July 2008

DNA = Do Not Ask

I wrote about this in Digital Code of Life, four years ago:

The Switzerland-based company says they can use a $199 DNA test (compare to $1,000 for 23andMe) to help you find your perfect match, statistically speaking. They’ve analyzed “hundreds of couples” and have determined the genetic patterns found in successful relationships. Based on their algorithm and your DNA, they’ll determine the probability for a satisfying and long-lasting relationship between two people.

OK, for certain diseases this is wise; for most - and certainly for relationships - it is not, if you think about the deeper implications of what's going on (see book for more....)

21 April 2008

Why You Should Boycott the UK Biobank

I first came across proposals for the UK Biobank when I was writing Digital Code of Life in 2004. It's an exciting idea:

UK Biobank aims to study how the health of 500,000 people, currently aged 40-69, from all around the UK is affected by their lifestyle, environment and genes. The purpose of this major project is to improve the prevention, diagnosis and treatment of a wide range of illnesses (such as cancer, heart disease, diabetes, dementia, and joint problems) and to promote health throughout society.

By analysing answers, measurements and samples collected from participants, researchers may be able to work out why some people develop particular diseases while others do not. This should help us to find new ways to prevent early death and disability from many different diseases.

It's all about scaling: when you have vast amounts of information about populations, you can find out all kinds of correlations that would otherwise be obscured.

But as I noted in my book:

Meanwhile, the rise of biobanks - massive collections of DNA that may, like those in Iceland and Estonia, encompass an entire nation - will create tempting targets for data thieves.

This was well before the UK government started losing data like a leaky tap. Naturally, the UK Biobank has something to say on this issue:

Access is kept to a minimum. Very few staff have access to the key code. The computers which hold your information are protected by industry strength firewalls and are tested, so they are safe from hackers.

Sigh. Let's hope they know more about medical research than they do about computer security.

But such security intrusions are not my main concern here. Again, as I wrote four years ago:

Governments do not even need to resort to underhand methods: they can simply arrogate to themselves the right to access such confidential information wherever it is stored. One of the questions addressed by the FAQ of a biobank involving half a million people, currently under construction in the United Kingdom, is: "Will the police have access to the information?" The answer - "only under court order" - does not inspire confidence.

I gathered from this blog post that invites are now going out, so I was interested to see what the UK Biobank has to say on the subject now that it has had time to reflect on matters:

Will the police have access to the information?

We will not grant access to the police, the security services or to lawyers unless forced to do so by the courts (and, in some circumstances, we would oppose such access vigorously).

"In some circumstances" - well, thanks a bunch. Clearly, nothing has changed here. The UK government will be able to waltz in anytime it wants and add those tempting half a million DNA profiles to the four million it already has. After all, if you have nothing to hide, you can't possibly object.

Given the UK government's obsession with DNA profiles, and its contempt for any idea of privacy, you would be mad to sign up for the UK Biobank at present. Once your DNA is there (in the form of a blood sample), the only thing keeping it out of the government's hands is a quick vote in a supine Parliament.

Much as I'd like to support this idea, I won't have anything to do with it until our glorious leaders purge the current DNA database of the millions of innocent people - and *children* - whose DNA it holds, and show themselves even vaguely trustworthy with something as precious and quintessential as our genomes. And if the UK Biobank wants any credibility with the people whose help it needs, it would be saying the same thing.

21 February 2008

Welcome to ... The Spittoon

Last night I had the pleasure - and privilege - of attempting to hack the minds of a roomful of young scientists. It was my usual Digital Code of Life riff, and in the course of preparing my thoughts I wandered over to the 23andMe site. This, you will recall, is:

a web-based service that helps you read and understand your DNA. After providing a saliva sample using an at-home kit, you can use our interactive tools to shed new light on your distant ancestors, your close family and most of all, yourself.

It is also the company set up by the wife of one of the Google founders - you can join the dots yourself.

But one thing I'd not come across before was the company's blog - called, rather charmingly, The Spittoon....

25 January 2008

Genomics Goes Read-Write

One of Larry Lessig's favourite tropes is that we live in a read-write world these days, where creation is just as important as consumption. Well, hitherto, genomics has been pretty much read-only: you could sequence the DNA of an organism, but creating entire genomes of complex organisms (such as bacteria) has been too tricky. Now that nice Dr Venter says he's gone and done it:

A team of 17 researchers at the J. Craig Venter Institute (JCVI) has created the largest man-made DNA structure by synthesizing and assembling the 582,970 base pair genome of a bacterium, Mycoplasma genitalium JCVI-1.0. This work, published online today in the journal Science by Dan Gibson, Ph.D., et al, is the second of three key steps toward the team’s goal of creating a fully synthetic organism. In the next step, which is ongoing at the JCVI, the team will attempt to create a living bacterial cell based entirely on the synthetically made genome.

The team achieved this technical feat by chemically making DNA fragments in the lab and developing new methods for the assembly and reproduction of the DNA segments. After several years of work perfecting chemical assembly, the team found they could use homologous recombination (a process that cells use to repair damage to their chromosomes) in the yeast Saccharomyces cerevisiae to rapidly build the entire bacterial chromosome from large subassemblies.

He even gives some details (don't try this at home):

The process to synthesize and assemble the synthetic version of the M. genitalium chromosome began first by resequencing the native M. genitalium genome to ensure that the team was starting with an error free sequence. After obtaining this correct version of the native genome, the team specially designed fragments of chemically synthesized DNA to build 101 “cassettes” of 5,000 to 7,000 base pairs of genetic code. As a measure to differentiate the synthetic genome versus the native genome, the team created “watermarks” in the synthetic genome. These are short inserted or substituted sequences that encode information not typically found in nature. Other changes the team made to the synthetic genome included disrupting a gene to block infectivity. To obtain the cassettes the JCVI team worked primarily with the DNA synthesis company Blue Heron Technology, as well as DNA 2.0 and GENEART.

From here, the team devised a five stage assembly process where the cassettes were joined together in subassemblies to make larger and larger pieces that would eventually be combined to build the whole synthetic M. genitalium genome. In the first step, sets of four cassettes were joined to create 25 subassemblies, each about 24,000 base pairs (24kb). These 24kb fragments were cloned into the bacterium Escherichia coli to produce sufficient DNA for the next steps, and for DNA sequence validation.

The next step involved combining three 24kb fragments together to create 8 assembled blocks, each about 72,000 base pairs. These 1/8th fragments of the whole genome were again cloned into E. coli for DNA production and DNA sequencing. Step three involved combining two 1/8th fragments together to produce large fragments approximately 144,000 base pairs or 1/4th of the whole genome.

At this stage the team could not obtain half genome clones in E. coli, so the team experimented with yeast and found that it tolerated the large foreign DNA molecules well, and that they were able to assemble the fragments together by homologous recombination. This process was used to assemble the last cassettes, from 1/4 genome fragments to the final genome of more than 580,000 base pairs. The final chromosome was again sequenced in order to validate the complete accurate chemical structure.
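The arithmetic of that staged assembly is easy to sanity-check. Here's a rough sketch in Python that takes only the genome length and the piece counts quoted above, and works out the average size of a piece at each stage. It ignores any overlap between neighbouring pieces, which the excerpt doesn't quantify, so treat the results as ballpark figures; the names `STAGES` and `average_sizes` are mine, not JCVI's.

```python
GENOME_BP = 582_970  # genome length quoted in the JCVI announcement

# (stage label, number of pieces at that stage), as stated in the excerpt
STAGES = [
    ("cassettes", 101),
    ("24kb subassemblies", 25),
    ("1/8 genome blocks", 8),
    ("1/4 genome fragments", 4),
    ("whole genome", 1),
]


def average_sizes(genome_bp, stages):
    """Average piece size per stage (integer division, overlaps ignored)."""
    return [(label, count, genome_bp // count) for label, count in stages]


for label, count, avg in average_sizes(GENOME_BP, STAGES):
    print(f"{label:>22}: {count:>3} pieces, ~{avg:,} bp each")
```

The averages land inside the quoted 5,000 to 7,000 base-pair range for cassettes, and close to the quoted 24kb, 72kb and 144kb figures for the later stages.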

But the real kicker was this comment:

“This is an exciting advance for our team and the field. However, we continue to work toward the ultimate goal of inserting the synthetic chromosome into a cell and booting it up to create the first synthetic organism,” said Dan Gibson, lead author.

Yup, you read that correctly: we're talking about porting and then *booting-up* an artificial genome, aka digital code of life.

10 June 2007

The Bad Boy of Genomics Strikes Again

When I was writing Digital Code of Life, I sought to be scrupulously fair to Craig Venter, who was often demonised for his commercial approach to science. In fact, it seemed to me he had often gone out of his way to make the results of his work available.

So it's with some sadness that I note that the "Bad Boy of Genomics" epithet seems justified in this more recent case:

A research institute has applied for a patent on what could be the first largely artificial organism. And people should be alarmed, claims an advocacy group that is trying to shoot down the bid.

...

The artificial organism, a mere microbe, is the brainchild of researchers at the Rockville, Md.-based J. Craig Venter Institute. The organization is named for its founder and CEO, the geneticist who led the private sector race to map the human genome in the late 1990s.

The researchers filed their patent claim on the artificial organism and on its genome. Genetically modified life forms have been patented before; but this is the first patent claim for a creature whose genome might be created chemically from scratch, Mooney said.

This is problematic on a number of levels. For a start, it shouldn't be possible to patent DNA, since it is not an invention. Simply combining existing sequences is not an invention either. There is also the worry that what is being created here is the first genomic operating system: locking others out with patents means repeating all the mistakes that have been made in some jurisdictions by allowing the patenting of conventional software.

23 May 2007

Googling the Genome, Part II

23andMe is a privately held company developing new ways to help you make sense of your own genetic information.

Even though your body contains trillions of copies of your genome, you've likely never read any of it. Our goal is to connect you to the 23 paired volumes of your own genetic blueprint (plus your mitochondrial DNA), bringing you personal insight into ancestry, genealogy, and inherited traits. By connecting you to others, we can also help put your genome into the larger context of human commonality and diversity.

Toward this goal, we are building on recent advances in DNA analysis technologies to enable broad, secure, and private access to trustworthy and accurate individual genetic information. Combined with educational and scientific resources with which to interpret and understand it, your genome will soon become personal in a whole new way.

Nothing special there, of course. What makes this news is the following:

Google said it had invested $3.9 million in the company, called 23andMe Inc., giving the Mountain View, California-based Google a minority stake in the start-up, according to a filing with the U.S. Securities and Exchange Commission.

I wrote about this three years ago, but purely theoretically. Be very afraid. (Via TechCrunch.)

19 March 2007

Open Knowledge, Open Greenery and Modularity

On Saturday I attended the Open Knowledge 1.0 meeting, which was highly enjoyable from many points of view. The location was atmospheric: next to Hawksmoor's amazing St Anne's church, which somehow manages the trick of looking bigger than its physical size, inside the old Limehouse Town Hall.

The latter had a wonderfully run-down, almost Dickensian feel to it; it seemed rather appropriate as a gathering place for a ragtag bunch of ne'er-do-wells: geeks, wonks, journos, activists and academics, all with dangerously powerful ideas on their minds, and all more dangerously powerful for coming together in this way.

The organiser, Rufus Pollock, rightly placed open source squarely at the heart of all this, and pretty much rehearsed all the standard stuff this blog has been wittering on about for ages: the importance of Darwinian processes acting on modular elements (although he called the latter atomisation, which seems less precise, since atoms, by definition, cannot be broken up, but modules can, and often need to be for the sake of increased efficiency).

One of the highlights of the day for me was a talk by Tim Hubbard, leader of the Human Genome Analysis Group at the Sanger Institute. I'd read a lot of his papers when writing Digital Code of Life, and it was good to hear him run through pretty much the same parallels between open genomics and the other opens that I've made and make. But he added a nice twist towards the end of his presentation, where he suggested that things like the doomed NHS IT programme might be saved by the use of Darwinian competition between rival approaches, each created by local NHS groups.

The importance of the ability to plug into Darwinian dynamics also struck me when I read this piece by Jamais Cascio about carbon labelling:

In order for any carbon labeling endeavor to work -- in order for it to usefully make the invisible visible -- it needs to offer a way for people to understand the impact of their choices. This could be as simple as a "recommended daily allowance" of food-related carbon, a target amount that a good green consumer should try to treat as a ceiling. This daily allowance doesn't need to be a mandatory quota, just a point of comparison, making individual food choices more meaningful.

...

This is a pattern we're likely to see again and again as we move into the new world of carbon footprint awareness. We'll need to know the granular results of actions, in as immediate a form as possible, as well as our own broader, longer-term targets and averages.

Another way of putting this is that for these kinds of ecological projects to work, there needs to be a feedback mechanism so that people can see the results of their actions, and then change their behaviour as a result. This is exactly like open source: the reason the open methodology works so well is that a Darwinian winnowing can be applied to select the best code/content/ideas/whatever. But that is only possible when there are appropriate metrics that allow you to judge which actions are better, a reference point of the kind Cascio is writing about.

By analogy, we might call this particular kind of environmental action open greenery. It's interesting to see that here, too, the basic requirement of modularity turns out to be crucially important. In this case, the modularity is at the level of the individual's actions. This means that we can learn from other people's individual success, and improve the overall efficacy of the actions we undertake.

Without that modularity - call it closed-source greenery - everything is imposed from above, without explanation or the possibility of local, personal, incremental improvement. That may have worked in the 20th century, but given the lessons we have learned from open source, it's clearly not the best way.

02 February 2007

Genetic Information Nondiscrimination Act of 2007

Because of this:

(1) Deciphering the sequence of the human genome and other advances in genetics open major new opportunities for medical progress. New knowledge about the genetic basis of illness will allow for earlier detection of illnesses, often before symptoms have begun. Genetic testing can allow individuals to take steps to reduce the likelihood that they will contract a particular disorder. New knowledge about genetics may allow for the development of better therapies that are more effective against disease or have fewer side effects than current treatments. These advances give rise to the potential misuse of genetic information to discriminate in health insurance and employment.

(2) The early science of genetics became the basis of State laws that provided for the sterilization of persons having presumed genetic `defects' such as mental retardation, mental disease, epilepsy, blindness, and hearing loss, among other conditions. The first sterilization law was enacted in the State of Indiana in 1907. By 1981, a majority of States adopted sterilization laws to `correct' apparent genetic traits or tendencies. Many of these State laws have since been repealed, and many have been modified to include essential constitutional requirements of due process and equal protection. However, the current explosion in the science of genetics, and the history of sterilization laws by the States based on early genetic science, compels Congressional action in this area.

Everybody needs something like this:

legislation establishing a national and uniform basic standard is necessary to fully protect the public from discrimination and allay their concerns about the potential for discrimination, thereby allowing individuals to take advantage of genetic testing, technologies, research, and new therapies.

And beyond "simple" discrimination, there's going to be stuff like this:

Consider a not-too-distant future in which personal genomes are readily available. For those with relations affected by a serious medical condition, this will conveniently provide them with any genetic test they need. But it will also offer the rest of us information about our status for these and other, far less serious, autosomal recessive disorders that might similarly manifest themselves in children if we married a fellow carrier.

A bioinformatics program running on a PC could easily check our genomes for all genes associated with the autosomal recessive disorders that had been identified so far. Regular software updates downloaded from the internet - like those for anti-virus programs - would keep our search software abreast of the latest medical research. The question is, how potentially serious does a variant gene's effects have to be for us to care about its presence in our DNA? Down to what level should we be morally obliged to tell our prospective partners - or have the right to ask about?

And just when is the appropriate moment to swap all these delicate DNA details? Before getting married? Before going to bed together? Before even exchanging words? Will there one day be a new class of small, wireless devices that hold our personal genomic profile in order to carry out discreet mutual compatibility checks on nearby potential partners: a green light for genomic joy, a red one for excessive recessive risks?

Given the daunting complexity of the ethical issues raised by knowing the digital code of life in detail, many may opt for the simplest option: not to google it. But even if you refuse to delve within your genome, there are plenty of others who will be keen to do so. Employers and insurance companies would doubtless love to scan your data before giving you a job or issuing a policy. And if your children and grandchildren have any inconvenient or expensive medical condition that they have inherited from one side of the family, they might like to know which - not least, to ensure that they sue the right person.

Another group that is likely to be deeply interested in googling your genome are the law enforcement agencies. Currently, DNA is used to match often microscopic samples found at the scene of a crime, for example, with those taken from suspects, by comparing special, short regions of it - DNA "fingerprints". The better the match, the more likely it is that they came from the same individual. Low-cost sequencing technologies would allow DNA samples to be analysed completely - not just to give patterns for matching, but even rough indications of physical and mental characteristics - convenient for rounding up suspects. This is a rather hit-and-miss approach, though, where success depends on pulling in the right people. How much more convenient it would be if everyone's DNA were already to hand, allowing a simple text matching process to find the guilty party.
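To see how little is involved in the "simple text matching process" mentioned at the end of that excerpt, here's a minimal sketch in Python. Everything in it is made up for illustration: the sequences, the `person_*` labels and the `find_matches` helper are all hypothetical. Real forensic profiling compares a set of short repeat markers rather than searching raw sequence, so this shows only the text-matching idea, not how a lab would do it.

```python
def find_matches(sample, database):
    """Return the names whose stored sequence contains the sample fragment."""
    sample = sample.upper()
    return sorted(
        name for name, seq in database.items() if sample in seq.upper()
    )


# Invented toy database of stored sequences (hypothetical names and data).
DATABASE = {
    "person_a": "ACGTTGCAAGGCT",
    "person_b": "TTGACCGTAAGCT",
    "person_c": "ACGTAGCAAGGTT",
}

# A short fragment matches more than one person...
print(find_matches("gcaagg", DATABASE))
# ...while a longer one narrows the field.
print(find_matches("TTGCAAGGCT", DATABASE))
```

The short fragment turns up in two of the three toy records and the longer one in just one, which is the point: once everyone's sequence is on file, finding "the guilty party" is a one-line search, and a short or degraded sample will happily finger the wrong person.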

Nobody ever said digital DNA was going to be easy.

29 January 2007

'Omics - Oh My!

One of the fun aspects of writing my book Digital Code of Life was grappling with all the 'omics: not just genomics, but proteomics and metabolomics too. Here's what I wrote about the latter:

"Metabolome" is the name given to all the molecules - not just the proteins - involved in metabolic processes within a given cell.

And here's the big news:

Scientists in Alberta say they are the first team to finish a draft of the chemical equivalent of the human genome, paving the way for faster, cheaper diagnoses of disease.

The researchers on Wednesday said the Human Metabolome Project, led by the University of Alberta, has listed and described some 2,500 chemicals found in or made by the body (three times as many as expected), and double that number of substances stemming from drugs and food. The chemicals, known as metabolites, represent the ingredients of life just as the human genome represents the blueprint of life.

This does seem to differ from my definition, but hey, my shoulders are broad.
(Via Slashdot.)

31 August 2006

Books Be-Googled

I've not really been paying much attention to the Google Book Search saga. Essentially, I'm totally in favour of what they're up to, and regard publishers' whines about copyright infringement as pathetic and wrong-headed. I'm delighted that Digital Code of Life has been scanned and can be searched.

It seems obvious to me that scanning books will lead to increased sales, since one of the principal obstacles to buying a book is being uncertain whether it's really what you want. Being able to search for a few key phrases is a great way to try before you buy.

Initially, I wasn't particularly excited by the news that Google Book Search now allows public domain books to be downloaded as images (not as text files - you need Project Gutenberg for that). But having played around with it, I have to say that I'm more impressed: being able to see the scan of venerable and often obscure books is a delightful experience.

It is clearly an important step in the direction of making all knowledge available online. Let's hope a few publishers will begin to see the project in the same light, and collaborate with the thing rather than fight it reflexively.

13 July 2006

Open Source Evolution

Carl Zimmer is one of the best science writers around today. He manages to combine technical accuracy with a writing style that never gets in the way of his argument. So I was delighted to see this piece on his blog, entitled: "In the Beginning Was Linux?", which includes the following section:

Biologists have long recognized some striking parallels between genes and software. Genes stored information in a language of DNA, with the four nucleotides serving as its alphabet. A genetic code allowed cells to translate the information in genes into the separate language of proteins, which used an alphabet of twenty amino acids. From one generation to the next, mutations introduced slight tweaks to the software. Sex combined different versions of subroutines. If the software performed better--in the sense that an organism had more reproductive success--the changes might become incorporated into the genome across an entire species.

Now, this is amusingly close to the opening chapter (and central idea) of Digital Code of Life, but Zimmer goes further by drawing on the theories of Carl Woese, one of the most original thinkers about how life might have evolved in the earliest stages. It would take too long to explain the details to non-biologists, so I won't attempt it here - not least because Zimmer has already done so with customary clarity in his post. Do read it.
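The genes-as-software parallel in Zimmer's paragraph is concrete enough to sketch in a few lines of Python. The codon table below is deliberately partial: it holds just a handful of entries from the standard genetic code, where the full table has 64. The example sequences and the `translate` helper are my own inventions for illustration.

```python
# A small subset of the standard genetic code (DNA codon -> amino acid letter).
# "*" marks a stop codon. The full table has 64 entries.
CODON_TABLE = {
    "ATG": "M",  # methionine, the usual start
    "TTT": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "AAA": "K",  # lysine
    "GAA": "E",  # glutamic acid
    "TGG": "W",  # tryptophan
    "TAA": "*",  # stop
}


def translate(dna):
    """Translate a DNA string codon by codon, stopping at a stop codon.

    Codons missing from the partial table are rendered as "?".
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE.get(dna[i:i + 3].upper(), "?")
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


original = "ATGTTTGGCAAATAA"
mutated = "ATGTTTGGCGAATAA"  # a single-letter "tweak": AAA -> GAA

print(translate(original))
print(translate(mutated))
```

One changed letter in the four-letter alphabet swaps one amino acid in the twenty-letter one, which is about as small as a patch to the software can get.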

10 July 2006

It's a Dog's Life

One of the fascinating things that I learned when I was writing Digital Code of Life is that many diseases - such as obesity, heart disease, diabetes, certain kinds of cancers and neurodegenerative disorders - are not commonly found in the great apes. As I put it then:

In a sense, the human genome has evolved certain advantageous characteristics so quickly that it has not been debugged properly. The major diseases afflicting humans are the outstanding faulty modules in genomic software that Nature was unable to fix in the time since humans evolved as a species.

Another extraordinary fact is that dogs are even more susceptible to these same diseases than humans are, and for the same reason: the domestic breeds have arisen so recently, and from limited populations through inbreeding. But if dogs are like us, only more so, then they also hold out the hope that by investigating the root causes of their afflictions we might be able to understand our own better.

I see that further steps in this direction are now being taken:

Melbourne researchers are examining the DNA of dogs in a research project aiming at determining the genetic causes of common pet diseases – and to provide a model for combating diseases such as diabetes and multiple sclerosis in humans.

06 May 2006

O Happy, Happy Digital Code

My book Digital Code of Life was partly about the battle to keep genomic and other bioinformatics information open. So it's good to see the very first public genomic database, now EMBL, spreading its wings and mutating into FELICS (Free European Life-science Information and Computational Services) with even more bioinformatics goodies freely available (thanks to a little help from the Swiss Institute of Bioinformatics, the University of Cologne, Germany, and the European Patent Office).

19 April 2006

Amazon Plays Tag, Blog and Wiki

For all its patent faults, Amazon.com is one of my favourite sites. It has repeatedly done the right thing when mistakes have been made with my orders, to the extent that I can even forgive them for doing the wrong thing when it comes to (IP) rights....

So I was interested to see that Amazon.com now lets users add tags to items: I first noticed this on Rebel Code, where some public-minded individual has kindly tagged it as open source, free software and linux. Clicking on one of these brings up a listing of other items similarly tagged (no surprise there). It also cross-references this with the customers who used this tag, and the other tags that are used alongside the tag you are viewing (a bit of overkill, this, maybe).

I was even more impressed to see a ProductWiki at the foot of the Rebel Code page (it's rather empty at the moment). This is in addition to the author's blog (which I don't have yet because Amazon insists on some deeply arcane rite to establish I am really the Glyn Moody who wrote Rebel Code and not his evil twin brother from a parallel universe). Mr. Bezos certainly seems to be engaging very fully with the old Web 2.0 stuff; it will be interesting to see how other e-commerce sites respond.

30 March 2006

Googling the Genome

I came across this story about Google winning an award as part of the "Captain Hook Awards for Biopiracy" taking place in the suitably piratical-sounding Curitiba, Brazil. The story links to the awards Web site - rather fetching in black, white and red - where there is a full list of the lucky 2006 winners.

I was particularly struck by one category: Most Shameful Act of Biopiracy. This must have been hard to award, given the large field to choose from, but the judges found a worthy winner in the shape of the US Government for the following reason:

For imposing plant intellectual property laws on war-torn Iraq in June 2004. When US occupying forces “transferred sovereignty” to Iraq, they imposed Order no. 84, which makes it illegal for Iraqi farmers to re-use seeds harvested from new varieties registered under the law. Iraq’s new patent law opens the door to the multinational seed trade, and threatens food sovereignty.

Google's citation for Biggest Threat to Genetic Privacy read as follows:

For teaming up with J. Craig Venter to create a searchable online database of all the genes on the planet so that individuals and pharmaceutical companies alike can ‘google’ our genes – one day bringing the tools of biopiracy online.

I think it unlikely that Google and Venter are up to anything dastardly here: from studying the background information - and from my earlier reading on Venter when I was writing Digital Code of Life - I think it is much more likely that they want to create the ultimate gene reference, but on a purely general, not personal, basis.

Certainly, there will be privacy issues - you won't really want to be uploading your genome to Google's servers - but that can easily be addressed with technology. For example, Google's data could be downloaded to your PC in encrypted form, decrypted by Google's client application running on your computer, and compared with your genome; the results could then be output locally, but not passed back to Google.
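To make the idea a little more concrete, here is a minimal sketch of the local-comparison step only. Everything in it is an assumption for illustration: the ReferenceVariant record, the compare_locally function, the gene names and the toy sequence are all hypothetical, and nothing here reflects any real Google or Venter software. The decryption step is left out and the sketch simply starts from reference data that has already been decrypted on your own machine. The point is that the comparison needs no network access at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceVariant:
    """One entry of the (hypothetical) downloaded reference data."""
    gene: str       # illustrative gene label, not a real identifier
    position: int   # 0-based offset into the personal sequence (assumed scheme)
    base: str       # the DNA letter of interest at that position
    note: str       # free-text annotation shipped with the reference


def compare_locally(genome: str, reference: list[ReferenceVariant]) -> list[ReferenceVariant]:
    """Return the reference entries that match the personal genome.

    Runs entirely in local memory: no sockets are opened and nothing
    is sent back to the provider of the reference data.
    """
    return [
        v for v in reference
        if 0 <= v.position < len(genome) and genome[v.position] == v.base
    ]


# Toy data standing in for the decrypted download and the user's own genome.
reference = [
    ReferenceVariant("GENE_A", 2, "G", "hypothetical annotation 1"),
    ReferenceVariant("GENE_B", 5, "T", "hypothetical annotation 2"),
    ReferenceVariant("GENE_C", 40, "A", "position beyond the toy sequence"),
]
my_genome = "ACGTACGT"

matches = compare_locally(my_genome, reference)
for m in matches:
    # Results are shown locally only.
    print(m.gene, "-", m.note)
```

A real system would of course have to deal with genome-scale files and proper cryptography, but the privacy property rests on the same design choice: the reference travels to the genome, not the genome to the reference.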

It is particularly painful for me to disagree with the Coalition Against Biopiracy, the organisation behind the awards, since their hearts are clearly in the right place - they even kindly cite my own 2004 Googling the Genome article in their background information to the Google award.

09 March 2006

The Dream of Open Data

Today's Guardian has a fine piece by Charles Arthur and Michael Cross about making data paid for by the UK public freely accessible to them. But it goes beyond merely detailing the problem, and represents the launch of a campaign called "Free Our Data". It's particularly good news that the unnecessary hoarding of data is being addressed by a high-profile title like the Guardian, since a few people in the UK Government might actually read it.

It is rather ironic that at a time when nobody outside Redmond disputes the power of open source, and when open access is almost at the tipping point, open data remains something of a distant dream. Indeed, it is striking how advanced the genomics community is in this respect. As I discovered when I wrote Digital Code of Life, most scientists in this field have been routinely making their data freely available since 1996, when the Bermuda Principles were drawn up. The first of these stated:

It was agreed that all human genomic sequence information, generated by centres funded for large-scale human sequencing, should be freely available and in the public domain in order to encourage research and development and to maximise its benefit to society.

The same should really be true for all kinds of large-scale data that require governmental-scale gathering operations. Since they cannot be feasibly gathered by private companies, such data ends up as a government monopoly. But trying to exploit that monopoly by crudely over-charging for the data is counter-productive, as the Guardian article quantifies. Let's hope the campaign gathers some momentum - I'll certainly be doing my bit.

Update: There is now a Web site devoted to this campaign, including a blog.