open...: On the Necessity of Open Access and Open Data

One of the great things about open source is its transparency: you can't easily hide viruses or trojans, nor can you simply filch code from other people, as you can with closed source. Indeed, the accusations made from time to time that open source contains "stolen" code from other programs are deeply ironic, since it's almost certainly proprietary, closed software that has bits of thievery hidden deep within its digital bowels.

The same is true of open access and open data: when everything is out in the open, it is much easier to detect plagiarism or outright fraud. Conversely, restricting the distribution of online, searchable text or of the underlying data reduces the number of people checking it, and hence the likelihood that anyone will notice if something is amiss.

A nicely researched piece on Ars Technica provides a clear demonstration of this:

Despite the danger represented by research fraud, instances of manufactured data and other unethical behavior have produced a steady stream of scandal and retractions within the scientific community. This point has been driven home by the recent retraction of a paper published in the journal Science and the recognition of a few individuals engaged in dozens of acts of plagiarism in physics journals.

By contrast, in the case of arXiv's preprint holdings, catching this stuff is relatively easy thanks to their open, online nature:

Computer algorithms to detect duplications of text have already proven successful at detecting plagiarism in papers in the physical sciences. The arXiv now uses similar software to scan all submissions for signs of plagiarized text. As this report was being prepared, the publishing service Crossref announced that it would begin a pilot program to index the contents of the journals produced by a number of academic publishers in order to expose them for the verification of originality. Thus, catching plagiarism early should be getting increasingly easy for the academic world.

Note, though, that open access allows *anyone* to check for plagiarism, not just the "authorised" keepers of the copyrighted academic flame.
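To make that concrete, here is a minimal sketch of the kind of check anyone could run on openly available text. It is not the software that arXiv or Crossref uses, whose details the Ars Technica piece does not give; it simply compares overlapping five-word sequences ("shingles") in two texts, and the function names and the choice of five words are assumptions made for illustration.

```python
import re


def shingles(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """Return the set of overlapping n-word sequences found in text."""
    # Lowercase and keep only word-like tokens so punctuation and
    # capitalisation differences do not hide copied passages.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        # Texts shorter than n words produce no shingles to compare.
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A score near 1.0 suggests that large stretches of one text reappear in the other, while a score near 0.0 suggests little shared wording. Any threshold for "suspicious" would have to be chosen by the person doing the checking, and a high score is only a prompt for a closer look, not proof of plagiarism.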

Similarly, open data means anyone can take a peek, poke around and pick out problems:

How did Dr. Deb manage to create the impression that he had generated a solid data set? Roberts suggests that a number of factors were at play. Several aspects of the experiments allowed Deb to work largely alone. The mouse facility was in a separate building, and "catching a mouse embryo at the three-cell stage had him in from midnight until dawn," Dr. Roberts noted. Deb was also on his second post-doc position, a time where it was essential for him to develop the ability to work independently. The nature of the data itself lent it to manipulation. The raw data for these experiments consisted of a number of independent grayscale images that are normally assigned colors and merged (typically in Photoshop) prior to analysis.

Again, if the "raw data" were available to all, as good open notebook science dictates that they should be, any manipulation could be detected more readily.

Interestingly, this is not something that traditional "closed source" publishing can ever match using half-hearted fudges or temporary fixes, just as closed source programs can never match open ones for transparency. There is simply no substitute for openness.

08 August 2007
