05 April 2009

Who Can Put the "Open" in Open Science?

One of the great pleasures of blogging is that your mediocre post tossed off in a couple of minutes can provoke a rather fine one that obviously took some time to craft. Here's a case in point.

The other day I wrote "Open Science Requires Open Source". This drew an interesting comment from Stevan Harnad, pretty much the Richard Stallman of open access, as well as some tweets from Cameron Neylon, one of the leading thinkers on and practitioners of open science. Neylon also wrote a long and thoughtful reply to my post (including links to all our tweets, rigorous chap that he is). Most of it was devoted to pondering the extent to which scientists should be using open source:

It is easy to lose sight of the fact that for most researchers software is a means to an end. For the Open Researcher what is important is the ability to reproduce results, to criticize and to examine. Ideally this would include every step of the process, including the software. But for most issues you don’t need, or even want, to be replicating the work right down to the metal. You wouldn’t after all expect a researcher to be forced to run their software on an open source computer, with an open source chipset. You aren’t necessarily worried what operating system they are running. What you are worried about is whether it is possible to read their data files and reproduce their analysis. If I take this just one step further, it doesn’t matter if the analysis is done in MatLab or Excel, as long as the files are readable in Open Office and the analysis is described in sufficient detail that it can be reproduced or re-implemented.

...

Open Data is crucial to Open Research. If we don’t have the data we have nothing to discuss. Open Process is crucial to Open Research. If we don’t understand how something has been produced, or we can’t reproduce it, then it is worthless. Open Source is not necessary, but, if it is done properly, it can come close to being sufficient to satisfy the other two requirements. However it can’t do that without Open Standards supporting it for documenting both file types and the software that uses them.

The point that came out of the conversation with Glyn Moody for me was that it may be more productive to focus on our ability to re-implement rather than to simply replicate. Re-implementability, while an awful word, is closer to what we mean by replication in the experimental world anyway. Open Source is probably the best way to do this in the long term, and in a perfect world the software and support would be there to make this possible, but until we get there, for many researchers, it is a better use of their time, and the taxpayer’s money that pays for that time, to do that line fitting in Excel. And the damage is minimal as long as source data and parameters for the fit are made public. If we push forward on all three fronts, Open Data, Open Process, and Open Source then I think we will get there eventually because it is a more effective way of doing research, but in the meantime, sometimes, in the bigger picture, I think a shortcut should be acceptable.

I think these are fair points. Science needs reproducibility in terms of the results, but that doesn't imply that the protocols must be copied exactly. As Neylon says, the key is "re-implementability" - the fact that you *can* reproduce the results with the given information. Using Excel instead of OpenOffice.org Calc is not a big problem provided enough detail is given.
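To make Neylon's line-fitting example concrete, here is a minimal sketch of what re-implementability looks like in practice, written in Python purely for illustration (the data values and the "published" parameters are invented): if the source data and the parameters of the fit are made public, anyone can redo the analysis in an open tool, regardless of what software the original authors used.

    # A minimal sketch of "re-implementability": the published analysis might
    # have been a straight-line fit done in Excel, but with the source data
    # and fitted parameters public, anyone can re-implement and check it.
    # The data and "published" parameters below are hypothetical.

    import numpy as np

    # Source data as released alongside the paper (invented values).
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([0.1, 2.2, 3.9, 6.1, 8.0, 9.9])

    # Parameters the authors reported for their line fit (invented values).
    published_slope, published_intercept = 1.96, 0.13

    # Ordinary least-squares fit of a degree-1 polynomial.
    slope, intercept = np.polyfit(x, y, 1)

    print(f"re-implemented fit: slope={slope:.3f}, intercept={intercept:.3f}")
    print(f"published fit:      slope={published_slope:.3f}, "
          f"intercept={published_intercept:.3f}")

    # The check that matters: does the re-implementation reproduce the
    # reported result, within a reasonable tolerance?
    assert np.allclose([slope, intercept],
                       [published_slope, published_intercept], atol=0.05), \
        "re-implemented fit disagrees with the published parameters"

The point of the sketch is that nothing here depends on the original tool: the open data plus the reported parameters are sufficient for anyone to verify the result.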

However, it's easy to think of circumstances where *new* code is written to run on proprietary engines, making it simply impossible to check the logic hidden in those black boxes. In such circumstances, it is critical that open source be used at all levels, so that others can see what was done and how.

But another interesting point emerged from this anecdote from the same post:

Sometimes the problems are imposed from outside. I spent a good part of yesterday battling with an appalling, password protected, macroed-to-the-eyeballs Excel document that was the required format for me to fill in a form for an application. The file crashed Open Office and only barely functioned in Mac Excel at all. Yet it was required, in that format, before I could complete the application.

Now, this is a social issue: the fact that scientists are being forced by institutions to use proprietary software in order to apply for grants or whatever. Again, it might be unreasonable to expect young scientists to sacrifice their careers for the sake of principle (although Richard Stallman would disagree). But this is not a new situation. It's exactly the problem that open access faced in the early days, when scientists just starting out in their career were understandably reluctant to jeopardise it by publishing in new, untested journals with low impact factors.

The solution in that case was for established scientists to take the lead by moving their work across to open access journals, allowing the latter to gain in prestige until they reached the point where younger colleagues could take the plunge too.

So, I'd like to suggest something similar for the use of open source in science. When established scientists with some clout come across unreasonable requirements - like the need to use Excel - they should refuse. If enough of them put their foot down, the organisations that lazily adopt these practices will be forced to change. It might require a certain courage to begin with, but so did open access; and look where *that* is now...

Follow me on Twitter @glynmoody

8 comments:

  1. You say, "...this is a social issue: the fact that scientists are being forced by institutions to use proprietary software in order to apply for grants..." I don't see how proprietary software enters into it: the real problem seems to be red tape, which is independent of software licensing.

  2. Because the red tape comes bundled with proprietary software - it forces people to use closed source. The red tape could be minimal, but there could still be a simple requirement to use MS Office formats, for example.

  3. I use this link instead of the Wikipedia article when explaining the Impact Factor. It contains background (including the Wikipedia link), social realpolitik, and a thorough but concise summary of what's wrong with the IF.

  4. Anonymous, 9:08 pm

    Thanks for the link, and yes, putting together all the Twitter references was a complete pain! In the particular use case of Excel here it wasn't really red tape; indeed, it was probably used in an attempt to make things easier for people. It's just that in practice, once you take things out of the Windows/Office environment, they start to break pretty rapidly.

    I think there is a more general point, though, which is that the data is very much the key. Once we have access to more open data (and not just research data), then more transparent analysis systems will pop up around it - probably not using OSS by default, but built with good intentions by people who will be sympathetic to being guided in that direction.

  5. @Bill: thanks for the link - a useful discussion.

  6. @Cameron: I agree about the data - once we have enough of the right kind, then people will be free to use it with open source too, and calls to do so will be more plausible.

  7. the key is "re-implementability" - the fact that you *can* reproduce the results with the given information.

    I'd argue that the ideal situation is one where we don't have to waste time re-implementing. If you use open-source software and release your code, people can immediately build upon your work, rather than having to waste weeks or months re-implementing it before moving on.

    That said, with the sad state of affairs right now, I'd settle for getting sufficient detail to do re-implementations. As anyone who's tried can attest, it's not always possible or easy.

  8. @Chris: Indeed. But it seems we're a long way from that perfect world, alas....
