09 February 2010

Of Open Science and Open Source

Now here's an idea:

Computer code is also at the heart of a scientific issue. One of the key features of science is deniability: if you erect a theory and someone produces evidence that it is wrong, then it falls. This is how science works: by openness, by publishing minute details of an experiment, some mathematical equations or a simulation; by doing this you embrace deniability. This does not seem to have happened in climate research. Many researchers have refused to release their computer programs — even though they are still in existence and not subject to commercial agreements. An example is Professor Mann's initial refusal to give up the code that was used to construct the 1999 "hockey stick" model that demonstrated that human-made global warming is a unique artefact of the last few decades. (He did finally release it in 2005.)

Quite.

So, if you are publishing research articles that use computer programs, if you want to claim that you are engaging in science, the programs are in your possession and you will not release them then I would not regard you as a scientist; I would also regard any papers based on the software as null and void.

I'd go further: if you won't release them and *share* them, then you're not really a scientist, because science is inherently about sharing, not hoarding knowledge, whatever kind that may be. The fact that some of it may be in the form of computer code is a reflection of the fact that research is increasingly resting on digital foundations, nothing more.

Follow me @glynmoody on Twitter or identi.ca.

5 comments:

Anonymous said...

I don't agree. I don't know if you have worked in R&D or science industries, but it's not about your specific techniques. It's about data and equations.

As for climate change, it would be far preferable for the publisher of the results to publish the equations that the software applies to the data.

This allows independent testing and analysis of the equations and, with the raw data, lets other groups apply those equations in their own simulation and analysis.

A climate scientist is not necessarily a computer programmer, and to work through a complex program that someone else has written is not what they are good at.

But given the equations and raw data, another group of scientists can independently analyse the equations and method and confirm or disprove the initial findings.

The requirement of being able to independently repeat the experiment and confirm the analysis does not involve getting exactly the same raw data, feeding it into exactly the same analysis software, and seeing if you get exactly the same result.
Of course you will.

What is far better is that the original publisher of the findings provides the method of analysis (i.e. the equations) and makes the data and those equations available to other scientists.

This happens when the original scientist publishes his results in a peer-reviewed journal.

Trying to reconstruct the method of analysis (the equations) from the code base of the original software would be a nightmare, so it's not normally done that way.

Another example would be astronomy: complex telescopes require advanced software to collate and analyse the data, but it's the data and the method of gathering the data that is significant, not the software that drives the system.

But having 10,000 scientists do the same analysis on the same raw data with the same analysis software will not provide any more information than the first guy who did it.

It requires independent testing, that means independent analysis, even if the same underlying equations are employed.

Anonymous said...

The "butterfly effect" is the classic example, and it also relates to climate/weather.

Two supercomputers were set to perform large-scale weather analysis; the input data for the two supercomputers differed by a VERY VERY small amount (the beat of a butterfly's wing).

The result of the simulation was that one sim created a cyclone on the other side of the planet and the other did not predict a cyclone.

Therefore, "the difference of the beat of a butterfly wing" in the initial data produces greatly different results (with the same software and computer).
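To make that sensitivity concrete, here is a minimal sketch. It does not use a weather model at all: it assumes the logistic map with r = 4 as a toy stand-in for a chaotic system, and the starting value 0.2 and the perturbation of 1e-10 are arbitrary choices made only for illustration.

```python
# Toy illustration of sensitivity to initial conditions.
# Assumption: the logistic map x -> r*x*(1-x) with r = 4 stands in for a
# chaotic simulation; it is NOT a weather model.

def logistic_trajectory(x0, steps, r=4.0):
    """Iterate the logistic map `steps` times, starting from x0."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def max_gap(a, b):
    """Largest absolute difference between two trajectories."""
    return max(abs(x - y) for x, y in zip(a, b))

# Same "software", same "computer", inputs differing by a tiny amount.
baseline = logistic_trajectory(0.2, 60)
perturbed = logistic_trajectory(0.2 + 1e-10, 60)

early_gap = max_gap(baseline[:10], perturbed[:10])  # first few steps
late_gap = max_gap(baseline, perturbed)             # whole run

print(f"largest gap in the first 10 values: {early_gap:.3e}")
print(f"largest gap over the whole run:     {late_gap:.3e}")
```

The two runs stay close for the first few steps and then drift apart until they bear no resemblance to each other, even though nothing but the input changed.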

I was working a great deal with a professor (emeritus) on wind/water interaction (how wind creates waves and swell); we spent years instrumenting a large lake with wind- and wave-measuring equipment.

All to make ONE SMALL TERM of a complex equation for wind/water interaction slightly more accurate (specifically the shallow-water interactions).

So again, it's the equations that make for science, not the exact method of achieving the equations.

So there is no requirement to get the original developer's software, but merely the equation, which you then confirm with your own testing and methods.
Otherwise you're not really doing science, you're just parroting someone else's work, which is far less satisfactory and does not contribute to the scientific knowledge base.

Science is about looking at something in a new and different way, seeing if the underlying principles hold up, and finding different methods of proving the underlying theory.

If someone gets the same raw data and the same software, then what scientific investigation is he performing? None, really: he is just repeating exactly the same experiment as the first person.

This is not how science is done.

If the same raw data can be applied to two different analysis methods and it gives essentially the same results then it's a confirmation of the underlying theory. If the same raw data provides a greatly different result with two different methods of analysis, then you can question the analysis technique.
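As a rough sketch of that idea, the snippet below estimates a trend from one data set using two independent methods: ordinary least squares and the Theil-Sen estimator (the median of all pairwise slopes). The data points are made up purely for illustration, and the function names `ols_slope` and `theil_sen_slope` are my own, not taken from any real analysis.

```python
from itertools import combinations
from statistics import mean, median

def ols_slope(xs, ys):
    """Trend estimated by ordinary least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def theil_sen_slope(xs, ys):
    """Trend estimated as the median of the slopes between every pair of points."""
    points = list(zip(xs, ys))
    return median(
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if x2 != x1
    )

# Hypothetical "raw data": a line of slope 2 plus small hand-picked wiggles.
xs = list(range(10))
wiggle = [0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.4, -0.3, 0.1]
ys = [2 * x + 1 + w for x, w in zip(xs, wiggle)]

slope_a = ols_slope(xs, ys)
slope_b = theil_sen_slope(xs, ys)
print(f"least squares: {slope_a:.3f}   Theil-Sen: {slope_b:.3f}")
```

If the two estimates agree closely, that supports the underlying trend; if they disagreed badly, the first thing to question would be one of the analysis techniques.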

But performing the same experiment (exactly) over and over again using the same analysis methods, is what I would call NOT being a scientist at all.

Glyn Moody said...

I quite agree: the point is, we need the equations, data *and* software. If any of these is missing, it's not possible to check the working completely.

As you say, you also want to use different data sets, and maybe different software and equations, too, but the point is that to make any sensible comments about whether the science was done properly you do need the software as well.

Anonymous said...

Yes, well said: it's about initially being able to repeat the initial experiment/analysis, then independently confirming or disproving the underlying hypothesis using other methods or techniques.

And you would have to question why the initial scientist would not also provide the software and raw data, equations and results, allowing other scientists to confirm his simulation and also to test the hypothesis independently using their own methods or analysis.

You are right to question why scientists would not provide all the relevant data/code for review.

If only to confirm the method and technique is repeatable.

Generally, science is quite closed until it's confirmed that a discovery has been made; this allows the scientist to claim the discovery.

Once you have published and been given the credit for your discovery, it moves (or should move) into a very OPEN format, as you suggest.

BTW: nice blog, well written.

Glyn Moody said...

thanks for the feedback (both kinds)