Reproducibility of science and open source

I'm all for open access.
I'm all for open source.
I'm all for reproducible science.
I'm all for replicable studies.

So I like that data are shared.
I like that protocols are shared.
I also really appreciate it when code is shared - but only when it is appropriate.

One case where I think sharing code is appropriate is when an entirely new method is introduced. Then I think it is important to release the code/script as source code, a package, a program, etc., so that other scientists can reproduce your work or use the method in their own analyses.

However, I've noticed that, oftentimes, the shared code/script is nothing more than the authors' workflow - in which case I don't want to see it. Everyone has a different workflow, and I don't want to have to get into other people's heads to figure out exactly what I'm looking at and what the code is doing - because shared workflows commonly come as uncommented code. This is not only useless but a hindrance, because I have to spend head-ouchy-inducing time figuring out what every line of code is doing...

##### Tangential rant starts here #####
The primary source of headache is R script written in a way that combines multiple operations in a single line - e.g. combining subsetting data with feeding it through some function. People, everything in R is an object, so there is no point in doing everything in a single line. Break it up. Make objects. Subset the data and give the subset its own name. Then run functions on the subsetted data objects...
##### Tangential rant ends #####
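
To make the rant concrete, here is a minimal sketch of the same calculation written both ways. The data frame, its column names, and every object name are made up for illustration; this is not taken from anyone's published script.

```r
# Hypothetical data, invented purely for illustration
measurements <- data.frame(
  taxon  = c("A", "B", "C", "D", "E"),
  clade  = c("x", "x", "y", "y", "x"),
  length = c(10.2, 11.5, NA, 30.1, 12.3)
)

# Everything in one line: the subsetting and the function call are tangled together
one_liner <- mean(measurements[measurements$clade == "x" & !is.na(measurements$length), "length"])

# The same calculation, one named object per step
is_clade_x      <- measurements$clade == "x"        # which rows belong to clade x
has_length      <- !is.na(measurements$length)      # which rows have a measurement
clade_x_data    <- measurements[is_clade_x & has_length, ]
clade_x_lengths <- clade_x_data$length
stepwise        <- mean(clade_x_lengths)

# Both routes should give the same answer
stopifnot(isTRUE(all.equal(one_liner, stepwise)))
```

The second version is longer, but every intermediate object can be printed and checked on its own, which is exactly what a reader of someone else's script needs.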

Another thing I don't want to see is a data dump, organised into a million folders with no annotations or explanations, with only an uncommented R script associated with the data in each of the million folders.

##### Tangential rant2 starts here #####
I have also found that if you contact authors for clarification on such a data dump, their first response is typically, "What are you planning to do with my data?" - Dude, you published it already...it's fair game, man.
##### Tangential rant2 ends #####

What I think is most useful for reproducible science:

The prepared data - For a quick reanalysis, prepped data - e.g. missing data removed, taxa matched with phylogeny - is essential for replication.
The raw data - So that you can prepare the data in the same way the authors did but using your own methods. Or better yet, you think you have a better way to process the data - i.e. you disagree with the authors' choice of data preparation. Maybe you have additional data the original authors didn't have and you want to augment the raw data.
The protocol - I value this more than raw script or source code. I think it is very important to share step-by-step instructions on how the analyses were conducted and what parameter values were used for inputs. People can figure out how to code up that protocol in their own workflows. After all, some labs might prefer to code things in Python or Matlab - like some wet labs might buy gels while others make their own, or prefer Corning tubes over Falcon tubes, or have totally different cell incubation systems - who knows...
The full table/list of summary statistics - So that you can compare your results and see how well they match up.
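
As a rough sketch of how those four pieces fit together in practice, the steps below go from raw data to prepared data and then check a recomputed statistic against a reported one. Everything here is an assumption for illustration: the data, the vector standing in for a phylogeny's tip labels, and the "reported" value are all invented, and a real protocol would spell out its own steps and parameter values.

```r
# Hypothetical raw data, invented for illustration
raw_data <- data.frame(
  taxon = c("A", "B", "C", "D"),
  mass  = c(5.0, NA, 7.0, 9.0),
  stringsAsFactors = FALSE
)

# Stand-in for the tip labels of a phylogeny (a real tree object is not used here)
tree_tips <- c("A", "C", "D", "E")

# Protocol step 1: remove rows with missing data
complete_data <- raw_data[complete.cases(raw_data), ]

# Protocol step 2: keep only taxa present in both the data and the tree
shared_taxa   <- intersect(complete_data$taxon, tree_tips)
prepared_data <- complete_data[complete_data$taxon %in% shared_taxa, ]

# Protocol step 3: recompute a summary statistic and compare it with the reported one
reported_mean   <- 7.0   # hypothetical value from a published summary table
recomputed_mean <- mean(prepared_data$mass)
matches_report  <- isTRUE(all.equal(recomputed_mean, reported_mean, tolerance = 1e-6))
```

With the raw data, the written steps, and the table of reported statistics in hand, someone could redo this in whatever language they like and still know whether they ended up in the same place.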

Perhaps people will disagree but these are my opinions regarding the role of open source and sharing with respect to reproducible science.

Raptor's Nest
