
Reproducibility of science and open source

I'm all for open access.
I'm all for open source.
I'm all for reproducible science.
I'm all for replicable studies.

So I like that data are shared.
I like that protocols are shared.
I also really appreciate it when code is shared - but only when it is appropriate.

One time I think it is appropriate to share code is when an entirely new method is introduced - then I think it is important to release the code as a script, package, program etc. so that other scientists can reproduce your work or use it in their own analyses.

However, I've noticed that shared code is often nothing more than the authors' workflow - in which case I don't want to see it. Everyone has a different workflow and I don't want to have to get into other people's heads to figure out exactly what I'm looking at and what the code is doing - because shared workflows commonly come as uncommented code. This is not only useless but a hindrance, because I have to spend head-ouchy-inducing time figuring out what every line of code is doing...

#####  Tangential rant starts here #####
The primary source of headache is R script written in a way that combines multiple operations in a single line - e.g. combining subsetting data with feeding it through some function. People, R lets you store any intermediate result as a named object, so there is no point in doing everything in a single line. Break it up. Make objects. Subset the data and call the subset something else. Then run your functions on the subsetted data objects...
#####  Tangential rant ends  #####
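
To make the rant above concrete, here is a minimal sketch in base R of the same calculation written both ways. The data frame `dat` and every object name in it are made up for illustration and do not come from any published script.

```r
# Hypothetical toy data, invented purely for illustration
dat <- data.frame(
  taxon  = c("A", "B", "C", "D", "E"),
  group  = c("x", "x", "y", "y", "y"),
  length = c(10, 12, 20, NA, 24)
)

# Everything in one line: subset rows, drop missing values and summarise at once
one_liner <- mean(dat[dat$group == "y" & !is.na(dat$length), "length"])

# The same analysis broken up, one named object per step
in_group_y    <- dat$group == "y"                # rows that belong to group y
has_length    <- !is.na(dat$length)              # rows that have a measurement
group_y       <- dat[in_group_y & has_length, ]  # the subset, kept as its own object
mean_length_y <- mean(group_y$length)            # the function, run on the subset
```

Both versions give the same number, but in the second one a reader can inspect `group_y` on its own to check which rows actually went into the mean.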

Another thing I don't want to see is a data dump, organised into a million folders with no annotations or explanations, with only an uncommented R script associated with the data in each of the million folders.

#####  Tangential rant2 starts here #####
I also found that if you contact authors for clarification on such a data dump, their first response is typically, "What are you planning to do with my data?" - Dude, you published it, it's fair game, man.
#####  Tangential rant2 ends  #####

What I think is most useful for reproducible science:

  1. The prepared data - For a quick reanalysis, prepped data - e.g. missing data removed, taxa matched with phylogeny - is essential for replication.
  2. The raw data - So that you can prepare the data in the same way the authors did but using your own methods. Or better yet, you think you have a better way to process the data - i.e. you disagree with the authors' choice of data preparation. Maybe you have additional data the original authors didn't have and you want to augment the raw data.
  3. The protocol - I value this more than raw script or source code. I think it is very important to share step-by-step instructions on how the analyses were conducted and what parameter values were used for inputs. People can figure out how to code up that protocol in their own workflows. After all, some labs might prefer to code things in Python or Matlab - just like some wet labs might buy gels while others make their own, or prefer Corning tubes over Falcon tubes, or have totally different cell incubation systems - who knows...
  4. The full table/list of summary statistics - So that you can compare your results and see how well they match up.
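
As a rough sketch of how the items in the list above fit together, here is one way the raw data, the prepared data and a summary table could relate to each other in base R. All taxon names, measurements and object names are invented for illustration, and the phylogeny is stood in for by a plain vector of tip labels (with the `ape` package this would typically be `tree$tip.label`).

```r
# Raw data: hypothetical measurements, invented for illustration only
raw <- data.frame(
  taxon = c("Taxon_a", "Taxon_b", "Taxon_c", "Taxon_d"),
  femur = c(100, NA, 250, 400),
  tibia = c(110, 180, 240, 350),
  stringsAsFactors = FALSE
)

# Assumed stand-in for the tip labels of a phylogeny
tree_tips <- c("Taxon_a", "Taxon_c", "Taxon_e")

# Protocol step 1: remove taxa with missing data
complete <- raw[complete.cases(raw), ]

# Protocol step 2: keep only taxa that are also present in the phylogeny
prepared <- complete[complete$taxon %in% tree_tips, ]

# Record which taxa were dropped so the preparation can be checked
dropped <- setdiff(raw$taxon, prepared$taxon)

# Summary statistics table that a reader can compare their own results against
summary_table <- data.frame(
  n_taxa     = nrow(prepared),
  mean_femur = mean(prepared$femur),
  mean_tibia = mean(prepared$tibia)
)
```

Each of `raw`, `prepared` and `summary_table` could then be saved as its own file (e.g. with `write.csv()`), with the two numbered steps written out in plain words as the protocol.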

Perhaps people will disagree, but these are my opinions regarding the role of open source and sharing with respect to reproducible science.

