
R for beginners and intermediate users 2: extracting subsets of data

For my second post on R, I think I will address how to extract subsets of data based on some selection criterion like taxon names. For instance, I have a huge dataset of morphometric variables for at least 36 species of cats (living and fossil). Sometimes I'd like to do some stats on a subset of this dataset, like all the living cats or just on the Panthera lineage species (Panthera and Neofelis). Till recently, I've been doing most of my dataset manipulation in Excel by filtering out certain taxa from the spreadsheet and copy-pasting to a text file, which I read into R. However, you can select subsets of data in R based on taxon names.

In my dataset that I call cat, I have a column labelled Taxa which contains all my taxon names. So typing cat$Taxa would be the way to call up my taxon names.

Let's say I want to extract from my dataset cat just the data for the lion Panthera leo. The associated taxon names in cat$Taxa would be Panthera_leo. So to extract that portion of the dataset, we can type something like:

leo <- cat[cat$Taxa=="Panthera_leo",]

and I've called the extracted subset leo. This command is very similar to the data manipulation I covered in my previous post (e.g. data[2,]), except that instead of giving a row number I've specified the rows ([row,]) of the dataset cat whose Taxa column equals 'Panthera_leo', i.e. cat$Taxa=="Panthera_leo". In fact, typing cat$Taxa=="Panthera_leo" on its own returns a logical vector of TRUE/FALSE values, one per row, where TRUE marks the rows in which Taxa is "Panthera_leo".
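To see what that TRUE/FALSE test looks like in practice, here is a minimal sketch on a made-up toy data frame. The values and the skull_length column are invented for illustration and are not from my real dataset:

```r
# Toy stand-in for the real dataset; all values are invented for illustration
cat <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo", "Neofelis_nebulosa"),
  skull_length = c(350, 330, 362, 180)  # hypothetical measurements
)

# The comparison on its own gives one TRUE/FALSE value per row
is.leo <- cat$Taxa == "Panthera_leo"
is.leo
# [1]  TRUE FALSE  TRUE FALSE

# Putting that logical vector in the row position keeps only the TRUE rows
leo <- cat[is.leo, ]
leo
```

Storing the test in its own object (is.leo here) is optional, but it makes it easy to check what is being selected before actually subsetting.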

Typing leo would now return a smaller subset of the dataset containing only the lion data.

Conversely, we can also select all the non-lion data like this:

not.leo <- cat[cat$Taxa!="Panthera_leo",]

Typing not.leo should now return all the data excluding the lion data.

My original dataset cat has 361 rows and leo has 17 rows, so not.leo should have 361 - 17 = 344 rows.
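A quick way to confirm that sort of arithmetic is to count the rows with nrow(). This is only a sketch on an invented toy data frame (five rows rather than 361), so the numbers differ from my real dataset:

```r
# Toy stand-in for the real dataset; all values are invented for illustration
cat <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo",
           "Neofelis_nebulosa", "Panthera_onca"),
  skull_length = c(350, 330, 362, 180, 270)  # hypothetical measurements
)

leo     <- cat[cat$Taxa == "Panthera_leo", ]
not.leo <- cat[cat$Taxa != "Panthera_leo", ]

# Row counts of the full dataset and of the two subsets
nrow(cat)      # 5
nrow(leo)      # 2
nrow(not.leo)  # 3

# The two subsets should add up to the full dataset
nrow(leo) + nrow(not.leo) == nrow(cat)  # TRUE
```

This tidy sum assumes there are no missing values (NA) in the Taxa column. If there are, a logical index containing NA returns rows filled with NA, so the counts will not add up; wrapping the test in which(), e.g. cat[which(cat$Taxa=="Panthera_leo"),], is one way around that.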

To take this a bit further, we can even extract a subset of a dataset that overlaps with another dataset. For example, I have two datasets on theropod dinosaurs, one on morphological variables and another on biomechanical model parameters. My morphological dataset is larger than my biomechanical dataset, encompassing more taxa and specimens, for the simple reason that biomechanical modelling can only be done on specimens meeting certain criteria. Nonetheless, some specimens are in both. So we can first determine which specimens are in both datasets using the %in% operator, written as data1 %in% data2, which is a convenient shorthand built on the match() function.

Let's call the morphological dataset morph and the biomechanical dataset biomech. Let's also say that the row names are the unique specimen numbers that we want to match up in the two datasets. We want to determine which observations in the bigger dataset morph are also present in biomech.

rownames(morph) %in% rownames(biomech)

This command checks whether each row name of morph is present among the row names of biomech, and returns a logical vector of TRUE/FALSE values, one per row of morph. Observations in morph that are present in biomech come back as TRUE, while those absent from biomech come back as FALSE. Now to extract the common observations as a subset of morph, we can do something very similar to what I showed for the cat dataset above:

morph.common <- morph[rownames(morph) %in% rownames(biomech),]

The new dataset morph.common should now be a subset of morph comprising only the specimens that are also present in biomech. If morph had 92 rows and biomech had 34, and all 34 specimens in biomech were present in morph, then morph.common should have 34 rows. On the other hand, if only 20 of the specimens in morph (n rows = 92) were also present in biomech (n rows = 34), then morph.common would only have 20 rows.
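Here is a small self-contained sketch of the same idea. The specimen numbers, column names and values are all invented for illustration, and the toy datasets are much smaller than my real ones:

```r
# Toy datasets; specimen numbers, columns and values are invented for illustration
morph <- data.frame(
  femur_length = c(1200, 850, 430, 975, 310),
  tibia_length = c(1100, 800, 450, 900, 340),
  row.names = c("SPEC_001", "SPEC_002", "SPEC_003", "SPEC_004", "SPEC_005")
)
biomech <- data.frame(
  bite_force = c(8000, 12000, 1500),
  row.names = c("SPEC_002", "SPEC_004", "SPEC_999")  # SPEC_999 is not in morph
)

# One TRUE/FALSE value per row of morph: is this specimen also in biomech?
in.both <- rownames(morph) %in% rownames(biomech)
in.both
# [1] FALSE  TRUE FALSE  TRUE FALSE

# Keep only the rows of morph that are also present in biomech
morph.common <- morph[in.both, ]
rownames(morph.common)
# [1] "SPEC_002" "SPEC_004"

# Number of shared specimens, which is also the number of rows in morph.common
sum(in.both)  # 2
```

Only two of the three biomech specimens are in morph here, so morph.common has two rows. The same operator should also work for picking several taxa at once from the cat dataset, e.g. cat[cat$Taxa %in% c("Panthera_leo","Panthera_tigris"),].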

These commands are proving very useful, as I no longer have to go back to Excel so much for simple data comparison and extraction, but they were a bit difficult to find. So I hope my attempt to summarise this bit of information will be useful for someone else too.

For my next post in my R 'tutorials', I'll finally try and address my original blog idea from all those months ago, i.e. how to plot in R using colours according to groupings.
