Wednesday, January 19, 2011

R for beginners and intermediate users 2: extracting subsets of data

For my second post on R, I think I will address how to extract subsets of data based on some selection criterion like taxon names. For instance, I have a huge dataset of morphometric variables for at least 36 species of cats (living and fossil). Sometimes I'd like to do some stats on a subset of this dataset, like all the living cats or just on the Panthera lineage species (Panthera and Neofelis). Till recently, I've been doing most of my dataset manipulation in Excel by filtering out certain taxa from the spreadsheet and copy-pasting to a text file, which I read into R. However, you can select subsets of data in R based on taxon names.

In my dataset that I call cat, I have a column labelled Taxa which contains all my taxon names. So typing cat$Taxa would be the way to call up my taxon names.

Let's say I want to extract from my dataset cat just the data for the lion Panthera leo. The associated taxon names in cat$Taxa would be Panthera_leo. So to extract that portion of the dataset, we can type something like:

leo <- cat[cat$Taxa=="Panthera_leo",]

and I've called the extracted subset leo. This command is very similar to some data manipulation I covered in my previous post (e.g. data[2,]), except that I've specified leo to be the rows ([row,]) in the dataset cat whose Taxa column equals 'Panthera_leo', i.e. cat$Taxa=="Panthera_leo". In fact, typing cat$Taxa=="Panthera_leo" on its own would return a vector of TRUE/FALSE values, one per row, where TRUE marks the rows with cat$Taxa=="Panthera_leo".
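To see that TRUE/FALSE vector in action, here is a minimal sketch on a toy stand-in for cat. Only the column name Taxa comes from my real dataset; the SkullLength column, its values and the name is.leo are invented purely for illustration.

```r
# Toy stand-in for the real 'cat' dataset; all values are invented
cat <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo", "Neofelis_nebulosa"),
  SkullLength = c(350, 330, 342, 180)
)

# The comparison on its own gives one TRUE/FALSE value per row
is.leo <- cat$Taxa == "Panthera_leo"
is.leo

# Putting that vector in the row position keeps only the TRUE rows
leo <- cat[is.leo, ]
leo
```

Storing the comparison in its own object first is optional, but it makes it easy to inspect which rows are about to be selected.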

Typing leo would now return a smaller subset of the dataset containing only the lion data.

Conversely, we can also select all the non-lion data like this:

not.leo <- cat[cat$Taxa!="Panthera_leo",]

Typing not.leo should now return all the data excluding the lion data.

My original dataset cat had 361 rows and leo has 17 rows, so not.leo should have 361 - 17 = 344 rows.
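A quick way to confirm that kind of arithmetic is nrow(). The sketch below uses a small invented data frame rather than my real one, so the counts are different, but the check is the same. It assumes the Taxa column has no missing values, since an NA in the comparison would produce an extra NA row in both subsets.

```r
# Toy stand-in for the real 'cat' dataset; all values are invented
cat <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo",
           "Neofelis_nebulosa", "Panthera_onca"),
  SkullLength = c(350, 330, 342, 180, 260)
)

leo     <- cat[cat$Taxa == "Panthera_leo", ]
not.leo <- cat[cat$Taxa != "Panthera_leo", ]

# Count the rows in each piece
nrow(cat)
nrow(leo)
nrow(not.leo)

# The two subsets together should account for every row of the original
nrow(leo) + nrow(not.leo) == nrow(cat)
```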

To take this a bit further, we can even extract a subset of a dataset that overlaps with another dataset. For example, I have two datasets on theropod dinosaurs, one on morphological variables and another on biomechanical model parameters. My morphological dataset is larger than my biomechanical dataset, encompassing more taxa and specimens, for the simple reason that biomechanical modelling can only be done on specimens meeting certain criteria. Nonetheless, some specimens overlap. So we can first determine the specimens that are in both datasets using the %in% operator, which is a shorthand built on the match() function and is written as data1 %in% data2.
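As a rough illustration of how match() and %in% relate, here is a sketch on two short vectors. The specimen numbers and the names ids.a and ids.b are made up for the example.

```r
# Hypothetical specimen numbers, invented for illustration
ids.a <- c("S01", "S02", "S03", "S04")
ids.b <- c("S03", "S01", "S09")

# match() gives the position of each element of ids.a within ids.b,
# or NA where there is no match
match(ids.a, ids.b)

# %in% reduces that to a TRUE/FALSE answer for each element of ids.a
ids.a %in% ids.b
```

For subsetting, the TRUE/FALSE form is the convenient one, because it can go straight into the row position of the square brackets.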

Let's call the morphological dataset morph and the biomechanical dataset biomech. Let's also say that the row names are the unique specimen numbers that we want to match up in the two datasets. We want to determine which observations within the bigger dataset morph are also present in biomech.

rownames(morph) %in% rownames(biomech)

This command checks whether each row name of morph is present among the row names of biomech, and returns a vector of TRUE/FALSE values. Observations in morph that are present in biomech will come back as TRUE, while those absent from biomech will be shown as FALSE. Now to extract the common observations as a subset of morph, we can do something very similar to what I showed for the cat dataset above:

morph.common <- morph[rownames(morph) %in% rownames(biomech),]

The new dataset morph.common should now be a subset of morph comprising the specimens that are also present in biomech. If morph had 92 rows and biomech had 34, and all 34 specimens in biomech were present in morph, then morph.common should have 34 rows. On the other hand, if only 20 of the specimens in morph (n rows = 92) were also present in biomech (n rows = 34), then morph.common would only have 20 rows.
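Here is a small self-contained sketch of the whole procedure. The specimen numbers, column names and measurements are all invented for illustration and are not from my real datasets; only the object names morph, biomech and morph.common follow the text above.

```r
# Toy morphological dataset: four specimens, row names are specimen numbers
morph <- data.frame(
  FemurLength = c(1200, 980, 450, 760),
  SkullLength = c(1400, 1100, 500, 820),
  row.names = c("SP001", "SP002", "SP003", "SP004")
)

# Toy biomechanical dataset: three specimens, one of which is not in morph
biomech <- data.frame(
  BiteForce = c(9000, 5000, 3000),
  row.names = c("SP002", "SP004", "SP010")
)

# TRUE for each row of morph whose specimen number also occurs in biomech
rownames(morph) %in% rownames(biomech)

# Keep only those rows of morph
morph.common <- morph[rownames(morph) %in% rownames(biomech), ]
morph.common
nrow(morph.common)
```

In this toy case only SP002 and SP004 occur in both datasets, so morph.common ends up with two rows, even though biomech has three.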

These commands are proving very useful, as I no longer have to go back to Excel so much for simple data comparison and extraction, but they were a bit difficult to find. So I hope my attempt to summarise this bit of information will also be useful for someone else.

For my next post in my R 'tutorials', I'll finally try and address my original blog idea from all those months ago, i.e. how to plot in R using colours according to groupings.