
R for beginners and intermediate users 2: extracting subsets of data

For my second post on R, I think I will address how to extract subsets of data based on some selection criterion like taxon names. For instance, I have a huge dataset of morphometric variables for at least 36 species of cats (living and fossil). Sometimes I'd like to do some stats on a subset of this dataset, like all the living cats or just on the Panthera lineage species (Panthera and Neofelis). Till recently, I've been doing most of my dataset manipulation in Excel by filtering out certain taxa from the spreadsheet and copy-pasting to a text file, which I read into R. However, you can select subsets of data in R based on taxon names.

In my dataset that I call cat, I have a column labelled Taxa which contains all my taxon names. So typing cat$Taxa would be the way to call up my taxon names.

Let's say I want to extract from my dataset cat just the data for the lion Panthera leo. The associated taxon names in cat$Taxa would be Panthera_leo. So to extract that portion of the dataset, we can type something like:

leo <- cat[cat$Taxa=="Panthera_leo",]

and I've called the extracted subset leo. This command is very similar to some of the data manipulation I covered in my previous post (e.g. data[2,]), except that I've specified leo to be the rows ([row,]) in the dataset cat whose Taxa column equals "Panthera_leo", i.e. cat$Taxa=="Panthera_leo". In fact, typing cat$Taxa=="Panthera_leo" on its own returns a logical vector of TRUE/FALSE values, one per row, where TRUE marks the rows with cat$Taxa=="Panthera_leo".
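To see the two steps separately, here is a minimal sketch on a toy data frame. The name cats, the SkullLength column and all the values are made up for illustration; they are not my real dataset.

```r
# Hypothetical toy data standing in for the real morphometric dataset
cats <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo", "Neofelis_nebulosa"),
  SkullLength = c(350, 330, 362, 180),  # invented values
  stringsAsFactors = FALSE
)

# The comparison on its own gives one TRUE/FALSE value per row
is.leo <- cats$Taxa == "Panthera_leo"
print(is.leo)  # TRUE FALSE TRUE FALSE

# Putting that logical vector in the row position keeps only the TRUE rows
leo <- cats[is.leo, ]
print(leo)
```

Storing the comparison in its own object (is.leo here) is optional, but it makes it easy to inspect which rows will be kept before extracting them.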

Typing leo would now return a smaller subset of the dataset containing only the lion data.

Conversely, we can also select all the non-lion data like this:

not.leo <- cat[cat$Taxa!="Panthera_leo",]

Typing not.leo should now return all the data excluding the lion data.

My original dataset cat had 361 rows, leo has 17 rows and not.leo should be 361 - 17 = 344 rows.
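The same bookkeeping can be checked in R with nrow(). This is a sketch on a hypothetical toy data frame (cats and its values are invented), and it assumes the Taxa column has no missing values. With NA entries, both subsets would pick up extra NA rows and the counts would no longer add up.

```r
# Hypothetical toy data; the real dataset has 361 rows
cats <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo",
           "Neofelis_nebulosa", "Panthera_pardus"),
  SkullLength = c(350, 330, 362, 180, 240),  # invented values
  stringsAsFactors = FALSE
)

leo     <- cats[cats$Taxa == "Panthera_leo", ]
not.leo <- cats[cats$Taxa != "Panthera_leo", ]

# Row counts of the two subsets should add up to the original
n.all     <- nrow(cats)
n.leo     <- nrow(leo)
n.not.leo <- nrow(not.leo)
print(c(n.all, n.leo, n.not.leo))  # 5 2 3

# Stops with an error if the subsets do not partition the data
stopifnot(n.leo + n.not.leo == n.all)
```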

To take this a bit further, we can even extract a subset of a dataset that overlaps with another dataset. For example, I have two datasets on theropod dinosaurs, one on morphological variables and another on biomechanical model parameters. My morphological dataset is larger than my biomechanical dataset, encompassing more taxa and specimens, for the simple reason that biomechanical modelling can only be done on specimens meeting certain criteria. Nonetheless, some specimens overlap. So we can first determine the specimens that are in both datasets using the %in% operator, which is a convenience form of the match() function and is written as data1 %in% data2.

Let's call the morphological dataset morph and the biomechanical dataset biomech. Let's also say that the row names are the unique specimen numbers that we want to match up in the two datasets. We want to determine the observations within the bigger dataset morph that are also present in biomech.

rownames(morph) %in% rownames(biomech)

This command checks whether each row name of morph is present among the row names of biomech, and returns a logical vector of TRUE/FALSE values. Observations in morph that are present in biomech come back as TRUE, while those absent from biomech are shown as FALSE. Now to extract the common observations as a subset of morph, we can do something very similar to what I showed for the cat dataset above:
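As a small illustration, here is what the comparison returns for two short vectors of specimen IDs. The IDs are hypothetical. The second half shows the link to match(), following the definition given in R's help page for %in%.

```r
# Hypothetical specimen IDs standing in for the row names
morph.ids   <- c("spec01", "spec02", "spec03", "spec04", "spec05")
biomech.ids <- c("spec04", "spec02", "spec09")

# One TRUE/FALSE per element of the left-hand side
in.both <- morph.ids %in% biomech.ids
print(in.both)  # FALSE TRUE FALSE TRUE FALSE

# The equivalent expression written with match()
same <- match(morph.ids, biomech.ids, nomatch = 0) > 0
print(identical(in.both, same))  # TRUE
```

Note that the result is always as long as the left-hand side, so swapping the two arguments answers a different question (which specimens in biomech are also in morph).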

morph.common <- morph[rownames(morph) %in% rownames(biomech),]

The new dataset morph.common should now be a subset of morph comprising the specimens that are also present in biomech. If morph had 92 rows and biomech had 34, and all 34 specimens in biomech were present in morph, then morph.common should have 34 rows. On the other hand, if only 20 of the specimens in morph (n rows = 92) were also present in biomech (n rows = 34), then morph.common would only have 20 rows.
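Here is a minimal sketch of the second scenario, where not every specimen in biomech appears in morph. The data frames, column names and values are all invented for illustration. I've added drop = FALSE as an assumption of good practice: with a single-column data frame like this toy one, R would otherwise simplify the result to a plain vector.

```r
# Hypothetical toy datasets with specimen IDs as row names
morph <- data.frame(
  FemurLength = c(510, 640, 480, 720, 390),  # invented values
  row.names = c("spec01", "spec02", "spec03", "spec04", "spec05")
)
biomech <- data.frame(
  BiteForce = c(3200, 2100, 1500),  # invented values
  row.names = c("spec04", "spec02", "spec09")
)

# Keep the rows of morph whose specimen ID also occurs in biomech
morph.common <- morph[rownames(morph) %in% rownames(biomech), , drop = FALSE]

print(rownames(morph.common))  # "spec02" "spec04"
print(nrow(morph.common))      # 2, because spec09 is not in morph
```

The rows come back in the order they appear in morph, not in the order of biomech, which is worth keeping in mind if the two datasets are to be compared row by row afterwards.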

These commands are proving very useful, as I no longer have to go back to Excel so much for simple data comparison and extraction, but they were a bit difficult to find. So I hope my attempt to summarise this information will be useful for someone else too.

For my next post in my R 'tutorials', I'll finally try and address my original blog idea from all those months ago, i.e. how to plot in R using colours according to groupings.

