
R for beginners and intermediate users 2: extracting subsets of data

For my second post on R, I think I will address how to extract subsets of data based on some selection criterion like taxon names. For instance, I have a huge dataset of morphometric variables for at least 36 species of cats (living and fossil). Sometimes I'd like to do some stats on a subset of this dataset, like all the living cats or just on the Panthera lineage species (Panthera and Neofelis). Till recently, I've been doing most of my dataset manipulation in Excel by filtering out certain taxa from the spreadsheet and copy-pasting to a text file, which I read into R. However, you can select subsets of data in R based on taxon names.

In my dataset that I call cat, I have a column labelled Taxa which contains all my taxon names. So typing cat$Taxa would be the way to call up my taxon names.

Let's say I want to extract from my dataset cat just the data for the lion Panthera leo. The associated taxon names in cat$Taxa would be Panthera_leo. So to extract that portion of the dataset, we can type something like:

leo <- cat[cat$Taxa=="Panthera_leo",]

and I've called the extracted subset leo. This command is very similar to some data manipulation I covered in my previous post (e.g. data[2,]), except that I've specified leo to be the rows ([row,]) in the dataset cat that have the Taxa column equal to 'Panthera_leo', i.e. cat$Taxa=="Panthera_leo". In fact, typing cat$Taxa=="Panthera_leo" on its own returns a logical vector of TRUE/FALSE values, one per row, where TRUE marks the rows whose taxon name is Panthera_leo.

Typing leo would now return a smaller subset of the dataset containing only the lion data.
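To make the TRUE/FALSE step concrete, here is a minimal sketch on a toy stand-in for cat. The four rows and the SkullLength column are invented for illustration and are not from my real dataset.

```r
# Toy stand-in for the cat dataset; rows and the SkullLength column are invented
cat <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo", "Neofelis_nebulosa"),
  SkullLength = c(350, 330, 362, 180)
)

# The comparison gives one TRUE/FALSE value per row
is.leo <- cat$Taxa == "Panthera_leo"
print(is.leo)  # TRUE FALSE TRUE FALSE

# Putting that logical vector in the row position keeps only the TRUE rows
leo <- cat[is.leo, ]
print(leo)
```

Storing the comparison in its own object (is.leo here, a name I made up) is optional, but it makes it easy to inspect which rows will be kept before extracting them.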

Conversely, we can also select all the non-lion data like this:

not.leo <- cat[cat$Taxa!="Panthera_leo",]

Typing not.leo should now return all the data excluding the lion data.

My original dataset cat had 361 rows, leo has 17 rows and not.leo should be 361 - 17 = 344 rows.
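The row counts can be checked directly with nrow(). The sketch below uses a small invented stand-in for cat (five rows rather than 361), and it assumes the Taxa column has no missing values; an NA taxon name would show up as an all-NA row in both subsets and throw the arithmetic off.

```r
# Toy stand-in for cat (values invented); the real dataset has 361 rows
cat <- data.frame(
  Taxa = c("Panthera_leo", "Panthera_tigris", "Panthera_leo",
           "Neofelis_nebulosa", "Felis_catus"),
  SkullLength = c(350, 330, 362, 180, 90)
)

leo <- cat[cat$Taxa == "Panthera_leo", ]
not.leo <- cat[cat$Taxa != "Panthera_leo", ]

# Row counts of the two subsets
nrow(leo)      # 2
nrow(not.leo)  # 3

# Together the two subsets should account for every row of the original
stopifnot(nrow(leo) + nrow(not.leo) == nrow(cat))
```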

To take this a bit further, we can even extract a subset of a dataset that overlaps with another dataset. For example, I have two datasets on theropod dinosaurs, one on morphological variables and another on biomechanical model parameters. My morphological dataset is larger than my biomechanical dataset, encompassing more taxa and specimens, for the simple reason that biomechanical modelling can only be done on specimens meeting certain criteria. But nonetheless, some specimens overlap. So we can first determine the specimens that are in both datasets using the %in% operator, which is a convenient shorthand built on the match() function and is written as data1 %in% data2.

Let's call the morphological dataset morph and the biomechanical dataset biomech. Let's also say that the row names are the unique specimen numbers that we want to match up in the two datasets. We want to determine which observations within the bigger dataset morph are also present in biomech.

rownames(morph) %in% rownames(biomech)

This command checks whether each row name of morph is present among the row names of biomech, and returns a logical vector of TRUE/FALSE values, one per row of morph. Observations in morph that are present in biomech come back as TRUE, while those absent from biomech come back as FALSE. Now to extract the common observations as a subset of morph, we can do something very similar to what I showed for the cat dataset above:
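As a rough illustration of what %in% returns, here is a sketch with two tiny made-up datasets. The specimen IDs (SP001 and so on) and the measurement columns are hypothetical, not my real data.

```r
# Toy stand-ins; specimen IDs and values are invented for illustration
morph <- data.frame(
  FemurLength = c(1.20, 0.85, 0.40, 1.05),
  SkullLength = c(0.95, 0.60, 0.25, 0.80),
  row.names = c("SP001", "SP002", "SP003", "SP004")
)
biomech <- data.frame(
  BiteForce = c(3500, 900),
  GapeAngle = c(65, 70),
  row.names = c("SP002", "SP004")
)

# One TRUE/FALSE per row of morph: is that specimen also in biomech?
in.both <- rownames(morph) %in% rownames(biomech)
print(in.both)  # FALSE TRUE FALSE TRUE

# TRUE counts as 1, so sum() gives the number of shared specimens
sum(in.both)    # 2
```

The result always has the same length as the left-hand side, so swapping the two datasets around answers a different question (which rows of biomech are in morph).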

morph.common <- morph[rownames(morph) %in% rownames(biomech),]

The new dataset morph.common should now be a subset of morph comprising only the specimens that are also present in biomech. If morph had 92 rows and biomech had 34, and all 34 specimens in biomech were present in morph, then morph.common should have 34 rows. On the other hand, if only 20 of the 92 specimens in morph were also present among the 34 in biomech, then morph.common would have only 20 rows.
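Here is a small sketch of the partial-overlap case, again with invented specimen IDs and values. The reverse extraction (biomech.common) is my own addition for illustration; it is the same idea with the two datasets swapped.

```r
# Toy stand-ins; specimen IDs and values are invented for illustration
morph <- data.frame(
  FemurLength = c(1.20, 0.85, 0.40, 1.05, 0.66),
  SkullLength = c(0.95, 0.60, 0.25, 0.80, 0.48),
  row.names = c("SP001", "SP002", "SP003", "SP004", "SP005")
)
biomech <- data.frame(
  BiteForce = c(3500, 900, 1200),
  GapeAngle = c(65, 70, 58),
  row.names = c("SP002", "SP004", "SP999")  # SP999 is not in morph
)

# Rows of morph whose specimen number also appears in biomech
morph.common <- morph[rownames(morph) %in% rownames(biomech), ]
nrow(morph.common)      # 2, even though biomech has 3 rows
rownames(morph.common)  # "SP002" "SP004"

# Same idea in the other direction: rows of biomech that are also in morph
biomech.common <- biomech[rownames(biomech) %in% rownames(morph), ]
nrow(biomech.common)    # 2
```

One thing to watch, as far as I can tell: if the data frame has only a single column, indexing with [rows,] returns a plain vector rather than a data frame unless you add drop = FALSE, which is why the toy datasets here have two columns each.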

These commands are proving very useful, as I no longer have to go back to Excel so much for simple data comparison and extraction, but they were a bit difficult to find. So I hope my attempt to summarise this bit of information will also be useful for someone else.

For my next post in my R 'tutorials', I'll finally try and address my original blog idea from all those months ago, i.e. how to plot in R using colours according to groupings.

