
R for beginners and intermediate users 4: object-oriented programming

The topic of this post was mentioned in a tangential rant in my previous post, and I thought I might as well expand on it a bit. I'm not going to talk about programming language paradigms or anything like that, since I'm not a programmer; rather, I will treat this as more of a tutorial or "pro-tip" kind of post.

I will be focusing on an aspect of R that is often taken for granted and maybe not well known by entry-level users. That is, R is an object-oriented programming language. If you already know this, then this blog post is not for you.

First, I'll list out a few interesting/useful features of R:
  1. R is interpreted
  2. R is based on vectors
  3. R can utilise functions (i.e. functional programming)
  4. R utilises objects (object-oriented programming)
As I've already mentioned, this post will focus on point 4, that R is an object-oriented programming language (or simply that R can be object-oriented if you don't want to call R a programming language...).

We can start with a very simple situation.

Let's assign a name to a value in R:

> x <- 1

There you go. That's an object. We can call that object x and add another value 1 to it and assign the result to y.

> y <- x + 1

So that's object-oriented programming right there, at least in the loose sense I'm using here: working with named objects rather than bare values. Instead of displaying the result of the operation

> 1 + 1

which will return

[1] 2

we've called an object x and assigned the outcome of the operation to an object y. This kind of programming is core to R.
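One thing worth checking for yourself is that y stores the value that came out of the operation, not the operation itself. Here is a minimal sketch (the re-assignment of x to 10 is my own addition for illustration):

```r
x <- 1           # bind the name x to the value 1
y <- x + 1       # use x in an operation and store the result as y

print(y)         # 2
print(class(y))  # "numeric"

x <- 10          # re-assigning x does not change y ...
print(y)         # ... y is still 2

y <- x + 1       # re-run the operation to update y
print(y)         # 11
```

So if you change an object further up in your script, you need to re-run the lines that depend on it.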

Another example:

Let's say you want to calculate the mean value of a sequence from 1 to 10. You can do this with:

> mean(1:10)
[1] 5.5

You can also do it like this:

> x <- 1:10
> x
[1]  1  2  3  4  5  6  7  8  9 10
> y <- mean(x)
> y
[1] 5.5

You might think that's more typing than the first variation, but if you make it a habit to code in this second, object-oriented style, you will probably find that as your code gets more complicated, everything becomes easier to keep track of and, maybe more importantly, easier to debug.

For starters, you've stored the outcome of the operation as object y so you can use it later on if you need it again, and you won't have to type that operation again (which saves you from unnecessary typos).
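As a rough sketch of what that reuse might look like, here x and y are the same objects as above, while centred, above and n_above are names I've made up for illustration:

```r
x <- 1:10
y <- mean(x)

# Reuse the stored mean without typing mean(1:10) again
centred <- x - y            # deviations from the mean
above   <- x[x > y]         # values larger than the mean
n_above <- length(above)    # how many values are above the mean

print(centred)
print(above)
print(n_above)
```

Each later step refers to y by name, so if the input x changes you only have to edit one line and re-run the rest.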

Let's move on to a slightly more advanced example.

Suppose we have a data.frame object df which contains some morphometric data on Darwin's finches. Let's say that you want to subset the data to those rows (species) that have wingL greater than or equal to 4:

> df[df$wingL >= 4, ]

                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
2      Geospiza_conirostris  conirostris 4.349867 2.984200 2.654400 2.513800 2.360167
3       Geospiza_difficilis   difficilis 4.224067 2.898917 2.277183 2.011100 1.929983
4         Geospiza_scandens     scandens 4.261222 2.929033 2.621789 2.144700 2.036944
5           Geospiza_fortis       fortis 4.244008 2.894717 2.407025 2.362658 2.221867
6       Geospiza_fuliginosa   fuliginosa 4.132957 2.806514 2.094971 1.941157 1.845379
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
9     Camarhynchus_parvulus     parvulus 4.131600 2.973060 1.974420 1.873540 1.813340
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
11    Pinaroloxias_inornata Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

If you then wanted to extract just the species names and calculate the mean tarsusL, you could do this:

> df[df$wingL >= 4, ]$Taxon
[1] "Geospiza_magnirostris"    "Geospiza_conirostris"     "Geospiza_difficilis"      "Geospiza_scandens"        "Geospiza_fortis"          "Geospiza_fuliginosa"     
 [7] "Camarhynchus_pallida"     "Camarhynchus_parvulus"    "Camarhynchus_pauper"      "Pinaroloxias_inornata"    "Platyspiza_crassirostris" "Camarhynchus_psittacula" 

> mean(df[df$wingL >= 4, ]$tarsusL)
[1] 2.995884

Up to here, it might not be too bad to subset by the condition at every operation, but this can get annoying if you need to do something a bit more involved. For instance, you might subset the data as above and then subset further to return only the rows with tarsusL above the mean tarsusL of the subsetted data:

> df[df$wingL >= 4, ][df[df$wingL >= 4, ]$tarsusL > mean(df[df$wingL >= 4, ]$tarsusL), ]
                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

That can get confusing and is really error-prone - I actually got it wrong on my first couple of attempts.

What I would do instead is:

# set a condition where wingL >= 4
> cond1 <- df$wingL >= 4
# see what that looks like
> cond1
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
# it's a logical (TRUE/FALSE) vector

# Now subset according to the logical condition cond1 and call that object df1
> df1 <- df[cond1, ]

# see what the subsetted data df1 looks like
> df1
                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
2      Geospiza_conirostris  conirostris 4.349867 2.984200 2.654400 2.513800 2.360167
3       Geospiza_difficilis   difficilis 4.224067 2.898917 2.277183 2.011100 1.929983
4         Geospiza_scandens     scandens 4.261222 2.929033 2.621789 2.144700 2.036944
5           Geospiza_fortis       fortis 4.244008 2.894717 2.407025 2.362658 2.221867
6       Geospiza_fuliginosa   fuliginosa 4.132957 2.806514 2.094971 1.941157 1.845379
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
9     Camarhynchus_parvulus     parvulus 4.131600 2.973060 1.974420 1.873540 1.813340
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
11    Pinaroloxias_inornata Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

# looks identical to df[df$wingL >= 4, ] above 

# assign an object that is the mean of tarsusL from the subsetted data df1
> mean.tl <- mean(df1$tarsusL)

# set a second condition: tarsusL in df1 that is greater than mean.tl
> cond2 <- df1$tarsusL > mean.tl
# view the condition
> cond2
[1]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

# subset df1 according to condition cond2 and call that object df2
> df2 <- df1[cond2, ]
# see what df2 looks like
> df2
                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

There.
Really clean, readable and easy to see what stage of your operations you're at.
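If you want to convince yourself that the one-liner and the step-by-step version give the same answer, you can compare them with identical(). Below is a minimal sketch; the data frame toy and all of its values are made up as a stand-in for the finch data, and I'm assuming there are no missing values:

```r
# Hypothetical stand-in for the finch data (names and values are made up)
toy <- data.frame(
  Taxon   = c("sp_a", "sp_b", "sp_c", "sp_d", "sp_e"),
  wingL   = c(4.4, 4.3, 3.9, 4.2, 4.1),
  tarsusL = c(3.0, 2.9, 3.1, 3.2, 2.8),
  stringsAsFactors = FALSE
)

# Version 1: everything nested in a single expression
nested <- toy[toy$wingL >= 4, ][toy[toy$wingL >= 4, ]$tarsusL >
                                  mean(toy[toy$wingL >= 4, ]$tarsusL), ]

# Version 2: one named object per step
cond1   <- toy$wingL >= 4
toy1    <- toy[cond1, ]
mean.tl <- mean(toy1$tarsusL)
cond2   <- toy1$tarsusL > mean.tl
toy2    <- toy1[cond2, ]

# Both versions should return exactly the same rows
print(identical(nested, toy2))
print(toy2)
```

The result is the same either way; the difference is only in how easy each version is to read and check.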

The second series of code above is what I'm calling object-oriented programming. By storing the logical conditions as separate R objects, the subsetting step becomes really clean, and it's easier to check that you've got the right conditions set up. Most important of all, it cuts down on typing errors, especially if you're subsetting within a data.frame and you get confused about indexing columns by name using $, e.g. df[df$wingL >= 4, ].
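Because a condition stored this way is just a logical vector, you can inspect it before you subset anything. Here is one way this might look, again using a made-up data frame toy rather than the real finch data, and assuming no missing values (with NAs in wingL, sum() would return NA):

```r
# Hypothetical stand-in for the finch data (names and values are made up)
toy <- data.frame(
  Taxon   = c("sp_a", "sp_b", "sp_c", "sp_d", "sp_e"),
  wingL   = c(4.4, 4.3, 3.9, 4.2, 4.1),
  tarsusL = c(3.0, 2.9, 3.1, 3.2, 2.8),
  stringsAsFactors = FALSE
)

cond1 <- toy$wingL >= 4

print(sum(cond1))      # number of rows that satisfy the condition
print(which(!cond1))   # positions of the rows that fail it
print(table(cond1))    # counts of FALSE and TRUE

# Subset, then check that the result matches the condition
toy1 <- toy[cond1, ]
stopifnot(nrow(toy1) == sum(cond1), all(toy1$wingL >= 4))
```

sum() works here because TRUE counts as 1 and FALSE as 0.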

If you can switch to object-oriented programming in R, that will make your life a heck of a lot easier. R is really geared towards this kind of coding so you might as well use it!

Remember, efficient coding stems from inherent laziness - i.e. you don't want to repeat menial tasks too much.
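Taking the laziness one step further, if you find yourself repeating the same steps on different data, you could wrap them in a function. This is only a sketch: subset_above_mean() and its argument names are my own invention, not something from a package, the data frame toy is made up, and I'm assuming there are no missing values in the columns used.

```r
# Hypothetical helper: keep rows where filter_col >= threshold, then keep
# only those rows whose mean_col is above the mean of that subset
subset_above_mean <- function(data, filter_col, threshold, mean_col) {
  cond1    <- data[[filter_col]] >= threshold
  data1    <- data[cond1, ]
  mean.val <- mean(data1[[mean_col]])
  cond2    <- data1[[mean_col]] > mean.val
  data1[cond2, ]
}

# Made-up data to try it on
toy <- data.frame(
  Taxon   = c("sp_a", "sp_b", "sp_c", "sp_d", "sp_e"),
  wingL   = c(4.4, 4.3, 3.9, 4.2, 4.1),
  tarsusL = c(3.0, 2.9, 3.1, 3.2, 2.8),
  stringsAsFactors = FALSE
)

result <- subset_above_mean(toy, "wingL", 4, "tarsusL")
print(result)
```

Inside the function, every intermediate step is still its own named object, so it stays just as easy to check as the step-by-step version.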
