R for beginners and intermediate users 4: object oriented programming

The topic of this post was mentioned in a tangential rant in my previous post, and I thought I might as well expand on it a bit. I'm not going to talk about programming language paradigms or anything like that, since I'm not a programmer - rather, I will treat this more like a tutorial or a "Pro-tip" kind of post.

I will be focusing on an aspect of R that is often taken for granted and may not be well known to entry-level users: R is an object-oriented programming language. If you already know this, then this blog post is not for you.

First, I'll list out a few interesting/useful features of R:
  1. R is interpreted
  2. R is based on vectors
  3. R can utilise functions (i.e. functional programming)
  4. R utilises objects (object-oriented programming)
As I've already mentioned, this post will focus on point 4, that R is an object-oriented programming language (or simply that R can be object-oriented if you don't want to call R a programming language...).

We can start with a very simple situation.

Let's assign a name to a value in R:

> x <- 1

There you go. That's an object. We can call that object x, add 1 to it, and assign the result to y.

> y <- x + 1

So that's object-oriented programming right there, at least in the everyday sense of working with named objects. Instead of just displaying the result of the operation

> 1 + 1

which will return

[1] 2

We've used the object x and assigned the outcome of the operation to a new object y. This kind of programming is core to R.
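As a quick aside, here is a minimal sketch of what "object" means in practice: every value you name in R carries a class, which you can inspect with class(). The specific values checked below are just illustrations, not part of the original example.

```r
x <- 1
y <- x + 1

# Every object has a class that R uses to decide how to treat it
class(x)      # "numeric"
class(1:10)   # "integer"
class("a")    # "character"
class(mean)   # "function" - functions are objects too

y
# [1] 2
```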

Another example:

Let's say you want to calculate the mean value of a sequence from 1 to 10. You can do this directly:

> mean(1:10)
[1] 5.5

You can also do it like this:

> x <- 1:10
> x
[1]  1  2  3  4  5  6  7  8  9 10
> y <- mean(x)
> y
[1] 5.5

You might think that's an extra line of code compared to the first variation. But if you make a habit of coding in this second, object-oriented style, you will probably find that as your code gets more complicated, everything becomes easier to keep track of and, maybe more importantly, easier to debug.

For starters, you've stored the outcome of the operation as object y so you can use it later on if you need it again, and you won't have to type that operation again (which saves you from unnecessary typos).
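As a small sketch of that reuse (the names centred and above are my own, not from the original example), the stored mean y can feed straight into later steps without retyping mean(1:10):

```r
x <- 1:10
y <- mean(x)          # 5.5, computed once

# Reuse y instead of retyping mean(x)
centred <- x - y      # deviation of each value from the mean
above   <- x[x > y]   # values greater than the mean

above
# [1]  6  7  8  9 10
```

If the input changes, only the first line needs editing; everything downstream picks up the new x and y.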

Let's go for a slightly more advanced example.

Suppose we have a data.frame object df which contains some morphometric data on Darwin's finches. Let's say that you want to subset the data to those rows (species) that have wingL greater than or equal to 4:

> df[df$wingL >= 4, ]

                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
2      Geospiza_conirostris  conirostris 4.349867 2.984200 2.654400 2.513800 2.360167
3       Geospiza_difficilis   difficilis 4.224067 2.898917 2.277183 2.011100 1.929983
4         Geospiza_scandens     scandens 4.261222 2.929033 2.621789 2.144700 2.036944
5           Geospiza_fortis       fortis 4.244008 2.894717 2.407025 2.362658 2.221867
6       Geospiza_fuliginosa   fuliginosa 4.132957 2.806514 2.094971 1.941157 1.845379
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
9     Camarhynchus_parvulus     parvulus 4.131600 2.973060 1.974420 1.873540 1.813340
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
11    Pinaroloxias_inornata Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

And then if you want to extract just the species names and calculate the mean tarsusL, you can do this:

> df[df$wingL >= 4, ]$Taxon
[1] "Geospiza_magnirostris"    "Geospiza_conirostris"     "Geospiza_difficilis"      "Geospiza_scandens"        "Geospiza_fortis"          "Geospiza_fuliginosa"     
 [7] "Camarhynchus_pallida"     "Camarhynchus_parvulus"    "Camarhynchus_pauper"      "Pinaroloxias_inornata"    "Platyspiza_crassirostris" "Camarhynchus_psittacula" 

> mean(df[df$wingL >= 4, ]$tarsusL)
[1] 2.995884

Up to here, it might not be too bad to repeat the subsetting condition in every operation, but this gets annoying if you need to do something a bit more involved. For instance, subset the data as above, but then further subset to return only the rows with tarsusL above the mean tarsusL of the subsetted data:

> df[df$wingL >= 4, ][df[df$wingL >= 4, ]$tarsusL > mean(df[df$wingL >= 4, ]$tarsusL), ]
                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

That gets confusing and is really error-prone - I actually got it wrong on my first couple of tries.

What I would do instead is:

# set a condition where wingL >= 4
> cond1 <- df$wingL >= 4
# see what that looks like
> cond1
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
# it's a logical (TRUE/FALSE) vector

# Now subset according to the logical condition cond1 and call that object df1
> df1 <- df[cond1, ]

# see what the subsetted data df1 looks like
> df1
                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
2      Geospiza_conirostris  conirostris 4.349867 2.984200 2.654400 2.513800 2.360167
3       Geospiza_difficilis   difficilis 4.224067 2.898917 2.277183 2.011100 1.929983
4         Geospiza_scandens     scandens 4.261222 2.929033 2.621789 2.144700 2.036944
5           Geospiza_fortis       fortis 4.244008 2.894717 2.407025 2.362658 2.221867
6       Geospiza_fuliginosa   fuliginosa 4.132957 2.806514 2.094971 1.941157 1.845379
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
9     Camarhynchus_parvulus     parvulus 4.131600 2.973060 1.974420 1.873540 1.813340
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
11    Pinaroloxias_inornata Pinaroloxias 4.188600 2.980200 2.311100 1.547500 1.630100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

# looks identical to df[df$wingL >= 4, ] above 

# assign an object that is the mean of tarsusL from the subsetted data df1
> mean.tl <- mean(df1$tarsusL)

# set a second condition: tarsusL in df1 that is greater than mean.tl
> cond2 <- df1$tarsusL > mean.tl
# view the condition
> cond2
[1]  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE

# subset df1 according to condition cond2 and call that object df2
> df2 <- df1[cond2, ]
# see what df2 looks like
> df2
                      Taxon Name_in_Tree    wingL  tarsusL  culmenL    beakD   gonysW
1     Geospiza_magnirostris magnirostris 4.404200 3.038950 2.724667 2.823767 2.675983
7      Camarhynchus_pallida      pallida 4.265425 3.089450 2.430250 2.016350 1.949125
10      Camarhynchus_pauper       pauper 4.232500 3.035900 2.187000 2.073400 1.962100
12 Platyspiza_crassirostris   Platyspiza 4.419686 3.270543 2.331471 2.347471 2.282443
13  Camarhynchus_psittacula   psittacula 4.235020 3.049120 2.259640 2.230040 2.073940

There.
Really clean, readable and easy to see what stage of your operations you're at.

The second series of code above is what we'd call object-oriented programming. By storing the logical conditions as separate R objects, the subsetting step becomes really clean, and it's easier to check that you've got the right conditions set up. Most important of all, it reduces typing errors, especially if you're subsetting within a data.frame and get confused about indexing columns using $, e.g. df[df$wingL >= 4, ].
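If you want to convince yourself that the two versions really do the same thing, one option is to compare their results with identical(). The sketch below is self-contained, so it rebuilds a cut-down df from the 12 rows printed in this post (Taxon, wingL and tarsusL only). The one row of the full data with wingL below 4 is never shown, so it is left out here, which means the first condition is TRUE for every row in this sketch. The name one.liner is my own.

```r
# Cut-down data: the 12 rows printed in the post, three columns only
df <- data.frame(
  Taxon = c("Geospiza_magnirostris", "Geospiza_conirostris",
            "Geospiza_difficilis", "Geospiza_scandens",
            "Geospiza_fortis", "Geospiza_fuliginosa",
            "Camarhynchus_pallida", "Camarhynchus_parvulus",
            "Camarhynchus_pauper", "Pinaroloxias_inornata",
            "Platyspiza_crassirostris", "Camarhynchus_psittacula"),
  wingL = c(4.404200, 4.349867, 4.224067, 4.261222, 4.244008, 4.132957,
            4.265425, 4.131600, 4.232500, 4.188600, 4.419686, 4.235020),
  tarsusL = c(3.038950, 2.984200, 2.898917, 2.929033, 2.894717, 2.806514,
              3.089450, 2.973060, 3.035900, 2.980200, 3.270543, 3.049120),
  stringsAsFactors = FALSE
)

# Step-by-step version, storing each stage as its own object
cond1   <- df$wingL >= 4
df1     <- df[cond1, ]
mean.tl <- mean(df1$tarsusL)
cond2   <- df1$tarsusL > mean.tl
df2     <- df1[cond2, ]

# One-liner version, repeating the condition each time
one.liner <- df[df$wingL >= 4, ][df[df$wingL >= 4, ]$tarsusL > mean(df[df$wingL >= 4, ]$tarsusL), ]

# Both routes should give exactly the same data.frame
identical(df2, one.liner)
# [1] TRUE
```

The intermediate objects also give you something to inspect when the check fails: you can look at cond1, df1, mean.tl and cond2 one at a time to see where things went wrong.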

If you can switch to object-oriented programming in R, that will make your life a heck of a lot easier. R is really geared towards this kind of coding so you might as well use it!

Remember, efficient coding stems from inherent laziness - i.e. you don't want to repeat menial tasks too much.
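Taking the laziness one step further, the same sequence of steps could be wrapped in a function so you never have to retype it. This is only a sketch: the function subset_above_mean and its argument names are hypothetical, and it is demonstrated on R's built-in iris dataset as a stand-in, since the finch data is not included with this post.

```r
# Hypothetical helper: keep rows where filter_col >= min_value, then keep
# only those whose mean_col is above the mean of that filtered subset
subset_above_mean <- function(data, filter_col, min_value, mean_col) {
  cond1    <- data[[filter_col]] >= min_value
  step1    <- data[cond1, ]
  mean_val <- mean(step1[[mean_col]])
  cond2    <- step1[[mean_col]] > mean_val
  step1[cond2, ]
}

# Stand-in usage with the built-in iris data
res <- subset_above_mean(iris, "Sepal.Length", 6, "Sepal.Width")
head(res)
```

With the finch data loaded, the equivalent call would be subset_above_mean(df, "wingL", 4, "tarsusL").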
