
R for beginners and intermediate users: reading and manipulating data






I had been preparing a comprehensive tutorial on how to plot in R (The R Project) with different groups shown in different colours, but Blogger stupidly erased my post and decided to automatically save my empty draft at that precise moment. Since I cannot reproduce the original post, I decided to break it up into a series of smaller topics.

There are plenty of R resources available in various places but I found that they are frequently at one of two extremes: either too basic or too advanced.  I think of myself as an intermediate user (i.e., I can comfortably handle canned packages but want a bit more control than the default settings allow) so the type of info I find is not very helpful. So I thought it would benefit others like me if I summed up some of the simple things I learned over the last year or two.

As the first of these posts, I will deal with reading in and manipulating data.  These may be very simple and basic, but some of the things I wanted to do required a bit more than reading a manual.  I will try to explain things as simply as I can so that beginners can also get some use out of these posts.

So here we go.

First, we should set up the working directory.  This is the directory (or folder) where you want R to read in data from and write out results to.  You don't have to do this but it's sometimes useful to do so.

In Windows, you can find the menu item "Change dir..." under the "File" menu.  On a Mac this would be under the "Misc" menu.  This prompts you to select a directory.  I don't particularly like this approach because it takes time to navigate through many levels of directories to get to the one you are looking for; e.g. select "C Drive", select "Users", select "YOUR USERNAME", select "Documents"… etc… or whatever your pathway is.

An alternative is to use the setwd() function, for instance like this:

  setwd("C:/Users/User Name/Documents/FOLDER")

Note that the pathway (C:/…) has to be within straight quotes ("…") and the pathway separators are slashes (/) instead of the backslashes (\) that Windows displays. If you are unsure whether you have set your working directory correctly, you can check by asking R for the current working directory with getwd().
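As a minimal sketch of setting and checking the working directory, the snippet below uses tempdir() as a stand-in for your own folder (that substitution is an assumption made only so the example runs anywhere) and restores the original directory at the end.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is a stand-in for your own folder,
# e.g. "C:/Users/User Name/Documents/FOLDER"
setwd(tempdir())

# Check where R is now reading from and writing to
new_wd <- getwd()
print(new_wd)

# Go back to where we started
setwd(old_wd)
```

Storing the old directory first is just a convenience; it lets you switch folders temporarily without losing track of where you were.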

Now that you have set your working directory we can start reading in our data. This would require that you have your data stored as a tab delimited txt file or something similar like comma delimited csv file for instance.  For this example, I will use my published dataset of theropod biting performance measures.  The txt file looks roughly as follows:


  Taxa                          B0            B1          B2           Family
  Acrocanthosaurus     0.307931296    -0.00329298   3.28E-05    Allosauroidea
  Allosaurus           0.302008604   -0.002847656   2.04E-05    Allosauroidea
  Archaeopteryx        0.142338967   -0.000870802   2.98E-06             Aves
  Bambiraptor          0.181541103      -0.001606   1.10E-05  Dromaeosauridae
  Baryonychid          0.189377202    -0.00237557   2.20E-05  Basal_Tetanurae
  Carcharodontosaurus  0.368623687   -0.005015715   5.82E-05    Allosauroidea


  .
  .
  .

The first column contains the names of the theropods, second to fourth the data and the fifth column the family names, as evident from the first row.  We want to keep this structure so we will read in the data telling R to acknowledge the first row as the header and the first column as the row names:

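  data <- read.table("FILENAME.txt", header=TRUE, row.names=1, stringsAsFactors=TRUE)

Here the data is read in and stored as an object called “data”.  The FILENAME has to be within straight double quotes ("").  The bit “header=TRUE” (which can be abbreviated to “header=T”) specifies that the first row is a header and “row.names=1” specifies the first column as row names.  The bit “stringsAsFactors=TRUE” tells R to read text columns such as “Family” as factors; this was the default before R 4.0.0, but newer versions read text as plain character strings unless told otherwise.  You can review your data by typing in “data” which would print out your data table, or you can type “str(data)” which will show you a compact description of the structure of your object “data”. The latter will print a summary that looks like this: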


  > str(data)
  'data.frame':   42 obs. of  4 variables:
   $ B0    : num  0.308 0.302 0.142 0.182 0.189 ...
   $ B1    : num  -0.00329 -0.00285 -0.00087 -0.00161 -0.00238 ...
   $ B2    : num  3.28e-05 2.04e-05 2.98e-06 1.10e-05 2.20e-05 ...
   $ Family: Factor w/ 13 levels "Allosauroidea",..: 1 1 2 6 4 1 8 8 11 5 ...


This tells us that object “data” is of the class “data.frame” with 42 observations (our 42 dinosaurs) and 4 variables (B0, B1, B2, and Family). Variables “B0”, “B1”, and “B2” are numerical data but “Family” is a factor.  For some analyses like principal components analysis, non-numerical variables like “Family” cannot be included, so we will have to exclude this variable (more on this later).  The variables (or any other content of an object) are indicated by a “$” and you can always call up an individual variable within an object, e.g. “data$B0”.  This is useful when you want to use specific components of an object for analyses (for instance a regression of B0 against B1) or plotting (e.g. B0 against B1) (more on plotting in my next post).
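To try the reading step without the original file, here is a small self-contained sketch. It writes only the six rows shown in the excerpt above to a temporary file (the file name and the use of tempfile() are assumptions for illustration), reads it back, and inspects it. stringsAsFactors = TRUE is set explicitly because from R 4.0.0 onwards text columns are read as character strings by default.

```r
# Build a small tab-delimited file from the six rows shown in the post
# (the real dataset has 42 rows; this is only a stand-in)
sample_lines <- c(
  "Taxa\tB0\tB1\tB2\tFamily",
  "Acrocanthosaurus\t0.307931296\t-0.00329298\t3.28E-05\tAllosauroidea",
  "Allosaurus\t0.302008604\t-0.002847656\t2.04E-05\tAllosauroidea",
  "Archaeopteryx\t0.142338967\t-0.000870802\t2.98E-06\tAves",
  "Bambiraptor\t0.181541103\t-0.001606\t1.10E-05\tDromaeosauridae",
  "Baryonychid\t0.189377202\t-0.00237557\t2.20E-05\tBasal_Tetanurae",
  "Carcharodontosaurus\t0.368623687\t-0.005015715\t5.82E-05\tAllosauroidea"
)
sample_file <- tempfile(fileext = ".txt")
writeLines(sample_lines, sample_file)

# First row is the header, first column becomes the row names,
# and text columns are converted to factors
data <- read.table(sample_file, header = TRUE, row.names = 1,
                   sep = "\t", stringsAsFactors = TRUE)

str(data)            # compact description of the object
data$B0              # one variable, pulled out with $
levels(data$Family)  # the distinct family names
```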

Next, I’d like to explain briefly the structure of R data tables. For instance, “data” is a data frame with 42 rows and 4 columns, and R addresses tables in the form [rows,columns].  So if you want to see the B2 value for Allosaurus then you would type “data[2,3]” because Allosaurus is the second row and B2 is the third column, and R will return that value, which is “2.04e-05”.  Similarly, if you want to review all the values for B0, then you would type “data[,1]” to call up the entire first column (or alternatively you can type “data$B0” as I’ve described above). If you want to review all the values for a given taxon (row), let’s say Allosaurus, then you would type “data[2,]”, which returns:


  > data[2,]
                       B0              B1        B2         Family
  Allosaurus    0.3020086    -0.002847656  2.04e-05   Allosauroidea
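The [rows,columns] indexing described above can be tried on a small stand-in table. The sketch below rebuilds only the six rows shown earlier (not the full 42), so the dimensions differ from the real dataset; indexing by name, shown alongside the numeric positions, is an extra that is not used elsewhere in this post.

```r
# Six-row stand-in for the full dataset, built directly in R
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967,
         0.181541103, 0.189377202, 0.368623687),
  B1 = c(-0.00329298, -0.002847656, -0.000870802,
         -0.001606, -0.00237557, -0.005015715),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06, 1.10e-05, 2.20e-05, 5.82e-05),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves",
                    "Dromaeosauridae", "Basal_Tetanurae", "Allosauroidea")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx",
                "Bambiraptor", "Baryonychid", "Carcharodontosaurus")
)

dim(data)                 # number of rows, then number of columns
data[2, 3]                # row 2 (Allosaurus), column 3 (B2)
data["Allosaurus", "B2"]  # the same cell, addressed by name
data[, 1]                 # the whole first column (same as data$B0)
data[2, ]                 # the whole second row
```

Addressing cells by name is a little more typing but keeps working if the rows or columns are later reordered.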

Now we can move on to manipulating data in the simplest ways. As I’ve mentioned above, some analyses don’t like non-numerical data and we would have to eliminate the column “Family” from “data” for these analyses.  One way to do this is to compile a new table using the cbind() function like this:

  data2 <- cbind(data$B0, data$B1, data$B2)

This will bind the vectors “data$B0”, “data$B1”, and “data$B2” together into a table. Unfortunately, the row names and column headers are stripped in the process so we have to assign them again.  For row names we can simply take them from “data”:

  rownames(data2) <- rownames(data)

Column names on the other hand are a bit more troublesome as there are four columns in “data” and only three in “data2”.  We have to directly name them like this:

  colnames(data2) <- c("B0", "B1", "B2")

The function cbind() also creates an object of class “matrix” when it is given plain vectors, so if you want a “data.frame” instead (which is useful if you want to use the $ operator to call individual columns) then we’d need to reassign “data2” as a data.frame object:

  data2 <- data.frame(data2)
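The cbind() route can be shortened a little. As a sketch on a six-row stand-in for the dataset, the code below names the columns inside cbind() so they do not need to be assigned afterwards, and then shows an alternative that is not covered in this post: picking columns by name, which keeps both the row names and the data.frame class.

```r
# Six-row stand-in for the full dataset
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967,
         0.181541103, 0.189377202, 0.368623687),
  B1 = c(-0.00329298, -0.002847656, -0.000870802,
         -0.001606, -0.00237557, -0.005015715),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06, 1.10e-05, 2.20e-05, 5.82e-05),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves",
                    "Dromaeosauridae", "Basal_Tetanurae", "Allosauroidea")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx",
                "Bambiraptor", "Baryonychid", "Carcharodontosaurus")
)

# Naming the arguments gives the matrix its column names straight away
data2 <- cbind(B0 = data$B0, B1 = data$B1, B2 = data$B2)
is.matrix(data2)                  # cbind() on vectors gives a matrix
rownames(data2) <- rownames(data) # row names still have to be copied over
data2 <- as.data.frame(data2)     # convert so that data2$B0 works

# Alternative: select the wanted columns by name in one step
data2b <- data[, c("B0", "B1", "B2")]
```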

Using cbind() to create a data table of desired columns is fine just as long as the number of variables is manageable.  In many cases (such as large multivariate data sets) this is not practical, so we need to resort to an alternative, which is to delete columns or rows.  This simple procedure of deleting rows/columns is not obvious in R and it took me a bit of searching before I found how to do it.  Let’s start with deleting a variable, in our example, the non-numerical variable “Family”.  Since “Family” is the fourth column in “data”, we have to somehow eliminate data[,4].  It turns out that it is actually quite simple; just put a "-" in front of the column (or row) number:

  data3 <- data[,-4]

By typing in “length(data3[1,])”, which shows you the number of items in the first row of the new data set “data3”, R should return a value of “3” (“ncol(data3)” gives the same answer more directly).  The command “str(data3)” should also give a short summary with three variables.
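Dropping a column by its position works well when you know the position. As a hedged sketch on a six-row stand-in for the dataset, the code below repeats the negative-index step and then shows two alternatives that are not part of this post: dropping the column by name, and keeping only the numeric columns, which may be handier for wide tables.

```r
# Six-row stand-in for the full dataset
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967,
         0.181541103, 0.189377202, 0.368623687),
  B1 = c(-0.00329298, -0.002847656, -0.000870802,
         -0.001606, -0.00237557, -0.005015715),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06, 1.10e-05, 2.20e-05, 5.82e-05),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves",
                    "Dromaeosauridae", "Basal_Tetanurae", "Allosauroidea")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx",
                "Bambiraptor", "Baryonychid", "Carcharodontosaurus")
)

# Drop the fourth column by position, as in the post
data3 <- data[, -4]
ncol(data3)   # number of columns left

# Alternative 1: drop the column by name instead of position
data3_by_name <- data[, names(data) != "Family"]

# Alternative 2: keep every column that holds numbers
numeric_cols <- sapply(data, is.numeric)
data3_numeric <- data[, numeric_cols]
```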

The same can be done for rows; just put a "-" in front of the row number you wish to eliminate.  For instance, if we want to delete Allosaurus from “data3”, then we would type:

  data4 <- data3[-2,]

We can also delete multiple rows (or columns) at once.  I will give an example first:

  data5 <- data3[-c(2,7),]

Here, I specified the second and seventh rows to be deleted from “data3”. The “c(2,7)” combines the values “2” and “7” into a vector, which is the format R expects for a set of values (strictly speaking, a “list” is a different kind of object in R).  So the row part of data3[row,column] is a vector containing the values “2” and “7”, and the "-" in front of it tells R to leave out the rows in this vector. Of course, you can always simply repeat the code used to produce “data4” (see above) and eventually get the same thing as “data5” but that involves some tedious coding if you have a lot of rows to eliminate.
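Here is a small sketch of deleting rows on a six-row stand-in for the dataset. Because the stand-in has only six rows, it drops rows 2 and 6 rather than 2 and 7. The last step, dropping taxa by name with %in%, is an alternative not covered in this post, and the vector name drop_taxa is made up for the example.

```r
# Six-row, numeric-only stand-in for "data3"
data3 <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967,
         0.181541103, 0.189377202, 0.368623687),
  B1 = c(-0.00329298, -0.002847656, -0.000870802,
         -0.001606, -0.00237557, -0.005015715),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06, 1.10e-05, 2.20e-05, 5.82e-05),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx",
                "Bambiraptor", "Baryonychid", "Carcharodontosaurus")
)

# Drop a single row by position (row 2 is Allosaurus)
data4 <- data3[-2, ]

# Drop several rows at once with a vector of positions
data5 <- data3[-c(2, 6), ]

# Alternative: drop rows by name, so you need not count positions
drop_taxa <- c("Allosaurus", "Carcharodontosaurus")
data5_by_name <- data3[!(rownames(data3) %in% drop_taxa), ]

rownames(data5)
```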

Multiple columns can also be deleted simultaneously in a similar manner:

  data6 <- data[,-c(3,4)]

This removes columns 3 and 4 from the original data set “data” (which incidentally is still stored within R’s memory as a separate object because all the data manipulation has been stored under new names each time, i.e. “dataN”).  The resulting “data6” should now have two columns, “B0” and “B1”.
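One thing to watch for when deleting several columns: if only one column is left, R simplifies the result to a plain vector and the row names are lost. The sketch below, again on a six-row stand-in for the dataset, shows the two-column case from above and then the single-column case with and without drop=FALSE, an argument that is not mentioned elsewhere in this post.

```r
# Six-row stand-in for the full dataset
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967,
         0.181541103, 0.189377202, 0.368623687),
  B1 = c(-0.00329298, -0.002847656, -0.000870802,
         -0.001606, -0.00237557, -0.005015715),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06, 1.10e-05, 2.20e-05, 5.82e-05),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves",
                    "Dromaeosauridae", "Basal_Tetanurae", "Allosauroidea")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx",
                "Bambiraptor", "Baryonychid", "Carcharodontosaurus")
)

# Remove columns 3 and 4; two columns remain, so we still get a data frame
data6 <- data[, -c(3, 4)]
names(data6)

# Remove columns 2 to 4; only one column remains and R returns a vector
only_b0 <- data[, -c(2, 3, 4)]
is.data.frame(only_b0)

# drop = FALSE keeps the result as a one-column data frame with row names
only_b0_df <- data[, -c(2, 3, 4), drop = FALSE]
is.data.frame(only_b0_df)
```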

I think that’s enough for now.  In my next post I will either explain how to deal with missing data or how to plot basic X-Y plots but with colours (families plotted in different colours).







Comments

Malacoda said…
Great stuff Mambo, these things are really useful!
Raptor's Nest said…
Thanks Graeme! We need to make more of these!
Nick said…
"I think that’s enough for now. In my next post I will either explain how to deal with missing data or how to plot basic X-Y plots but with colours (families plotted in different colour)."

Can't wait to see more.

Nick
