Friday, September 10, 2010

R for beginners and intermediate users: reading and manipulating data






I had been preparing a comprehensive tutorial on how to plot in R (The R Project) with different groups differentiated in different colours, but Blogger stupidly erased my post and decided to automatically save my empty draft at that precise moment. Since I cannot reproduce the original post, I decided to break it up into a series of smaller topics.

There are plenty of R resources available in various places, but I found that they frequently fall into one of two extremes: either too basic or too advanced.  I think of myself as an intermediate user (i.e., I can comfortably handle canned packages but want a bit more control than the default settings allow), so the information I find is often not much help. So I thought it would benefit others like me if I summed up some of the simple things I learned over the last year or two.

As the first of these posts, I will deal with reading in and manipulating data.  These topics may be very simple and basic, but some of the things I wanted to do required a bit more than reading a manual.  I will try to explain things as simply as I can so that beginners can also get some use out of these posts.

So here we go.

First, we should set up the working directory.  This is the directory (or folder) where you want R to read in data from and write out results to.  You don't have to do this but it's sometimes useful to do so.

In Windows, you can find a menu item "Change dir..." under the "File" menu.  On a Mac the equivalent is under the "Misc" menu.  This prompts you to select a directory.  I don't particularly like this approach because it takes time to navigate through many levels of directories to get to the one you are looking for; e.g. select "C Drive", select "Users", select "YOUR USERNAME", select "Documents"… etc… or whatever your pathway is.

An alternative is to use the setwd() function, for instance like this:

  setwd("C:/Users/User Name/Documents/FOLDER")

Note that the pathway (C:/…) has to be within quotes (“…”) and the pathway separators are slashes (/) instead of backslashes (\) as in Windows pathway displays (a single backslash is an escape character inside an R string, so you would otherwise have to double each one). If you are unsure whether you have set your working directory correctly, you can check by calling getwd(), which returns the current working directory.
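Here is a small sketch of the set-and-check routine that you can paste into R without editing any pathway. It uses tempdir() as a stand-in for your own folder (that substitution is my assumption, purely so the code runs on any machine), and the name "old_wd" is just a label I picked.

```r
# Remember the current working directory so we can go back to it
old_wd <- getwd()

# Stand-in for your own folder: tempdir() always exists, so setwd() cannot fail here
setwd(tempdir())

# Check where R is now reading from and writing to
print(getwd())

# Go back to where we started
setwd(old_wd)
```

With your own data you would replace tempdir() with a quoted pathway such as the one in the setwd() line above.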

Now that you have set your working directory, we can start reading in our data. This requires that your data be stored as a tab-delimited txt file or something similar, such as a comma-delimited csv file.  For this example, I will use my published dataset of theropod biting performance measures.  The txt file looks roughly as follows:


  Taxa                          B0            B1          B2           Family
  Acrocanthosaurus     0.307931296    -0.00329298   3.28E-05    Allosauroidea
  Allosaurus           0.302008604   -0.002847656   2.04E-05    Allosauroidea
  Archaeopteryx        0.142338967   -0.000870802   2.98E-06             Aves
  Bambiraptor          0.181541103      -0.001606   1.10E-05  Dromaeosauridae
  Baryonychid          0.189377202    -0.00237557   2.20E-05  Basal_Tetanurae
  Carcharodontosaurus  0.368623687   -0.005015715   5.82E-05    Allosauroidea


  .
  .
  .

The first column contains the names of the theropods, second to fourth the data and the fifth column the family names, as evident from the first row.  We want to keep this structure so we will read in the data telling R to acknowledge the first row as the header and the first column as the row names:

  data <- read.table("FILENAME.txt", header=TRUE, row.names=1)

Here the data is read in and stored as an object called “data”.  The FILENAME has to be within “”.  The bit “header=TRUE” (which can be abbreviated to “header=T”) specifies that the first row is a header and “row.names=1” specifies the first column as row names.  You can review your data by typing in “data”, which prints out your data table, or you can type “str(data)”, which shows you a compact description of the structure of your object “data”. The latter will print a summary that looks like this:


  > str(data)
  'data.frame':   42 obs. of  4 variables:
   $ B0    : num  0.308 0.302 0.142 0.182 0.189 ...
   $ B1    : num  -0.00329 -0.00285 -0.00087 -0.00161 -0.00238 ...
   $ B2    : num  3.28e-05 2.04e-05 2.98e-06 1.10e-05 2.20e-05 ...
   $ Family: Factor w/ 13 levels "Allosauroidea",..: 1 1 2 6 4 1 8 8 11 5 ...


This tells us that object “data” is of the class “data.frame” with 42 observations (our 42 dinosaurs) and 4 variables (B0, B1, B2, and Family). Variables “B0”, “B1”, and “B2” are numerical data but “Family” is a factor.  (This output is from an older version of R. From R 4.0.0 onwards, read.table() leaves text columns as character by default, so “Family” would show up as “chr” unless you add “stringsAsFactors=TRUE” to the read.table() call.)  For some analyses like principal components analysis, non-numerical variables like “Family” cannot be included, so we will have to exclude this variable (more on this later).  The variables of a data frame (or the named components of other objects) are accessed with a “$”, so you can always call up an individual variable within an object, e.g. “data$B0”.  This is useful when you want to use specific components of an object for analyses (for instance a regression of B0 against B1) or for plotting (e.g. B0 against B1; more on plotting in my next post).
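If you want to try the reading step without having the full data file, the sketch below writes the six rows from the excerpt above into a temporary tab-delimited file and reads them back. The names "demo" and "tmp" are my own labels, the explicit sep="\t" is there because I wrote the file with tabs, and stringsAsFactors=TRUE is only needed in newer versions of R to get “Family” as a factor.

```r
# The six rows shown in the excerpt, written as tab-delimited text
rows <- c(
  "Taxa\tB0\tB1\tB2\tFamily",
  "Acrocanthosaurus\t0.307931296\t-0.00329298\t3.28E-05\tAllosauroidea",
  "Allosaurus\t0.302008604\t-0.002847656\t2.04E-05\tAllosauroidea",
  "Archaeopteryx\t0.142338967\t-0.000870802\t2.98E-06\tAves",
  "Bambiraptor\t0.181541103\t-0.001606\t1.10E-05\tDromaeosauridae",
  "Baryonychid\t0.189377202\t-0.00237557\t2.20E-05\tBasal_Tetanurae",
  "Carcharodontosaurus\t0.368623687\t-0.005015715\t5.82E-05\tAllosauroidea"
)

# Temporary file standing in for FILENAME.txt
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

# First row is the header, first column becomes the row names
demo <- read.table(tmp, header = TRUE, row.names = 1, sep = "\t",
                   stringsAsFactors = TRUE)

str(demo)   # compact description of the structure
demo$B0     # a single variable pulled out with $
```

Because this is only the six-row excerpt, str(demo) reports 6 observations rather than the 42 in the full data set.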

Next, I’d like to explain briefly the structure of R data tables. For instance, “data” is a 42 by 4 table in terms of rows vs columns, which is how R handles tables; the format R uses to address a table is [rows,columns].  So if you want to see the B2 value for Allosaurus then you would type “data[2,3]”, because Allosaurus is the second row and B2 is the third column, and R will return that value, which is “2.04e-05”.  Similarly, if you want to review all the values for B0, then you would type “data[,1]” to call up the entire first column (or alternatively you can type “data$B0” as I’ve described above). If you want to review all the values for a given taxon (row), let’s say Allosaurus, then you would type “data[2,]”, which returns:


  > data[2,]
                       B0              B1        B2         Family
  Allosaurus    0.3020086    -0.002847656  2.04e-05   Allosauroidea
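The [rows,columns] addressing can be tried out on a cut-down table. This is only a sketch: “demo” is a made-up name for a data frame holding the first three rows of the excerpt, typed in by hand rather than read from the file. The last line, addressing by row and column name instead of number, is an extra that I find handy but is not needed for anything else here.

```r
# First three rows of the excerpt, typed in directly
demo <- data.frame(
  B0     = c(0.307931296, 0.302008604, 0.142338967),
  B1     = c(-0.00329298, -0.002847656, -0.000870802),
  B2     = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

demo[2, 3]                # row 2, column 3: the B2 value for Allosaurus
demo[, 1]                 # whole first column, same values as demo$B0
demo[2, ]                 # whole second row (Allosaurus)
demo["Allosaurus", "B2"]  # the same cell, addressed by names
```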

Now we can move on to manipulating data in the simplest ways. As I’ve mentioned above, some analyses don’t like non-numerical data and we would have to eliminate the column “Family” from “data” for these analyses.  One way to do this is to compile a new table using the cbind() function like this:

  data2 <- cbind(data$B0, data$B1, data$B2)

This will bind the vectors “data$B0”, “data$B1”, and “data$B2” together into a table. Unfortunately, the row names and column headers are stripped in the process so we have to assign them again.  For row names we can simply take them from “data”:

  rownames(data2) <- rownames(data)

Column names, on the other hand, are a bit more troublesome, as there are four columns in “data” and only three in “data2”, so we cannot simply copy them across wholesale.  We have to name them directly like this:

  colnames(data2) <- c("B0", "B1", "B2")

The function cbind() also creates an object of class “matrix” when it is given plain vectors like these, so if you want a “data.frame” instead (which is useful if you want to use the $ operator to call individual columns) then we need to convert “data2” into a data.frame object:

  data2 <- data.frame(data2)
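Put together, the cbind() route looks something like the sketch below. Again “demo” is a hand-typed stand-in for the first three rows of the excerpt, and “demo2” and “was_matrix” are names I made up for illustration.

```r
# Hand-typed stand-in for the first three rows of the data
demo <- data.frame(
  B0     = c(0.307931296, 0.302008604, 0.142338967),
  B1     = c(-0.00329298, -0.002847656, -0.000870802),
  B2     = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Bind the three numerical columns together; the result has no names yet
demo2 <- cbind(demo$B0, demo$B1, demo$B2)
was_matrix <- is.matrix(demo2)   # TRUE: cbind() on vectors gives a matrix

# Put the row names and column names back
rownames(demo2) <- rownames(demo)
colnames(demo2) <- c("B0", "B1", "B2")

# Convert to a data frame so that demo2$B0 works
demo2 <- data.frame(demo2)
demo2$B0
```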

Using cbind() to create a data table of desired columns is fine as long as the number of variables is manageable.  In many cases (such as large multivariate data sets) this is not practical, so we need to resort to an alternative, which is to delete columns or rows.  This simple procedure of deleting rows/columns is not obvious in R and it took me a bit of searching before I found how to do it.  Let’s start with deleting a variable, in our example the non-numerical variable “Family”.  Since “Family” is the fourth column in “data”, we have to somehow eliminate data[,4].  It turns out that it is actually quite simple; just put a “-” in front of the column (or row) number:

  data3 <- data[,-4]

If you type in “length(data3[1,])”, which shows you the number of items in the first row of the new data set “data3”, R should return a value of “3” (“ncol(data3)” gives the same answer more directly).  The command “str(data3)” should also give a short summary with three variables.
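As a quick check of the negative-index trick, here is a sketch on a hand-typed stand-in for the first three rows of the data (the names “demo” and “demo3” are mine):

```r
# Hand-typed stand-in for the first three rows of the data
demo <- data.frame(
  B0     = c(0.307931296, 0.302008604, 0.142338967),
  B1     = c(-0.00329298, -0.002847656, -0.000870802),
  B2     = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Drop the fourth column ("Family"); everything else is kept as it was
demo3 <- demo[, -4]

length(demo3[1, ])  # number of items in the first row
ncol(demo3)         # number of columns, the same number
str(demo3)          # only numerical variables remain
```

Unlike the cbind() route, the row names and column names survive, and the result is still a data frame.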

The same can be done for rows; just put a “-” in front of the row number you wish to eliminate.  For instance, if we want to delete Allosaurus from “data3”, then we would type:

  data4 <- data3[-2,]

We can also delete multiple rows (or columns) at once.  I will give an example first:

  data5 <- data3[-c(2,7),]

Here, I specified the second and seventh rows to be deleted from “data3”. The “c(2,7)” combines the values “2” and “7” into a vector; this is the format that R uses for a series of values.  So the row part of data3[row,column] is a vector containing the values “2” and “7”, and the “-” in front of it tells R to drop the rows at those positions. Of course, you can always simply repeat the code used to produce “data4” (see above) and eventually get the same thing as “data5”, but that involves some tedious coding if you have a lot of rows to eliminate.
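The row deletions can be sketched on the six rows of the excerpt. Since the excerpt has no seventh row, I drop rows 2 and 6 here instead of 2 and 7; that change, and the names “demo3”, “demo4” and “demo5”, are my own for this illustration.

```r
# The three numerical columns for the six taxa in the excerpt
demo3 <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967,
         0.181541103, 0.189377202, 0.368623687),
  B1 = c(-0.00329298, -0.002847656, -0.000870802,
         -0.001606, -0.00237557, -0.005015715),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06, 1.10e-05, 2.20e-05, 5.82e-05),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx",
                "Bambiraptor", "Baryonychid", "Carcharodontosaurus")
)

# Drop one row: Allosaurus is row 2
demo4 <- demo3[-2, ]
rownames(demo4)

# Drop several rows at once: rows 2 and 6 (Allosaurus and Carcharodontosaurus)
demo5 <- demo3[-c(2, 6), ]
rownames(demo5)
```

Note that after a row is dropped, the rows below it move up by one, so repeating single deletions means recounting the row numbers each time; doing it in one go with c() avoids that.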

Multiple columns can also be deleted simultaneously in a similar manner:

  data6 <- data[,-c(3,4)]

This removes columns 3 and 4 from the original data set “data” (which incidentally is still stored within R’s memory as a separate object because all the data manipulation has been stored under new names each time, i.e. “dataN”).  The resulting “data6” should now have two columns, “B0” and “B1”.
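The same kind of check works for dropping several columns. As before, this is a sketch on a hand-typed stand-in for the first three rows of the data, with “demo” and “demo6” as made-up names:

```r
# Hand-typed stand-in for the first three rows of the data
demo <- data.frame(
  B0     = c(0.307931296, 0.302008604, 0.142338967),
  B1     = c(-0.00329298, -0.002847656, -0.000870802),
  B2     = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Drop columns 3 and 4 ("B2" and "Family") in one step
demo6 <- demo[, -c(3, 4)]

names(demo6)  # the two columns that are left
names(demo)   # the original still has all four columns
```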

I think that’s enough for now.  In my next post I will either explain how to deal with missing data or how to plot basic X-Y plots but with colours (families plotted in different colour).







3 comments:

Malacoda said...

Great stuff Mambo, these things are really useful!

Raptor's Nest said...

Thanks Graeme! We need to make more of these!

Nick Gardner said...

"I think that’s enough for now. In my next post I will either explain how to deal with missing data or how to plot basic X-Y plots but with colours (families plotted in different colour)."

Can't wait to see more.

Nick