Tuesday, September 28, 2010

Quick update - two year old Pachyrhinosaurus project

I don't know if anyone remembers this ancient post and this follow-up on my Pachyrhinosaurus reconstruction, but just yesterday I pulled out my half-finished drawing and started finishing it.  I just realised that my original post was about 2 years ago; it's about time I finished the darned thing.  I've completely abandoned layering by anatomy (e.g. layers of muscle, skin) and reverted to my comfortable method of just fleshing it out the way I like. I've realised it's the only way to get it finished!

I also have to make my post on plotting in R...

Friday, September 10, 2010

R for beginners and intermediate users: reading and manipulating data

I had been preparing a comprehensive tutorial on how to plot in R (The R Project) with different groups differentiated in different colours, but Blogger stupidly erased my post and decided to automatically save my empty draft at that precise moment. Since I cannot reproduce the original post, I decided to break it up into a series of smaller topics.

There are plenty of R resources available in various places, but I found that they frequently fall into one of two extremes: either too basic or too advanced.  I think of myself as an intermediate user (i.e., I can comfortably handle canned packages but want a bit more control than the default settings allow), so the type of info I find is not too helpful. So I thought it would benefit others like me if I summed up some of the simple things I learned over the last year or two.

As the first of these posts, I will deal with reading in and manipulating data.  These may be very simple and basic, but some of the things I wanted to do required a bit more than reading a manual.  I will try to explain things as simply as I can so that beginners can also find some use in these posts.

So here we go.

First, we should set up the working directory.  This is the directory (or folder) where you want R to read in data from and write out results to.  You don't have to do this but it's sometimes useful to do so.

In Windows, you can find a menu item "Change dir..." under the "File" menu.  On a Mac this would be under the "Miscellaneous" menu.  This prompts you to select a directory.  I don't particularly like this approach because it takes time to navigate through many levels of directories to get to the one you are looking for; e.g. select "C Drive", select "Users", select "YOUR USERNAME", select "Documents"… etc… or whatever your path is.

An alternative is to use the setwd() function, for instance like this:

  setwd("C:/Users/User Name/Documents/FOLDER")

Note that the path (C:/…) has to be within quotes (“…”) and that the path separators are forward slashes (/) rather than the backslashes (\) that Windows displays. If you are unsure whether you have set your working directory correctly, you can check with getwd(), which returns the current working directory.

Now that you have set your working directory we can start reading in our data. This requires that your data is stored as a plain text file, for instance a tab-delimited txt file or a comma-delimited csv file.  For this example, I will use my published dataset of theropod biting performance measures.  The txt file looks roughly as follows:
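As a minimal sketch of setting and checking the working directory, the snippet below uses tempdir() as a stand-in for a real project folder (that choice, and the names old_dir and example_path, are my own assumptions for illustration). It also shows file.path(), which builds a path with forward slashes from its parts.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() is only a stand-in for a real folder such as the one in setwd() above.
setwd(tempdir())
getwd()   # check where R is now pointing

# file.path() joins the parts with forward slashes, which R accepts on Windows too.
example_path <- file.path("C:", "Users", "User Name", "Documents", "FOLDER")
print(example_path)

# Go back to where we started.
setwd(old_dir)
```

If you prefer to paste a path straight from Windows, doubling each backslash (e.g. "C:\\Users\\...") also works, because a single backslash is an escape character inside R strings.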

  Taxa                          B0            B1          B2           Family
  Acrocanthosaurus     0.307931296    -0.00329298   3.28E-05    Allosauroidea
  Allosaurus           0.302008604   -0.002847656   2.04E-05    Allosauroidea
  Archaeopteryx        0.142338967   -0.000870802   2.98E-06             Aves
  Bambiraptor          0.181541103      -0.001606   1.10E-05  Dromaeosauridae
  Baryonychid          0.189377202    -0.00237557   2.20E-05  Basal_Tetanurae
  Carcharodontosaurus  0.368623687   -0.005015715   5.82E-05    Allosauroidea


The first column contains the names of the theropods, second to fourth the data and the fifth column the family names, as evident from the first row.  We want to keep this structure so we will read in the data telling R to acknowledge the first row as the header and the first column as the row names:

  data <- read.table("FILENAME.txt", header=T, row.names=1)

Here the data is read in and stored as an object called “data”.  The FILENAME has to be within quotes.  The bit “header=T” or “header=TRUE” specifies that the first row is a header and “row.names=1” specifies the first column as row names.  You can review your data by typing in “data”, which prints out your data table, or you can type “str(data)”, which shows you a compact description of the structure of your object “data”. The latter will return a summary that looks like this:
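To try the read.table() call without the original file, here is a small self-contained sketch that writes the six rows shown above to a temporary tab-delimited file and reads them back. The temporary file and the explicit sep and stringsAsFactors arguments are my additions, not part of the original call; recent versions of R (4.0 and later) read text columns as character by default, so stringsAsFactors=TRUE is what makes “Family” come back as a factor.

```r
# The six example rows from the table above, as tab-delimited lines.
rows <- c(
  "Taxa\tB0\tB1\tB2\tFamily",
  "Acrocanthosaurus\t0.307931296\t-0.00329298\t3.28E-05\tAllosauroidea",
  "Allosaurus\t0.302008604\t-0.002847656\t2.04E-05\tAllosauroidea",
  "Archaeopteryx\t0.142338967\t-0.000870802\t2.98E-06\tAves",
  "Bambiraptor\t0.181541103\t-0.001606\t1.10E-05\tDromaeosauridae",
  "Baryonychid\t0.189377202\t-0.00237557\t2.20E-05\tBasal_Tetanurae",
  "Carcharodontosaurus\t0.368623687\t-0.005015715\t5.82E-05\tAllosauroidea"
)

# A temporary file stands in for FILENAME.txt.
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)

# First row is the header, first column becomes the row names.
data <- read.table(tmp, header = TRUE, row.names = 1, sep = "\t",
                   stringsAsFactors = TRUE)

str(data)        # compact description of the object
rownames(data)   # the taxon names
```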

  > str(data)
  'data.frame':   42 obs. of  4 variables:
   $ B0    : num  0.308 0.302 0.142 0.182 0.189 ...
   $ B1    : num  -0.00329 -0.00285 -0.00087 -0.00161 -0.00238 ...
   $ B2    : num  3.28e-05 2.04e-05 2.98e-06 1.10e-05 2.20e-05 ...
   $ Family: Factor w/ 13 levels "Allosauroidea",..: 1 1 2 6 4 1 8 8 11 5 ...

This tells us that object “data” is of the class “data.frame” with 42 observations (our 42 dinosaurs) and 4 variables (B0, B1, B2, and Family). Variables “B0”, “B1”, and “B2” are numerical data but “Family” is a factor.  For some analyses like principal components analysis, non-numerical variables like “Family” cannot be included, so we will have to exclude this variable (more on this later).  The variables (or any other content of an object) are indicated by a “$” and you can always call up an individual variable within an object, e.g. “data$B0”.  This is useful when you want to use specific components of an object for analyses (for instance a regression of B0 against B1) or plotting (e.g. B0 against B1) (more on plotting in my next post).

Next, I’d like to explain briefly the structure of R data tables. For instance, “data” is a 42 by 4 table in terms of rows vs columns, and R addresses the cells of a table in the format [rows,columns].  So if you want to see the B2 value for Allosaurus then you would type “data[2,3]”, because Allosaurus is the second row and B2 is the third column, and R will return that value, which is “2.04e-05”.  Similarly, if you want to review all the values for B0, then you would type “data[,1]” to call up the entire first column (or alternatively you can type “data$B0” as I’ve described above). If you want to review all the values for a given taxon (row), let’s say Allosaurus, then you would type “data[2,]”, which returns:

  > data[2,]
                       B0              B1        B2         Family
  Allosaurus    0.3020086    -0.002847656  2.04e-05   Allosauroidea
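The same cells can also be addressed by name rather than by number, which saves counting rows and columns. Below is a small sketch using only the first three taxa from the table above, typed in by hand so that it runs on its own (the three-row subset is an assumption for illustration, not the full data set).

```r
# A three-row stand-in for the full 42-row table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

data[2, 3]                 # B2 for Allosaurus, by position
data["Allosaurus", "B2"]   # the same cell, by row and column name
data[, "B0"]               # the whole B0 column (same as data$B0)
data["Allosaurus", ]       # the whole Allosaurus row
```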

Now we can move on to manipulating data in the simplest ways. As I’ve mentioned above, some analyses don’t like non-numerical data and we would have to eliminate the column “Family” from “data” for these analyses.  One way to do this is to compile a new table using the cbind() function like this:

  data2 <- cbind(data$B0, data$B1, data$B2)

This will bind the vectors “data$B0”, “data$B1”, and “data$B2” together into a table. Unfortunately, the row names and column headers are stripped in the process so we have to assign them again.  For row names we can simply take them from “data”:

  rownames(data2) <- rownames(data)

Column names on the other hand are a bit more troublesome as there are four columns in “data” and only three in “data2”.  We have to directly name them like this:

  colnames(data2) <- c("B0", "B1", "B2")

When given plain vectors like these, the function cbind() creates an object of class “matrix”, so if you want a “data.frame” instead (which is useful if you want to use the $ operator to call individual columns) then we’d need to reassign “data2” as a data.frame object:

  data2 <- data.frame(data2)
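Two shortcuts can save some of the renaming. This is only a sketch on a three-row stand-in table typed in by hand: passing named arguments to cbind() sets the column names in one step, and selecting columns by name from the original data frame keeps both the row names and the data.frame class. The object names m and data2 here are for illustration only.

```r
# A three-row stand-in for the full table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Named arguments give cbind() its column names; the result is still a matrix.
m <- cbind(B0 = data$B0, B1 = data$B1, B2 = data$B2)
rownames(m) <- rownames(data)
class(m)
colnames(m)

# Selecting columns by name keeps row names and returns a data frame.
data2 <- data[, c("B0", "B1", "B2")]
class(data2)
rownames(data2)
```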

Using cbind() to create a data table of desired columns is fine just as long as the number of variables is manageable.  In many cases (such as large multivariate data sets) this is not practical, so we need to resort to an alternative, which is to delete columns or rows.  This simple procedure of deleting rows/columns is not obvious in R and it took me a bit of searching before I found how to do it.  Let’s start with deleting a variable, in our example, the non-numerical variable “Family”.  Since “Family” is the fourth column in “data”, we have to somehow eliminate data[,4].  It turns out that it is actually quite simple; just put a “-” in front of the column (or row) number:

  data3 <- data[,-4]

By typing in “length(data3[1,])”, which shows you the number of items in the first row of the new data set “data3”, R should return a value of “3”.  The command “str(data3)” should also give a short summary with three variables.

The same can be done for rows; just put a “-” in front of the row number you wish to eliminate.  For instance, if we want to delete Allosaurus from “data3”, then we would type:
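If counting columns is a nuisance, the non-numerical column can also be dropped by name, or all numerical columns can be kept with is.numeric(). This is a sketch on a three-row stand-in table typed in by hand; the names data3b and data3c are hypothetical and only there to keep the three results apart.

```r
# A three-row stand-in for the full table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Drop the fourth column by position, as in the text.
data3 <- data[, -4]
ncol(data3)   # number of columns left

# Drop the column called "Family" without knowing its position.
data3b <- data[, names(data) != "Family"]

# Keep every column that holds numbers, however many there are.
data3c <- data[, sapply(data, is.numeric)]
names(data3c)
```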

  data4 <- data3[-2,]

We can also delete multiple rows (or columns) at once.  I will give an example first:

  data5 <- data3[-c(2,7),]

Here, I specified the second and seventh rows to be deleted from “data3”. The “c(2,7)” combines the values “2” and “7” into a vector; this is the format that R likes for a series of values.  So the row part of data3[row,column] is a vector containing the values “2” and “7”, and the “-” in front of it tells R to delete the rows listed in this vector. Of course, you can always simply repeat the code used to produce “data4” (see above) and eventually get the same thing as “data5”, but that involves some tedious coding if you have a lot of rows to eliminate.

Multiple columns can also be deleted simultaneously in a similar manner:

  data6 <- data[,-c(3,4)]

This removes columns 3 and 4 from the original data set “data” (which incidentally is still stored within R’s memory as a separate object because all the data manipulation has been stored under new names each time, i.e. “dataN”).  The resulting “data6” should now have two columns, “B0” and “B1”.
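Two related points are worth a quick sketch, again on a three-row stand-in table typed in by hand. Rows can be removed by name instead of by number, and if you delete so many columns that only one is left, R quietly hands back a plain vector unless you add drop=FALSE. The object names no_allo, basal_only, one_col and kept are hypothetical.

```r
# A three-row stand-in for the full table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Remove one row by name.
no_allo <- data[rownames(data) != "Allosaurus", ]

# Remove several rows by name using %in%.
basal_only <- data[!(rownames(data) %in% c("Allosaurus", "Archaeopteryx")), ]

# Deleting all but one column returns a bare vector and loses the row names...
one_col <- data[, -c(2, 3, 4)]
class(one_col)

# ...unless drop = FALSE is given, which keeps the data frame structure.
kept <- data[, -c(2, 3, 4), drop = FALSE]
class(kept)
```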

I think that’s enough for now.  In my next post I will either explain how to deal with missing data or how to plot basic X-Y plots but with colours (families plotted in different colours).

Thursday, September 9, 2010

I hate Blogger autosave!

Until now, I quite liked Blogger's auto save feature.  Not any more.  I was hitting Ctrl Z to undo things until for some reason the whole post disappeared, and then at that moment, Blogger decided to auto save.... I lost a whole evening's worth of blogging and I can't remember the phrasing I used which I really liked.

Sunday, September 5, 2010


I have been meaning to write about this for the longest time, but things kept getting in my way.  Now, I have the perfect opportunity, as Paolo at Zygoma has coordinated with me to write a post on this very topic.  When Paolo was at the Bristol City Museum, I used to go bother him a lot, and together we'd go through their extensive cat skull collection.  One day, we came across a very interesting tiger skull specimen.  The box kind of said it all; it was labelled 'TIGER' on one end and 'MAN-EATER' on the other.  So we excitedly opened the box and found an isolated skull with no mandible but with a handwritten label.  The label read:

So this tiger was hunting humans for two years (how regularly, no one knows) until someone shot it dead.  Upon examining the skull it was apparent why this tiger was preferentially hunting humans; its canines are heavily worn down.

The canines even look like they could have broken and were subsequently worn down from continued use.  With its teeth so worn down this tiger must have found hunting large game difficult, so it resorted to hunting easy prey, i.e. humans. It's a pretty neat specimen, and we admired it for a while, but after our initial excitement wore off, we moved on to other specimens.

However, this specimen was yet to give us all its surprises. After I had gone through quite a few of the cat skull boxes, I came across a number of isolated mandibles; some lion, but others tiger.  I noted that the specimen numbers on these mandibles matched those of isolated skull materials that I had previously measured so I got pretty excited.  What's more, I found the mandibles to the 'man-eater' tiger skull:

As you can see, it is totally messed up (yes, that is the technical term).  This tiger had fractured its right mandible and survived long enough (at least two years judging from the label) for it to heal. However, obviously the bones had not set right and so this tiger probably couldn't bite properly. Combined with the worn-down canines it must have made hunting extremely difficult.

A further shock is how the teeth occlude.

A keen observer may have noticed an odd hole in the palate in the ventral view photo above, but I had completely missed that when I saw the isolated skull specimen and I had not noticed it until I found the mandible. It turns out the hole was caused by the lower molar biting into the palate.  It looks almost as if the bones in the palate gave way over a period of time, so this tiger was probably biting and chewing regularly for quite some time after the fracture had healed. I am not a pathologist, though, so I don't know for sure that's what happened...

Aside from the obvious job satisfaction of studying specimens in museum collections, the occasional specimen like this 'man-eater' makes it that much more fun to work with historical museum collections.  Most of the cat specimens in the Bristol Museum are trophy specimens but some have unique histories.  There are a couple more specimens from the Bristol Museum that are quite interesting, so I may post something about them in the future.

Thanks to Rhian Rowson of the Natural History Collection at the Bristol Museum for encouraging me to write this post.

And last but not least, be sure to check out Paolo's post for more man-eater tiger specimens: http://paolov.wordpress.com/2010/09/05/maneaters/