Taxa                 B0           B1            B2        Family
Acrocanthosaurus     0.307931296  -0.00329298   3.28E-05  Allosauroidea
Allosaurus           0.302008604  -0.002847656  2.04E-05  Allosauroidea
Archaeopteryx        0.142338967  -0.000870802  2.98E-06  Aves
Bambiraptor          0.181541103  -0.001606     1.10E-05  Dromaeosauridae
Baryonychid          0.189377202  -0.00237557   2.20E-05  Basal_Tetanurae
Carcharodontosaurus  0.368623687  -0.005015715  5.82E-05  Allosauroidea
...

The first column contains the names of the theropods, the second to fourth columns contain the data, and the fifth column contains the family names, as is evident from the first row. We want to keep this structure, so we will read in the data, telling R to treat the first row as the header and the first column as the row names:

data <- read.table("FILENAME.txt", header=TRUE, row.names=1, stringsAsFactors=TRUE)

Here the data is read in and stored as an object called “data”. The file name must be enclosed in quotation marks. The argument “header=TRUE” (which can be abbreviated to “header=T”) specifies that the first row is a header, and “row.names=1” specifies the first column as row names. From R 4.0.0 onwards, text columns are read in as character strings by default, so “stringsAsFactors=TRUE” is added to have “Family” read in as a factor. You can review your data by typing “data”, which prints out the whole table, or “str(data)”, which shows a compact description of the structure of your object “data”. The latter returns a summary that looks like this:

> str(data)
'data.frame': 42 obs. of 4 variables:
 $ B0    : num 0.308 0.302 0.142 0.182 0.189 ...
 $ B1    : num -0.00329 -0.00285 -0.00087 -0.00161 -0.00238 ...
 $ B2    : num 3.28e-05 2.04e-05 2.98e-06 1.10e-05 2.20e-05 ...
 $ Family: Factor w/ 13 levels "Allosauroidea",..: 1 1 2 6 4 1 8 8 11 5 ...

This tells us that the object “data” is of the class “data.frame”, with 42 observations (our 42 dinosaurs) and 4 variables (B0, B1, B2, and Family). The variables “B0”, “B1”, and “B2” are numerical data but “Family” is a factor. For some analyses, like principal components analysis, non-numerical variables like “Family” cannot be included, so we will have to exclude this variable (more on this later). The variables (or any other components of an object) are indicated by a “$”, and you can always call up an individual variable within an object, e.g. “data$B0”. This is useful when you want to use specific components of an object for analyses (for instance, a regression of B0 against B1) or for plotting (e.g. B0 against B1; more on plotting in my next post).
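If you want to try the steps so far without the original file, here is a minimal sketch that rebuilds only the six rows shown in the excerpt above (the real data set has 42). The “text=” argument stands in for the file name, and the object names “txt” and “fit” are my own choices for illustration.

```r
# Six rows copied from the excerpt at the top of the post; the real file has 42.
txt <- "Taxa B0 B1 B2 Family
Acrocanthosaurus 0.307931296 -0.00329298 3.28E-05 Allosauroidea
Allosaurus 0.302008604 -0.002847656 2.04E-05 Allosauroidea
Archaeopteryx 0.142338967 -0.000870802 2.98E-06 Aves
Bambiraptor 0.181541103 -0.001606 1.10E-05 Dromaeosauridae
Baryonychid 0.189377202 -0.00237557 2.20E-05 Basal_Tetanurae
Carcharodontosaurus 0.368623687 -0.005015715 5.82E-05 Allosauroidea"

# text= replaces the file name; stringsAsFactors = TRUE reads Family as a factor
data <- read.table(text = txt, header = TRUE, row.names = 1,
                   stringsAsFactors = TRUE)

str(data)            # compact summary of the six-row sample
data$B0              # one variable, returned as a numeric vector
levels(data$Family)  # the family names present in the sample

# A regression of B0 against B1, referring to the variables by name
fit <- lm(B0 ~ B1, data = data)
coef(fit)            # intercept and slope
```

With only six taxa the regression is just there to show the syntax, not to say anything about theropods.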

Next, I’d like to explain briefly the structure of R data tables. For instance, “data” is a table of 42 rows by 4 columns, and R addresses the cells of a table in the format [row,column]. So if you want to see the B2 value for *Allosaurus*, you would type “data[2,3]” because *Allosaurus* is the second row and B2 is the third column, and R will return the value “2.04e-05”. Similarly, if you want to review all the values for B0, you would type “data[,1]” to call up the entire first column (or alternatively you can type “data$B0” as I’ve described above). If you want to review all the values for a given taxon (row), let’s say *Allosaurus*, then you would type “data[2,]”, which returns:

> data[2,]
                  B0           B1       B2        Family
Allosaurus 0.3020086 -0.002847656 2.04e-05 Allosauroidea
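The [row,column] lookups described above can be tried on a small stand-in table. This is only a sketch using the six rows from the excerpt at the top of the post, so the row numbers happen to match the first rows of the full data set. Looking a cell up by its row and column names is an extra I’ve added; it is standard R but was not covered above.

```r
# Stand-in for the full table: the six rows shown in the excerpt
data <- read.table(text = "Taxa B0 B1 B2 Family
Acrocanthosaurus 0.307931296 -0.00329298 3.28E-05 Allosauroidea
Allosaurus 0.302008604 -0.002847656 2.04E-05 Allosauroidea
Archaeopteryx 0.142338967 -0.000870802 2.98E-06 Aves
Bambiraptor 0.181541103 -0.001606 1.10E-05 Dromaeosauridae
Baryonychid 0.189377202 -0.00237557 2.20E-05 Basal_Tetanurae
Carcharodontosaurus 0.368623687 -0.005015715 5.82E-05 Allosauroidea",
                   header = TRUE, row.names = 1, stringsAsFactors = TRUE)

data[2, 3]                # row 2, column 3: the B2 value for Allosaurus
data["Allosaurus", "B2"]  # the same cell, addressed by name instead of position
data[, 1]                 # the whole first column, same values as data$B0
data[2, ]                 # the whole second row, as a one-row data frame
```

Addressing by name is handy when you are not sure which row a taxon ended up in after sorting or deleting.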

Now we can move on to manipulating data in the simplest ways. As I’ve mentioned above, some analyses don’t like non-numerical data and we would have to eliminate the column “Family” from “data” for these analyses. One way to do this is to compile a new table using the cbind() function like this:

data2 <- cbind(data$B0, data$B1, data$B2)

This will bind the vectors “data$B0”, “data$B1”, and “data$B2” together into a table. Unfortunately, the row names and column headers are stripped in the process so we have to assign them again. For row names we can simply take them from “data”:

rownames(data2) <- rownames(data)

Column names on the other hand are a bit more troublesome as there are four columns in “data” and only three in “data2”. We have to directly name them like this:

colnames(data2) <- c("B0", "B1", "B2")

When given plain vectors like these, the function cbind() creates an object of class “matrix”, so if you want a “data.frame” instead (which is useful if you want to use the $ operator to call individual columns) then you need to convert “data2” into a data.frame object:

data2 <- data.frame(data2)
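Putting the cbind() steps together on a small stand-in table (the six rows from the excerpt at the top of the post) looks like the sketch below. The last part, “data2b”, is a shortcut of my own rather than part of the workflow above: naming the arguments inside data.frame() keeps the column names, so nothing needs to be reassigned afterwards.

```r
# Stand-in for the full table: the six rows shown in the excerpt
data <- read.table(text = "Taxa B0 B1 B2 Family
Acrocanthosaurus 0.307931296 -0.00329298 3.28E-05 Allosauroidea
Allosaurus 0.302008604 -0.002847656 2.04E-05 Allosauroidea
Archaeopteryx 0.142338967 -0.000870802 2.98E-06 Aves
Bambiraptor 0.181541103 -0.001606 1.10E-05 Dromaeosauridae
Baryonychid 0.189377202 -0.00237557 2.20E-05 Basal_Tetanurae
Carcharodontosaurus 0.368623687 -0.005015715 5.82E-05 Allosauroidea",
                   header = TRUE, row.names = 1, stringsAsFactors = TRUE)

# Bind the three numeric columns; the result is a bare matrix without names
data2 <- cbind(data$B0, data$B1, data$B2)
was_matrix <- is.matrix(data2)

# Put the row and column names back, then convert to a data frame
rownames(data2) <- rownames(data)
colnames(data2) <- c("B0", "B1", "B2")
data2 <- data.frame(data2)

# Shortcut (my own variant): name the columns as they are combined
data2b <- data.frame(B0 = data$B0, B1 = data$B1, B2 = data$B2,
                     row.names = rownames(data))
```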

Using cbind() to create a data table of desired columns is fine as long as the number of variables is manageable. In many cases (such as large multivariate data sets) this is not practical, so we need to resort to an alternative, which is to delete columns or rows. This simple procedure of deleting rows/columns is not obvious in R, and it took me a bit of searching before I found how to do it. Let’s start with deleting a variable, in our example the non-numerical variable “Family”. Since “Family” is the fourth column in “data”, we have to somehow eliminate data[,4]. It turns out that it is actually quite simple; just put a “-” in front of the column (or row) number:

data3 <- data[,-4]

If you type “length(data3[1,])”, which shows you the number of items in the first row of the new data set “data3”, R should return a value of “3” (“ncol(data3)” gives the same answer). The command “str(data3)” should also give a short list with three variables.
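Here is the column deletion as a small sketch, again on the six rows from the excerpt at the top of the post. The second half, “data3b”, is an alternative of my own and not something covered above: it keeps every numeric column without needing to know which position “Family” is in.

```r
# Stand-in for the full table: the six rows shown in the excerpt
data <- read.table(text = "Taxa B0 B1 B2 Family
Acrocanthosaurus 0.307931296 -0.00329298 3.28E-05 Allosauroidea
Allosaurus 0.302008604 -0.002847656 2.04E-05 Allosauroidea
Archaeopteryx 0.142338967 -0.000870802 2.98E-06 Aves
Bambiraptor 0.181541103 -0.001606 1.10E-05 Dromaeosauridae
Baryonychid 0.189377202 -0.00237557 2.20E-05 Basal_Tetanurae
Carcharodontosaurus 0.368623687 -0.005015715 5.82E-05 Allosauroidea",
                   header = TRUE, row.names = 1, stringsAsFactors = TRUE)

# Drop the fourth column (Family) with a negative index
data3 <- data[, -4]
ncol(data3)    # number of columns left
names(data3)   # names of the columns left

# Alternative (my own variant): keep whichever columns are numeric
is_num <- sapply(data, is.numeric)  # TRUE/FALSE for each column
data3b <- data[, is_num]
```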

The same can be done for rows; just put a “-” in front of the row number you wish to eliminate. For instance, if we want to delete *Allosaurus* from “data3”, then we would type:

data4 <- data3[-2,]

We can also delete multiple rows (or columns) at once. I will give an example first:

data5 <- data3[-c(2,7),]

Here, I specified the second and seventh rows to be deleted from “data3”. The “c(2,7)” combines the values “2” and “7” into a vector, which is the format R expects for a set of values (strictly speaking, a “list” is a different kind of object in R). So the row part of data3[row,column] is a vector containing the values “2” and “7”, and the “-” in front of it tells R to drop those rows. Of course, you could instead repeat the code that produced “data4” (see above) once per row and eventually get the same thing as “data5”, keeping in mind that the row numbers shift after each deletion, but that involves some tedious coding if you have a lot of rows to eliminate.
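A small sketch of the row deletions follows. Because it uses only the six rows from the excerpt at the top of the post, there is no seventh row, so I drop rows 2 and 5 instead of 2 and 7. Dropping rows by name, as in “data5b”, is a variant of my own; the name “unwanted” is just an illustrative choice.

```r
# Stand-in for the full table: the six rows shown in the excerpt
data <- read.table(text = "Taxa B0 B1 B2 Family
Acrocanthosaurus 0.307931296 -0.00329298 3.28E-05 Allosauroidea
Allosaurus 0.302008604 -0.002847656 2.04E-05 Allosauroidea
Archaeopteryx 0.142338967 -0.000870802 2.98E-06 Aves
Bambiraptor 0.181541103 -0.001606 1.10E-05 Dromaeosauridae
Baryonychid 0.189377202 -0.00237557 2.20E-05 Basal_Tetanurae
Carcharodontosaurus 0.368623687 -0.005015715 5.82E-05 Allosauroidea",
                   header = TRUE, row.names = 1, stringsAsFactors = TRUE)
data3 <- data[, -4]          # numeric columns only, as before

data4 <- data3[-2, ]         # drop the second row (Allosaurus)
data5 <- data3[-c(2, 5), ]   # drop rows 2 and 5 in one go

# Variant (my own): drop rows by name, so positions don't matter
unwanted <- c("Allosaurus", "Baryonychid")
data5b <- data3[!(rownames(data3) %in% unwanted), ]

rownames(data5)              # the taxa that remain
```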

Multiple columns can also be deleted simultaneously in a similar manner:

data6 <- data[,-c(3,4)]

This removes columns 3 and 4 from the original data set “data” (which incidentally is still stored in R’s memory as a separate object, because each result has been saved under a new name, i.e. “dataN”). The resulting “data6” should now have two columns, “B0” and “B1”.
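The sketch below repeats this on the six rows from the excerpt at the top of the post and checks that “data” itself is untouched. It also shows one thing not mentioned above that is worth knowing: if you delete so many columns that only one is left, R hands back a plain vector rather than a table, unless you add “drop=FALSE”. The names “one_col” and “one_col_df” are mine.

```r
# Stand-in for the full table: the six rows shown in the excerpt
data <- read.table(text = "Taxa B0 B1 B2 Family
Acrocanthosaurus 0.307931296 -0.00329298 3.28E-05 Allosauroidea
Allosaurus 0.302008604 -0.002847656 2.04E-05 Allosauroidea
Archaeopteryx 0.142338967 -0.000870802 2.98E-06 Aves
Bambiraptor 0.181541103 -0.001606 1.10E-05 Dromaeosauridae
Baryonychid 0.189377202 -0.00237557 2.20E-05 Basal_Tetanurae
Carcharodontosaurus 0.368623687 -0.005015715 5.82E-05 Allosauroidea",
                   header = TRUE, row.names = 1, stringsAsFactors = TRUE)

# Drop columns 3 and 4 (B2 and Family) at once
data6 <- data[, -c(3, 4)]
names(data6)   # columns left in data6
ncol(data)     # the original table still has all of its columns

# Leaving a single column: a vector by default, a data frame with drop = FALSE
one_col    <- data[, -c(2, 3, 4)]
one_col_df <- data[, -c(2, 3, 4), drop = FALSE]
```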

I think that’s enough for now. In my next post I will explain either how to deal with missing data or how to make basic X-Y plots with colours (families plotted in different colours).