Principal coordinate analysis and the quest for a solution to a non-existent problem

I had an interesting experience yesterday: I spent a good few hours on a silly problem. You don't need to follow the technical details of the analyses at all, but I'm sure you'll appreciate the humour in this.

I have been running a lot of principal coordinate analyses (PCoA) recently. This is because I am using an interesting application of PCoA and multiple regression to phylogeny and phenotypic variables, called phylogenetic eigenvector regression (PVR; Diniz-Filho et al., 1998; Desdevises et al., 2003). In short, you take a phylogenetic tree of a given group of animals (or plants, or whatever your favourite group of organisms is), reduce its complex topology, expressed as a matrix of pairwise phylogenetic distances, into a manageable set of columns of numbers (by PCoA), and then use multiple regression to test whether some phenotypic or ecological variable of your choice is correlated with those columns. Sounds pretty easy, and it is, in practice at least. You can script the whole thing in R very efficiently, if you already know the R language.
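For anyone who wants to see the shape of it, here is a minimal sketch of the PVR idea in R. It is not the published protocol: the random tree, the made-up trait, and the choice of three axes are all assumptions for illustration, and it assumes the ape package is installed.

```r
# PVR sketch: tree -> pairwise distances -> PCoA axes -> multiple regression.
# Everything below (random tree, random trait, k = 3) is illustrative only.
library(ape)

set.seed(42)
tree <- rtree(20)                      # random 20-tip tree standing in for a real phylogeny
D    <- cophenetic(tree)               # tip-to-tip (patristic) distance matrix
axes <- cmdscale(as.dist(D), k = 3)    # coordinates on the first three PCo axes

# Hypothetical phenotypic variable, one value per tip, in the same order as D
trait <- rnorm(nrow(D))
names(trait) <- rownames(D)

# Regress the trait on the phylogenetic axes
fit <- lm(trait ~ axes)
summary(fit)$r.squared                 # share of trait variance associated with the axes
```

With a real dataset the number of axes to keep is a decision in its own right; three is used here only to keep the example small.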

Anyway, yesterday I reread the protocol that I had been following for the last few weeks and realised that PCoA on phylogenetic trees can sometimes result in negative eigenvalues, which is kind of annoying if you think of eigenvalues as representing the "amount of variation in the data explained by that axis" (Hammer and Harper 2006); would a negative value indicate a negative contribution??? Supposedly, this happens because phylogenetic distances are not necessarily Euclidean. So I had a look at my PCoA results and realised to my horror that a lot of my values were negative. Holy shit! Do I have to go back and reanalyse?

But first, I did the sensible thing and checked whether my distances were Euclidean (using is.euclid() in R). Surprisingly, or unsurprisingly, my distances were Euclidean. Strange. But the values are negative...
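The check itself is a one-liner. A small sketch, assuming the ade4 package (where is.euclid() lives) and using two toy distance matrices rather than the distances from the actual analysis:

```r
# Checking whether a distance matrix is Euclidean with ade4::is.euclid().
library(ade4)

# Toy case 1: straight-line distances between random 2-D points
# (Euclidean by construction)
set.seed(1)
xy <- matrix(rnorm(10), ncol = 2)
d_euclid <- dist(xy)
is.euclid(d_euclid)   # TRUE

# Toy case 2: path lengths on a star-shaped tree, one hub joined to
# three leaves by branches of length 1 (leaf-to-leaf distance is 2)
labs <- c("hub", "a", "b", "c")
m <- matrix(c(0, 1, 1, 1,
              1, 0, 2, 2,
              1, 2, 0, 2,
              1, 2, 2, 0),
            nrow = 4, byrow = TRUE, dimnames = list(labs, labs))
d_tree <- as.dist(m)
is.euclid(d_tree)     # FALSE
```

The second case is the kind of thing the warning is about: no arrangement of four points in ordinary space puts the hub at distance 1 from three leaves that are all 2 apart from each other.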

I sat there scratching my head for a while.

I read further and noted that in cases where you get negative eigenvalues, you may need to transform your original distance matrix following some standard procedures.

I searched for the relevant references and found several suggested transformation procedures. All of them seemed pretty straightforward, so I tried each of them in turn.
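The specific transformations aren't named here, so the following is only an illustration of the general idea. Two options that come up often are taking the square root of the distances, and adding a constant to them (the Cailliez correction, which base R's cmdscale() exposes through add = TRUE). A sketch on toy star-tree distances, which are not Euclidean:

```r
# Toy non-Euclidean distances: path lengths on a star-shaped tree
# (one hub joined to three leaves by branches of length 1)
m <- matrix(c(0, 1, 1, 1,
              1, 0, 2, 2,
              1, 2, 0, 2,
              1, 2, 2, 0),
            nrow = 4, byrow = TRUE)
d <- as.dist(m)

# Untransformed: the smallest eigenvalue is genuinely negative
raw <- cmdscale(d, k = 2, eig = TRUE)
min(raw$eig)        # about -0.25

# Option 1: square-root transform of the distances
rooted <- cmdscale(sqrt(d), k = 2, eig = TRUE)
min(rooted$eig)     # zero, up to rounding error

# Option 2: Cailliez correction, i.e. add a constant to every
# off-diagonal distance before the analysis
shifted <- cmdscale(d, k = 2, eig = TRUE, add = TRUE)
shifted$ac          # the constant that was added
min(shifted$eig)    # zero, up to rounding error
```

Either way, the thing to inspect afterwards is the eig component of the result, not the coordinates.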

None of them worked. The negative values were still there...

I'm really stuck now. I don't know what the cause of this problem is. Is there something inherently wrong with my data? Is there some other transformation that I could still use? Is there another command in R that could solve this problem (it's really common in R to miss a basic command)? Or is there something fundamentally screwed up with the PCoA command in R, and I've discovered some serious programming failure?

But at this point, it's time for my coffee break. I went out for coffee with my girlfriend and complained to her about it, which of course produced no solution other than stress relief (for which I am, of course, extremely grateful). I went back to my office and sat down in front of my computer again for more head scratching; by this time, it was more like head-banging-on-desk.

But then, as I was reviewing the R commands for PCoA, it all hit me. How could I be so stupid?

The PCoA command in R can return two things called "points" and "eig", the former being the coordinates of each specimen along each PCo axis within the multidimensional space, and the latter being the eigenvalues associated with each axis. Only "points" is returned by default. I had been looking at the "points" all this time. Of course the points are going to include negative values, because the ordination centres the cloud of points on the origin.

I turned the "eig" option on, and R returned the eigenvalues; all positive.
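The description matches base R's cmdscale(), so assuming that was the command in question, the mix-up is easy to reproduce with toy data:

```r
# Toy data: five random points in 2-D, so the distances are Euclidean
# by construction
set.seed(1)
xy <- matrix(rnorm(10), ncol = 2)
d  <- dist(xy)

# Default call (eig = FALSE): the result is just the matrix of coordinates
pts <- cmdscale(d, k = 2)
colMeans(pts)       # essentially zero: each axis is centred on the origin
range(pts)          # so the coordinates run from negative to positive

# eig = TRUE: the result is a list holding the coordinates and the eigenvalues
fit <- cmdscale(d, k = 2, eig = TRUE)
names(fit)          # includes "points" and "eig"
fit$eig             # none meaningfully below zero; trailing values are ~0 up to rounding
```

Negative numbers in pts say nothing about whether the distances are Euclidean; only negative values in fit$eig do.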

I never thought I could be extremely happy with myself at the same time as being incredibly furious for making such a stupid mistake.

The moral of this story is: you learn from your mistakes.

Comments

Malacoda said…
I never make these sort of mistaks
es...
I'm a newbie to principal coordinate analysis. In fact, my current assignment is to write about principal coordinate analysis and principal component analysis... but the problem is that I can't find a simple enough reference to get an idea of these methods... can you please suggest any useful explanation? Thank you.
Hi Sudharsan,

I've found it difficult to find a single good reference on principal coordinates and principal components analyses, so I had to read multiple sources, namely multivariate statistics textbooks. But there is a pretty good essay by Norman MacLeod (Natural History Museum, London) that explains PCA and other related methods in relatively simple terms:

http://www.palass.org/modules.php?name=palaeo_math&page=3

The great thing about this essay series is that it starts with correlations and regressions and extends the line of thought to PCA.
Unknown said…
Holy crap! I found this blog post with the EXACT same problem in R. You just saved me some time.
Pauly said…
I'm also having a problem very close to this one, with the exception that I'm actually getting a non-Euclidean distance matrix out of UniFrac (strictly speaking, NOT a positive semi-definite matrix). And I'm also wondering what the hell to do with those negative eigenvalues. Does anyone know what they ultimately mean?
