Tuesday, May 12, 2009

Principal coordinate analysis and the quest for a solution to a non-existent problem

I had an interesting experience yesterday: I spent a good few hours on a silly problem. You don't need to know the technicalities of the analyses at all, but I'm sure you'll appreciate the humour in this.

I have been running a lot of principal coordinate analyses (PCoA) recently. This is because I am using an interesting application of multiple regression and PCoA on phylogeny vs phenotypic variables, called phylogenetic eigenvector regression (PVR; Diniz-Filho et al., 1998; Desdevises et al., 2003). In short, you take a phylogenetic tree of a given group of animals (or plants, or whatever your favourite group of organisms is), reduce the complex topology to manageable columns of numbers (by PCoA), and test these columns against some phenotypic/ecological variable of your choice for any correlations using multiple regression. Sounds pretty easy, and it is, in practice at least. You can script this very efficiently in R, if you know the language already.

Anyway, yesterday I reread the protocol that I had been following for the last few weeks and realised that PCoA on phylogenetic trees can sometimes result in negative eigenvalues, which is kind of annoying if you think of eigenvalues as representing the "amount of variation in the data explained by that axis" (Hammer and Harper 2006); does a negative value indicate a negative contribution??? Supposedly, this happens because phylogenetic distances are not necessarily Euclidean. So I had a look at my PCoA results and realised to my horror that a lot of my values were negative. Holy shit! Do I have to go back and reanalyse?

But first, I did the sensible thing and checked whether my distances were Euclidean (using is.euclid() from the ade4 package in R). Surprisingly, or unsurprisingly, my distances were Euclidean. Strange. But the values were negative...
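For anyone curious what such a check involves: by Gower's criterion, a distance matrix is Euclidean exactly when its double-centred matrix of squared distances has no negative eigenvalues. Below is a minimal sketch of that test in plain Python. It is not the R code used here and not the source of is.euclid(); the function names, the tolerance and the two toy matrices are all made up for illustration.

```python
import math

def gower_matrix(dist):
    """Double-centre the squared distances: G = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]   # equals column means (symmetric input)
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-20):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle that zeroes a[p][q]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):            # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):            # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def is_euclidean(dist, tol=1e-7):
    """True if no eigenvalue of the Gower matrix is below -tol (tol is an arbitrary choice)."""
    return min(jacobi_eigenvalues(gower_matrix(dist))) > -tol

# Toy example 1: a 3-4-5 right triangle, which obviously fits in a plane.
triangle = [[0, 3, 4],
            [3, 0, 5],
            [4, 5, 0]]

# Toy example 2: path lengths on a star with three tips (rows 1-3) and its
# centre (row 4). A valid metric, but no Euclidean configuration reproduces it.
star = [[0, 2, 2, 1],
        [2, 0, 2, 1],
        [2, 2, 0, 1],
        [1, 1, 1, 0]]

print(is_euclidean(triangle))  # True
print(is_euclidean(star))      # False
```

The second example is the kind of matrix that would produce genuinely negative eigenvalues in a PCoA, which is the situation the protocol was warning about.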

I sat there scratching my head for a while.

I read further and noted that in cases where you get negative eigenvalues, you may need to transform your original distance matrix following some standard procedures.

I searched for the relevant references and there were several suggested transformation procedures. All of them seemed pretty straightforward. So I tried all of them in turn.

None of them worked. The negative values were still there...

I'm really stuck now. I don't know what the cause of this problem is. Is there something inherently wrong with my data? Is there some other transformation that I could still use? Is there another command in R that could potentially solve this problem (it's really common in R to miss a basic command)? Or is there something fundamentally screwed up with the PCoA command in R, and I've discovered some serious programming failure?

But at this point, it's time for my coffee break. I went out for coffee with my girlfriend and complained to her about it, of course with no solution other than stress relief (for which I am, of course, extremely grateful). I went back to my office and sat down in front of my computer again for more head scratching - by this time, it's more like head-banging-on-desk.

But then, as I was reviewing the R commands for PCoA, it all hit me. How could I be so stupid?

The PCoA command in R can return two things, called "points" and "eig": the former are the coordinates of each specimen along each PCo axis within the multidimensional space, and the latter are the eigenvalues associated with each axis. Only the "points" are returned by default. I had been looking at the "points" all this time. Of course the points are going to include negative values, because the ordination centres the points around the origin.

I turned the "eig" feature on, and R returned the eigenvalues; all positive.
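To see the difference between the two outputs concretely, here is a minimal sketch in plain Python (again, not the R code used here). It extracts only the first PCo axis, using power iteration on the double-centred matrix, and assumes the distances are Euclidean. The function names and the toy data (four specimens placed along a line) are hypothetical. In base R's cmdscale(), for instance, the coordinates come back by default and the eigenvalues only when eig = TRUE is set.

```python
import math

def gower_matrix(dist):
    """Double-centre the squared distances: G = -1/2 * J * D^2 * J."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def first_pco_axis(dist, iterations=100):
    """Return (eigenvalue, coordinates) for the first PCo axis.

    Uses power iteration, so it finds the eigenvalue of largest magnitude;
    that is the first axis only if the distances are Euclidean (assumed here).
    """
    g = gower_matrix(dist)
    n = len(g)
    v = [float(i + 1) for i in range(n)]      # arbitrary start vector
    for _ in range(iterations):
        w = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                        # all distances zero
            break
        v = [x / norm for x in w]
    gv = [sum(g[i][j] * v[j] for j in range(n)) for i in range(n)]
    eig = sum(v[i] * gv[i] for i in range(n))  # Rayleigh quotient
    # Coordinates are the eigenvector scaled by the square root of the eigenvalue.
    points = [math.sqrt(max(eig, 0.0)) * x for x in v]
    return eig, points

# Toy data: four specimens at positions 0, 1, 3 and 6 along a single line.
positions = [0.0, 1.0, 3.0, 6.0]
dist = [[abs(a - b) for b in positions] for a in positions]

eig, points = first_pco_axis(dist)
print("eig:", round(eig, 6))                        # about 21
print("points:", [round(x, 6) for x in points])     # about -2.5, -1.5, 0.5, 3.5
```

The eigenvalue is positive, while half of the coordinates are negative simply because the points are centred on the origin. A negative number in the coordinates says nothing about the eigenvalues.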

I never thought I could be extremely happy with myself at the same time as being incredibly furious for making such a stupid mistake.

The moral of this story is: you learn from your mistakes.

5 comments:

Malacoda said...

I never make these sort of mistaks
es...

Sudharsan Selvaraja said...

I'm a newbie to principal coordinate analysis. In fact, my current assignment is to write about principal coordinate analysis and principal component analysis.... but the problem is that I can't find a simple enough reference to get an idea about these analyses.... can you please suggest any useful explanation? Thank you.

Raptor's Nest said...

Hi Sudharsan,

I've found it difficult to find a single good reference on principal coordinates and principal components analyses, so I had to read multiple sources, mainly multivariate statistics textbooks. But there is a pretty good essay by Norman MacLeod (Natural History Museum, London) that explains PCA and other related methods in relatively simple terms:

http://www.palass.org/modules.php?name=palaeo_math&page=3

The great thing about this essay series is that it starts with correlations and regressions and extends the line of thought to PCA.

Matt said...

Holy crap! I found this blog post with the EXACT same problem in R. You just saved me some time.

Pauly said...

I'm also having a problem very close to this one, with the exception that I'm actually getting a non-Euclidean distance matrix out of UniFrac (strictly speaking, NOT a positive semi-definite matrix). And I'm also wondering what the hell to do with those negative eigenvalues. Does anyone know what they ultimately mean?