This Excel sheet will help you calculate the Cophenetic Correlation Coefficient (CCC). The CCC is the correlation between the distance values calculated during tree building and the observed distances, so it measures how faithfully a dendrogram preserves the original pairwise distances. To calculate it, the sheet uses the input and output files of the programs Neighbor or Fitch, which are distance methods included in the Phylip package. You might want to download the latest version of Phylip, since output files produced by an older version might cause problems. When using Neighbor, I suggest you use the UPGMA option to generate the tree, since the other option, neighbor-joining, "corrects" for the artefacts that the CCC tests for. I also suggest using Jukes-Cantor evolutionary distances.
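
As a rough illustration of what the coefficient is, the sketch below computes a Pearson correlation between observed distances and tree-derived distances for the same taxon pairs. It is not the macro's code; the function name `cophenetic_correlation` and the three-taxon values are made up for the example.

```python
from math import sqrt

def cophenetic_correlation(observed, tree):
    """Pearson correlation between two dicts keyed by the same taxon pairs."""
    pairs = sorted(observed)
    x = [observed[p] for p in pairs]  # observed pairwise distances
    y = [tree[p] for p in pairs]      # distances read off the tree
    n = len(pairs)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: three taxa, so three pairwise distances.
observed = {("A", "B"): 0.10, ("A", "C"): 0.30, ("B", "C"): 0.32}
tree_dist = {("A", "B"): 0.10, ("A", "C"): 0.31, ("B", "C"): 0.31}
ccc = cophenetic_correlation(observed, tree_dist)
```

A value close to 1 means the tree reproduces the observed distances well. With a UPGMA tree the A-C and B-C tree distances are forced to be equal, which is the kind of distortion the CCC picks up.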
To run the program, open the macros window and run CCC (pull down Tools and select "macros"). Alternatively, press Control + "c". It will ask for the input files: first "infile", which is the input file that Neighbor or Fitch uses to generate a tree, and then "outtree" (called "treefile" in the previous version of Phylip), which is generated by either of these two Phylip programs. Make sure the sequence labels contain letters or numbers only, with no spaces or other characters. Once the right files are selected, the program copies the data into the Excel sheet, deleting whatever was there before. A graph shows the correlation between the branch lengths and the distances.
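
For readers who want to inspect "infile" outside Excel, here is a minimal Python sketch of reading a distance matrix. It assumes a square matrix whose first token is the number of taxa and whose labels contain no whitespace (the restriction mentioned above); the function name `read_phylip_distances` and the sample data are hypothetical, and lower-triangular files are not handled.

```python
def read_phylip_distances(text):
    """Parse a square PHYLIP-style distance matrix into (labels, matrix)."""
    tokens = text.split()          # token-based, so wrapped rows still work
    n = int(tokens[0])             # first token: number of taxa
    labels, matrix = [], []
    pos = 1
    for _ in range(n):
        labels.append(tokens[pos])
        row = [float(v) for v in tokens[pos + 1 : pos + 1 + n]]
        matrix.append(row)
        pos += 1 + n               # skip the label plus its n distances
    return labels, matrix

# Hypothetical three-taxon input.
sample = """3
Alpha      0.00 0.10 0.30
Beta       0.10 0.00 0.32
Gamma      0.30 0.32 0.00
"""
labels, matrix = read_phylip_distances(sample)
```

Splitting on whitespace is the reason labels must not contain spaces in this sketch: a label with a space would be read as two tokens and shift every distance after it.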
The program doesn’t change any values in the input files; it only adds up the branch lengths in the file "outtree" for each pair of taxa. The graph and the CCC could help you improve your tree. A wrong alignment or sequencing mistakes in one or more sequences could lower the CCC. A low CCC could also be due to horizontal gene transfer, as has been shown in the genera Aeromonas, Acinetobacter and Pseudomonas. A low correlation between DNA sequence similarities and DNA hybridization data also suggests horizontal gene transfer. If interested, please check Keswani and Whitman (2001) below. You can find their calculations on this Excel sheet, which you can edit to analyze your own data.
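
The step of adding up branch lengths for each pair of taxa can be sketched in Python as follows. This is an illustration, not the macro itself: it assumes "outtree" holds a single Newick tree with branch lengths and labels free of spaces and punctuation, and the function name `pairwise_path_lengths` is invented for the example.

```python
from itertools import combinations

def pairwise_path_lengths(newick):
    """Return {(a, b): summed branch length} for every pair of leaves."""
    text = "".join(newick.split()).rstrip(";")  # drop whitespace and ";"
    pos = 0
    distances = {}

    def parse_length():
        # Read an optional ":<number>" after a node; default to 0.0.
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            return float(text[start:pos])
        return 0.0

    def parse_node():
        # Returns {leaf: distance from the leaf up to this node's parent}.
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            pos += 1                      # skip ")"
            while pos < len(text) and text[pos] not in ":,()":
                pos += 1                  # skip an optional internal label
            # Leaves in different children are connected through this node.
            for left, right in combinations(children, 2):
                for a, da in left.items():
                    for b, db in right.items():
                        distances[tuple(sorted((a, b)))] = da + db
            depths = {}
            for child in children:
                depths.update(child)
        else:
            start = pos
            while pos < len(text) and text[pos] not in ":,()":
                pos += 1
            depths = {text[start:pos]: 0.0}
        length = parse_length()
        return {leaf: d + length for leaf, d in depths.items()}

    parse_node()
    return distances

# Hypothetical tree: A and B are sisters, C joins them at the root.
paths = pairwise_path_lengths("((A:0.05,B:0.05):0.10,C:0.15);")
```

Each pair of leaves is recorded at the node where their lineages first meet, so the stored value is the full path length between them. The loop over children also copes with the three-way split at the base of an unrooted tree.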
Another way to validate your tree is to calculate the distortion coefficient E. You may not need it, but I used it to make sure this program was running correctly. E can be calculated with the output files of Neighbor or Fitch, and the program also calculates it. In its definition, T is the number of taxa, ω_{ij} is some weighting function, d_{ij} is the observed pairwise distance between taxa i and j, p_{ij} is the path-length-derived distance between taxa i and j, and α is the exponent: α = 1 gives an absolute-distance criterion and α = 2 a weighted least-squares criterion.
If ω_{ij} = 1, all distances are expected to have the same error; if ω_{ij} = 1/d_{ij}, the error is expected to be proportional to the square root of the observed distance; if ω_{ij} = (1/d_{ij})^{2}, the error is expected to be proportional to the observed distance.
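
Written out with the symbols defined above, E presumably takes the form below. This is a reconstruction from those definitions, not a formula quoted from the program; in particular, summing over each pair of taxa once (i < j) is an assumption, chosen because it agrees with the check against Fitch's sum of squares.

```latex
E = \sum_{i=1}^{T-1} \sum_{j=i+1}^{T} \omega_{ij} \, \left| d_{ij} - p_{ij} \right|^{\alpha}
```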
The outfile that the program Fitch (also in the PHYLIP package) creates shows the sum of squares, so you can check that E, when α = 2 and ω_{ij} = (1/d_{ij})^{2}, equals the sum of squares divided by 2.
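
A minimal Python sketch of this calculation is given below. It assumes that E sums ω_{ij}·|d_{ij} − p_{ij}|^α over each pair of taxa once, which is what the sum-of-squares check suggests. The function name `distortion_coefficient`, the `weight_power` parameter (0, 1 or 2 for ω_{ij} = 1, 1/d_{ij} or (1/d_{ij})^{2}) and the data are invented for the example.

```python
def distortion_coefficient(observed, tree, alpha=2, weight_power=2):
    """Sum of w_ij * |d_ij - p_ij| ** alpha, with w_ij = d_ij ** -weight_power.

    Both arguments are dicts keyed by taxon pair, each pair listed once.
    Observed distances must be nonzero when weight_power > 0.
    """
    total = 0.0
    for pair, d in observed.items():
        p = tree[pair]  # path-length distance for the same pair
        total += abs(d - p) ** alpha / d ** weight_power
    return total

# Hypothetical data: observed distances and tree path lengths.
observed = {("A", "B"): 0.10, ("A", "C"): 0.30, ("B", "C"): 0.32}
tree_dist = {("A", "B"): 0.10, ("A", "C"): 0.31, ("B", "C"): 0.31}

e_weighted = distortion_coefficient(observed, tree_dist)  # alpha=2, w=1/d^2
e_unweighted = distortion_coefficient(observed, tree_dist, weight_power=0)
```

Under the relation stated above, doubling `e_weighted` is the number to compare with the sum of squares that Fitch reports for the same tree and distances.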
In the case of a neighbor-joining tree, you can validate it using bootstrap values.
Bibliography
Cavalli-Sforza, L. L., and A. W. F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. American Journal of Human Genetics 19:233-257.
Farris, J. S. 1969. On the cophenetic correlation coefficient. Systematic Zoology 18:279-285.
Felsenstein, J. 1993. Phylogeny Inference Package (Phylip). Version 3.5. University of Washington, Seattle.
Fitch, W. M., and E. Margoliash. 1967. Construction of phylogenetic trees. Science 155:279-284.
Keswani, J., and W. B. Whitman. 2001. Relationship of 16S rRNA sequence similarity to DNA hybridisation in prokaryotes. Int. J. Syst. Evol. Microbiol. 51:667-678.
Sokal, R. R., and F. J. Rohlf. 1962. The comparisons of dendrograms by objective methods. Taxon 11:33-40.
