Tools for phylogenetic analysis

Smith-Waterman pairwise alignment

This simple Excel sheet aligns two DNA, RNA or amino acid sequences and calculates the similarity between them. The amino acid scoring matrix is PAM250. In the case of nucleotides, the average length of a gene is ideal, although you can use it for longer sequences considering that it takes much longer. As the program Dnadist in the Phylip package, missing bases, insertions or deletions don’t count and an ‘N’ is a match when it calculates the percent similarity. It also handles other ambiguous bases: R, Y, K, M, etc.

The actual alignment takes a few seconds, worthwhile if you consider that to calculate the similarity otherwise, you would have to go through a series of programs such as those in the Phylip package.

If you type in the word ‘reverse’ in the second column, it will do the alignment of the reverse and complement of the nucleotide sequence.

To run the program, paste the sequences in the sheets labelled as ‘sequence 1’ and ‘sequence 2’. It's ok if the sequences have numbers, spaces or return characters. The program will ignore all these but letters and will put the sequences in the first and second cells on the main sheet. Pull down ‘Tools’, select ‘macros’ and run ‘align’. Or just hit Control + "a".

The program is written in Visual Basic for Applications from scratch based on the Smith-Waterman method that is described in any Bioinformatics book. You might find a teaching purpose for this program but I also use it to get a quick and accurate percent similarity between two sequences. The following is a description on how the Smith-Waterman alignment method works.

The algorithm was described back in 1981 by Smith and Waterman (PubMed) and finds subregions that align well. It assumes that any cell in the alignment matrix could be the end of a local alignment and its similarity is represented by a value in each cell.

As an example, let's align the sequences ACGGCTC and ATGGCCTC in the matrix below.

Let's set a few parameters: 1 for a match, -3 for a mismatch and -4 for a gap.

First, it fills in the first row and column with 1 if there is a match and 0 if there is a mismatch.

	A	C	G	G	C	T	C
A	1	0	0	0	0	0	0
T	0
G	0
G	0
C	0
C	0
T	0
C	0

Then, it fills in the rest of the matrix. For each cell, if there is a match it takes the value of 1, which is the value for a match, plus the value of the previous cell in the diagonal. If there is no match it takes the highest values from three different calculations or 0 if the highest value turns out to be negative. In the first case, it takes the value of -3 and then it adds the value from the previous cell in the diagonal. For example, in the cell in yellow below it should be -3 (a mismatch) +1 (previous cell in the diagonal) = -2. For the second and third cases it considers a deletion to find a greater score than the one calculated in the first case. It does this by taking the value of -4 (value for a gap) and adding the value in the previous row or in the previous column (-4 + 0 in both cases). Since all three values are negative, the score in the cell is 0.

	A	C	G	G	C	T	C
A	1	0	0	0	0	0	0
T	0	0
G	0
G	0
C	0
C	0
T	0
C	0

The matrix should look like this when it is complete:

	A	C	G	G	C	T	C
A	1	0	0	0	0	0	0
T	0	0	0	0	0	1	0
G	0	0	1	1	0	0	0
G	0	0	1	2	0	0	0
C	0	1	0	0	3	0	1
C	0	1	0	0	1	0	1
T	0	0	0	0	0	2	0
C	0	1	0	0	1	0	3

Now it finds the cell with the greatest value, which is 3 in the cell in red and represents the end of the alignment with the highest score. It goes then backwards following the path with the highest score (in yellow). It goes also forward from the cell in red following the path with the highest score. If you start at the cell in red and move forward, you find a 0 in front, a 0 in the diagonal but 1 in the cell below so it takes that path, which means a gap there that gives a better alignment score. In the example there are actually two cells with a value of 3 but either one gives the same path.

The final alignment should look like this:

ACGG-CTC

ATGGCCTC

The same applies to the alignment of peptide sequences except that a scoring matrix is used to calculate the scores for each pair of amino acids in a way that a mismatch between the same kind of amino acids is not considered the same as a mismatch between different kinds of amino acids.

Last modified on February 16, 2009

Go back