This simple
Excel
sheet aligns two DNA, RNA or amino acid sequences and calculates the
similarity between them. The amino acid
scoring matrix is PAM250. In the case of nucleotides, the average length of a
gene is ideal, although you can use it for
longer sequences considering that it takes much longer. As the program Dnadist in the Phylip package, missing bases, insertions or deletions don’t count and an ‘N’ is a match
when it
calculates the percent similarity. It also handles other
ambiguous bases: R, Y, K, M, etc.
The actual alignment takes a few seconds, worthwhile if you consider that to
calculate the similarity otherwise, you would have to
go through a series of programs such as those in the Phylip
package.
If you type
in the word
‘reverse’ in the second column, it will do
the alignment of the reverse and complement of the nucleotide sequence.
To run the program, paste
the sequences in the sheets labelled as ‘sequence 1’ and ‘sequence
2’.
It's ok if the sequences have numbers, spaces or return characters. The program
will ignore all these but letters and will put the sequences in the first and
second cells on the main sheet. Pull down ‘Tools’, select ‘macros’ and run ‘align’. Or just hit Control +
"a".
The program is written in Visual Basic for
Applications from scratch based on the Smith-Waterman method that is described
in any Bioinformatics book. You might find a teaching purpose for this program
but I also use it to get a quick and accurate percent similarity between two
sequences. The following is a description on how the Smith-Waterman alignment
method works.
The algorithm was described back in 1981 by Smith and Waterman (PubMed)
and finds subregions that align well. It assumes that any cell in the alignment
matrix could be the end of a local alignment and its similarity is represented
by a value in each cell.
As an example, let's align the sequences ACGGCTC and ATGGCCTC in the matrix
below.
Let's set a few parameters: 1 for a match, -3 for a mismatch and -4 for a
gap.
First, it fills in the first row and column with 1 if there is a match and 0 if
there is a mismatch.
|
A |
C |
G |
G |
C |
T |
C |
A |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
T |
0 |
|
|
|
|
|
|
G |
0 |
|
|
|
|
|
|
G |
0 |
|
|
|
|
|
|
C |
0 |
|
|
|
|
|
|
C |
0 |
|
|
|
|
|
|
T |
0 |
|
|
|
|
|
|
C |
0 |
|
|
|
|
|
|
Then,
it fills in the rest of the matrix. For each cell, if there is a
match it takes the value of 1, which is the value for a match, plus the value of the previous cell in
the diagonal. If there is no match it takes the
highest values from three different calculations or 0 if the highest value
turns out to be negative. In the first case, it takes the value of -3 and then it adds the value from
the previous cell in the diagonal. For example, in the cell in
yellow below it
should be -3 (a mismatch) +1 (previous cell in the diagonal) = -2. For the second and third cases it considers a deletion
to find a greater score than the one calculated in the first case.
It does this by taking the value of -4 (value for a gap) and adding the value in the previous row or in the previous
column (-4 + 0 in both cases). Since all three values are negative, the score in the
cell is 0.
|
A |
C |
G |
G |
C |
T |
C |
A |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
T |
0 |
0 |
|
|
|
|
|
G |
0 |
|
|
|
|
|
|
G |
0 |
|
|
|
|
|
|
C |
0 |
|
|
|
|
|
|
C |
0 |
|
|
|
|
|
|
T |
0 |
|
|
|
|
|
|
C |
0 |
|
|
|
|
|
|
The
matrix should look like this when it is complete:
|
A |
C |
G |
G |
C |
T |
C |
A |
1 |
0 |
0 |
0 |
0 |
0 |
0 |
T |
0 |
0 |
0 |
0 |
0 |
1 |
0 |
G |
0 |
0 |
1 |
1 |
0 |
0 |
0 |
G |
0 |
0 |
1 |
2 |
0 |
0 |
0 |
C |
0 |
1 |
0 |
0 |
3 |
0 |
1 |
C |
0 |
1 |
0 |
0 |
1 |
0 |
1 |
T |
0 |
0 |
0 |
0 |
0 |
2 |
0 |
C |
0 |
1 |
0 |
0 |
1 |
0 |
3 |
Now it
finds the cell with the greatest value, which is 3 in the cell in
red and represents the end of the alignment with the highest score.
It goes then backwards following the path with the highest score (in
yellow). It goes also forward from the cell in red following the
path with the highest score. If you start at the cell in red and
move forward, you find a 0 in front, a 0 in the diagonal but 1 in
the cell below so it takes that path, which means a gap there that gives a better alignment score.
In the example there are actually two cells with a value of 3 but
either one gives the same path.
The final
alignment should look like this:
ACGG-CTC
ATGGCCTC
The same applies
to the alignment of peptide sequences except that a scoring matrix is used to calculate the
scores for each pair of amino acids in a way that a mismatch between the same
kind of amino acids is not considered the same as a mismatch between different
kinds of amino acids.
Last
modified on February 16,
2009
Go back
|