

Algorithm Summary

We compare two sequences ${\bf A}$ and ${\bf B}$, of lengths $l_{\bf A}$ and $l_{\bf B}$, using a two-dimensional array $W$. The first index of $W$ ranges from zero to $l_{\bf A}$ and the second from zero to $l_{\bf B}$.

Denote by ${\bf A}[i]$ the letter in position $i$ of sequence ${\bf A}$, and by ${\bf A}[i_1, i_2]$ the subsequence of ${\bf A}$ from positions $i_1$ to $i_2$.
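
To fix this indexing convention concretely, here is a small Python sketch of the notation; the helper names letter and subseq are ours, and we assume the subsequence ${\bf A}[i_1, i_2]$ includes both endpoints.

\begin{verbatim}
# Illustration of the 1-based indexing convention (helper names are ours).
A = "ACGTAC"
l_A = len(A)                       # l_A = 6

def letter(A, i):
    # A[i] in the text: the letter at 1-based position i.
    return A[i - 1]

def subseq(A, i1, i2):
    # A[i1, i2] in the text: positions i1 through i2, assumed inclusive.
    return A[i1 - 1:i2]

assert letter(A, 1) == "A"
assert subseq(A, 2, 4) == "CGT"
\end{verbatim}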

Our metaphor for comparing ${\bf A}$ and ${\bf B}$ is writing them on a long paper tape, like CLUSTAL output without line breaks. Number the columns from zero, with the zeroth column in each row containing a special start symbol, ${\mathcal S}$. The last column on the tape is filled with a special end symbol, ${\mathcal E}$. Otherwise, each position may be a letter (A, C, G, or T for nucleotide sequences) or a blank, and no column is entirely blank. We model the machine which writes the tape column by column as a Markov process.

A list of transitions for the machine is an evolutionary history, and its probability is the product of the probabilities of its transitions.
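
As a toy illustration of this product, suppose a hypothetical machine (not the actual TKF transition set) moves among states such as match, insert, and delete; the probability of one history is then computed as follows.

\begin{verbatim}
# Toy sketch: probability of one evolutionary history as the product of its
# transition probabilities. States and numbers are hypothetical, chosen only
# to illustrate the product; they are not the TKF model parameters.
trans = {
    ("start", "match"): 0.9,   ("start", "end"): 0.1,
    ("match", "match"): 0.8,   ("match", "insert"): 0.05,
    ("match", "delete"): 0.05, ("match", "end"): 0.1,
    ("insert", "match"): 0.7,  ("insert", "insert"): 0.2, ("insert", "end"): 0.1,
    ("delete", "match"): 0.7,  ("delete", "delete"): 0.2, ("delete", "end"): 0.1,
}

def history_probability(states):
    # Multiply the probability of each transition in the listed history.
    p = 1.0
    for prev, curr in zip(states, states[1:]):
        p *= trans[(prev, curr)]
    return p

# One possible history: S, two aligned columns, one column with a blank, E.
print(history_probability(["start", "match", "match", "insert", "end"]))
\end{verbatim}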

We define the arrays $P(i,j)$, $P^\vee(i,j)$, and $V(i,j)$, with the same indices as $W$. Let $P(i,j)$ be the probability that the machine writes a tape which can be cut off at some column so that the first row contains ${\mathcal S}$ and ${\bf A}[1, i]$ and the second row contains ${\mathcal S}$ and ${\bf B}[1,j]$, interspersed with any combination of blanks.

This $P$ array produces the [TKF] sum approach when applied with their family of Markov models. Their model gives a sensible way to think of Needleman-Wunsch dynamic programming as a sum over all paths rather than only the classic Viterbi trace-back path.
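
The following sketch shows such a sum-over-all-paths fill for $P$. It assumes a deliberately simplified model in which each tape column is scored independently by hypothetical functions p_match, p_gap_top, and p_gap_bottom standing in for the Markov transition probabilities; it is meant only to show the shape of the recursion, not the TKF model itself.

\begin{verbatim}
import numpy as np

def front_fill(A, B, p_match, p_gap_top, p_gap_bottom):
    # P[i, j]: total probability, over all alignments, of writing S, A[1..i]
    # in the top row and S, B[1..j] in the bottom row. Columns are scored
    # independently here -- a simplification used to show the sum recursion.
    lA, lB = len(A), len(B)
    P = np.zeros((lA + 1, lB + 1))
    P[0, 0] = 1.0                                  # both rows hold only S
    for i in range(lA + 1):
        for j in range(lB + 1):
            if i > 0 and j > 0:                    # aligned column
                P[i, j] += P[i - 1, j - 1] * p_match(A[i - 1], B[j - 1])
            if i > 0:                              # blank in the B row
                P[i, j] += P[i - 1, j] * p_gap_bottom(A[i - 1])
            if j > 0:                              # blank in the A row
                P[i, j] += P[i, j - 1] * p_gap_top(B[j - 1])
    return P
\end{verbatim}

In this sketch, replacing each sum with a maximum would recover the Viterbi-style fill with a single trace-back path; keeping the sum gives the total probability over all alignments.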

Let $P^\vee(i,j)$ be the probability that the machine writes a tape which can be cut off at some column so that the first row contains ${\bf A}[i+1, l_{\bf A}]$ followed by ${\mathcal E}$, and the second row contains ${\bf B}[j+1, l_{\bf B}]$ followed by ${\mathcal E}$, with any combination of blanks. We call $P^\vee$ a back fill array and $P$ a front fill array.
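
Under the same simplified column scoring as in the front fill sketch above, the back fill array is the mirror-image recursion, run from the sequence ends back toward the start.

\begin{verbatim}
import numpy as np

def back_fill(A, B, p_match, p_gap_top, p_gap_bottom):
    # Pb[i, j]: total probability, over all alignments, of writing
    # A[i+1..l_A], E in the top row and B[j+1..l_B], E in the bottom row.
    # Same simplified column scoring as in front_fill above.
    lA, lB = len(A), len(B)
    Pb = np.zeros((lA + 1, lB + 1))
    Pb[lA, lB] = 1.0                               # nothing left before E
    for i in range(lA, -1, -1):
        for j in range(lB, -1, -1):
            if i < lA and j < lB:                  # aligned column
                Pb[i, j] += Pb[i + 1, j + 1] * p_match(A[i], B[j])
            if i < lA:                             # blank in the B row
                Pb[i, j] += Pb[i + 1, j] * p_gap_bottom(A[i])
            if j < lB:                             # blank in the A row
                Pb[i, j] += Pb[i, j + 1] * p_gap_top(B[j])
    return Pb
\end{verbatim}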

Let $\Pi_{\bf A}$ be the probability for the machine to write a tape with ${\mathcal S}$ followed by ${\bf A}$ followed by ${\mathcal E}$, interspersed with blanks, in the first row, and let $\Pi_{\bf B}$ be the analogous probability for ${\bf B}$ in the second row.

Let

\begin{displaymath}V(i,j) = \log \Pi_{\bf A}+ \log \Pi_{\bf B}- \log P(i,j) - \log P^\vee(i,j).\end{displaymath}

The combination of front fill and back fill makes the values in array $V$ comparable with each other as log likelihoods for the evolutionary process to pass through intermediate points given by coordinates $(i,j)$. The normalization by $\Pi_{\bf A}$ and $\Pi_{\bf B}$ makes $V$ arrays for different sequence pairs comparable. That is, we can ask whether ${\bf A}_1$ and ${\bf B}_1$ are more likely to be related than ${\bf A}_2$ and ${\bf B}_2$ are.
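
Continuing the sketch, $V$ is assembled directly from the two fill arrays and the single-row probabilities; only the combination of terms is shown here, with the pieces computed as above.

\begin{verbatim}
import numpy as np

def v_array(P, P_back, pi_A, pi_B):
    # V(i, j) = log Pi_A + log Pi_B - log P(i, j) - log P_back(i, j).
    # P and P_back are the front and back fill arrays; pi_A and pi_B are the
    # single-row probabilities Pi_A and Pi_B defined in the text.
    return np.log(pi_A) + np.log(pi_B) - np.log(P) - np.log(P_back)
\end{verbatim}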

In the detailed algorithm construction, we modify the TKF process to produce an array $W$. The concept for the algorithm is the same, but the array $W(i,j)$ weights likelihoods for alignments near position $i$ in ${\bf A}$ and position $j$ in ${\bf B}$ more heavily than those far away.

