

Algorithm Summary

We compare two sequences ${\bf A}$ and ${\bf B}$, of lengths $l_{\bf A}$ and $l_{\bf B}$, using a two-dimensional array $W$. The first index of $W$ ranges from zero to $l_{\bf A}$ and the second from zero to $l_{\bf B}$.

Denote by ${\bf A}[i]$ the letter in position $i$ of sequence ${\bf A}$, and by ${\bf A}[i_1, i_2]$ the subsequence of ${\bf A}$ from positions $i_1$ to $i_2$.
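
To fix this indexing convention concretely, here is a small Python sketch of the notation; the helper names letter and subseq are ours, and we assume the subsequence ${\bf A}[i_1, i_2]$ includes both endpoints.

\begin{verbatim}
# Illustration of the 1-based indexing convention (helper names are ours).
A = "ACGTAC"
l_A = len(A)                       # l_A = 6

def letter(A, i):
    # A[i] in the text: the letter at 1-based position i.
    return A[i - 1]

def subseq(A, i1, i2):
    # A[i1, i2] in the text: positions i1 through i2, assumed inclusive.
    return A[i1 - 1:i2]

assert letter(A, 1) == "A"
assert subseq(A, 2, 4) == "CGT"
\end{verbatim}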

Our metaphor for comparing ${\bf A}$ and ${\bf B}$ is writing them on a long paper tape, like CLUSTAL output without line breaks. Number the columns from zero, with the zeroth column in each row containing a special start symbol, ${\mathcal S}$. The last column on the tape is filled with a special end symbol, ${\mathcal E}$. Otherwise, each position may be a letter (A, C, G, or T for nucleotide sequences) or a blank, and no column is entirely blank. We model the machine which writes the tape column by column as a Markov process.

A list of transitions for the machine is an evolutionary history, and its probability is the product of the probabilities of its transitions.
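
As a toy illustration of this product, suppose a hypothetical machine (not the actual TKF transition set) moves among states such as match, insert, and delete; the probability of one history is then computed as follows.

\begin{verbatim}
# Toy sketch: probability of one evolutionary history as the product of its
# transition probabilities. States and numbers are hypothetical, chosen only
# to illustrate the product; they are not the TKF model parameters.
trans = {
    ("start", "match"): 0.9,   ("start", "end"): 0.1,
    ("match", "match"): 0.8,   ("match", "insert"): 0.05,
    ("match", "delete"): 0.05, ("match", "end"): 0.1,
    ("insert", "match"): 0.7,  ("insert", "insert"): 0.2, ("insert", "end"): 0.1,
    ("delete", "match"): 0.7,  ("delete", "delete"): 0.2, ("delete", "end"): 0.1,
}

def history_probability(states):
    # Multiply the probability of each transition in the listed history.
    p = 1.0
    for prev, curr in zip(states, states[1:]):
        p *= trans[(prev, curr)]
    return p

# One possible history: S, two aligned columns, one column with a blank, E.
print(history_probability(["start", "match", "match", "insert", "end"]))
\end{verbatim}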

We define the arrays $P(i,j)$, $P^\vee(i,j)$, and $V(i,j)$, with the same indices as $W$. Let $P(i,j)$ be the probability that the machine writes a tape which can be cut off at some column so that the first row contains ${\mathcal S}$ and ${\bf A}[1, i]$ and the second row contains ${\mathcal S}$ and ${\bf B}[1,j]$, interspersed with any combination of blanks.

This $P$ array produces the [TKF] sum approach when applied with their family of Markov models. Their model gives a sensible way to think of Needleman-Wunsch dynamic programming as a sum over all paths rather than only the classic Viterbi trace-back path.
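
The following sketch shows such a sum-over-all-paths fill for $P$. It assumes a deliberately simplified model in which each tape column is scored independently by hypothetical functions p_match, p_gap_top, and p_gap_bottom standing in for the Markov transition probabilities; it is meant only to show the shape of the recursion, not the TKF model itself.

\begin{verbatim}
import numpy as np

def front_fill(A, B, p_match, p_gap_top, p_gap_bottom):
    # P[i, j]: total probability, over all alignments, of writing S, A[1..i]
    # in the top row and S, B[1..j] in the bottom row. Columns are scored
    # independently here -- a simplification used to show the sum recursion.
    lA, lB = len(A), len(B)
    P = np.zeros((lA + 1, lB + 1))
    P[0, 0] = 1.0                                  # both rows hold only S
    for i in range(lA + 1):
        for j in range(lB + 1):
            if i > 0 and j > 0:                    # aligned column
                P[i, j] += P[i - 1, j - 1] * p_match(A[i - 1], B[j - 1])
            if i > 0:                              # blank in the B row
                P[i, j] += P[i - 1, j] * p_gap_bottom(A[i - 1])
            if j > 0:                              # blank in the A row
                P[i, j] += P[i, j - 1] * p_gap_top(B[j - 1])
    return P
\end{verbatim}

In this sketch, replacing each sum with a maximum would recover the Viterbi-style fill with a single trace-back path; keeping the sum gives the total probability over all alignments.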

Let $P^\vee(i,j)$ be the probability that the machine writes a tape which can be cut off at some column so that the first row contains ${\bf A}[i+1, l_{\bf A}]$ followed by ${\mathcal E}$, and the second row contains ${\bf B}[j+1, l_{\bf B}]$ followed by ${\mathcal E}$, with any combination of blanks. We call $P^\vee$ a back fill array and $P$ a front fill array.
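
Under the same simplified column scoring as in the front fill sketch above, the back fill array is the mirror-image recursion, run from the sequence ends back toward the start.

\begin{verbatim}
import numpy as np

def back_fill(A, B, p_match, p_gap_top, p_gap_bottom):
    # Pb[i, j]: total probability, over all alignments, of writing
    # A[i+1..l_A], E in the top row and B[j+1..l_B], E in the bottom row.
    # Same simplified column scoring as in front_fill above.
    lA, lB = len(A), len(B)
    Pb = np.zeros((lA + 1, lB + 1))
    Pb[lA, lB] = 1.0                               # nothing left before E
    for i in range(lA, -1, -1):
        for j in range(lB, -1, -1):
            if i < lA and j < lB:                  # aligned column
                Pb[i, j] += Pb[i + 1, j + 1] * p_match(A[i], B[j])
            if i < lA:                             # blank in the B row
                Pb[i, j] += Pb[i + 1, j] * p_gap_bottom(A[i])
            if j < lB:                             # blank in the A row
                Pb[i, j] += Pb[i, j + 1] * p_gap_top(B[j])
    return Pb
\end{verbatim}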

Let $\Pi_{\bf A}$ be the probability for the machine to write a tape with ${\mathcal S}$ followed by ${\bf A}$ followed by ${\mathcal E}$, interspersed with blanks, in the first row, and let $\Pi_{\bf B}$ be the analogous probability for ${\bf B}$ in the second row.

Let

\begin{displaymath}V(i,j) = \log \Pi_{\bf A}+ \log \Pi_{\bf B}- \log P(i,j) - \log P^\vee(i,j).\end{displaymath}

The combination of front fill and back fill makes the values in array $V$ comparable with each other as log likelihoods for the evolutionary process to pass through intermediate points given by coordinates $(i,j)$. The normalization by $\Pi_{\bf A}$ and $\Pi_{\bf B}$ makes $V$ arrays for different sequence pairs comparable. That is, we can ask whether ${\bf A}_1$ and ${\bf B}_1$ are more likely to be related than ${\bf A}_2$ and ${\bf B}_2$ are.
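
Continuing the sketch, $V$ is assembled directly from the two fill arrays and the single-row probabilities; only the combination of terms is shown here, with the pieces computed as above.

\begin{verbatim}
import numpy as np

def v_array(P, P_back, pi_A, pi_B):
    # V(i, j) = log Pi_A + log Pi_B - log P(i, j) - log P_back(i, j).
    # P and P_back are the front and back fill arrays; pi_A and pi_B are the
    # single-row probabilities Pi_A and Pi_B defined in the text.
    return np.log(pi_A) + np.log(pi_B) - np.log(P) - np.log(P_back)
\end{verbatim}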

In the detailed algorithm construction, we modify the TKF process to produce an array $W$. The concept for the algorithm is the same, but the array $W(i,j)$ weights likelihoods for alignments near position $i$ in ${\bf A}$ and position $j$ in ${\bf B}$ more heavily than those far away.

