The one state ground process

Next: Modifying the one state Up: Algorithm construction Previous: Algorithm construction

The one state ground process

The ground process is a simple model of evolution. It has predictive strength and permits a clean description of the new algorithm. The ground process is really a complex parameter and a different process may be substituted without essential changes to our algorithm. For example, we could use the process of [TKF2], which allows different substitution models for transitions and transversions.

The ground process evolution model is that beginning with an ancestor sequence ${\bf C}$ , the sequence of each generation is mostly copied faithfully to the next generation, and occassionally letters are substituted, deleted, or inserted. The frequency of these events is expressed per letter, per evolutionary distance, i.e. time.

The ground process is reversible. The probability for a particular letter at a certain point to be deleted is the same as the probability to insert that particular letter at a that point. Substitution of letter for letter has the same probability as the reverse substitution.

For sequences ${\bf A}$ and ${\bf B}$ , let

$\begin{displaymath}Pr_{t}({\bf A},{\bf B})\end{displaymath}$

be the probability that ${\bf A}$ and ${\bf B}$ are separated by evolutionary distance

, and so diverged

ago. Let

$\begin{displaymath}Pr_t({\bf B}\vert {\bf A})\end{displaymath}$

be the probability that ${\bf A}$ evolves to ${\bf B}$ over evolutionary distance

. In this view, $\Pi_{\bf A}$ is the equilibrium probability to observe ${\bf A}$ after a long time and $\Pi_{\bf B}$ the equilibrium probability to observe ${\bf B}$ .

A consequence of reversibility is

$\begin{displaymath}Pr_{t}({\bf A}, {\bf B}) = \Pi_{\bf A}Pr_{t}({\bf B}\vert {\bf A}) = \Pi_{\bf B}Pr_{t}({\bf A}\vert {\bf B}).\end{displaymath}$

Thus, it is unnecessary to find the hidden ancestor of ${\bf A}$ and ${\bf B}$ in order to evaluate their similarity.

We describe the ground process in two parts: the development of an evolutionary history, and the specification of the letters which fill that history.

The start state of the machine writes column zero

${\mathcal S}$

${\mathcal S}$

, on the paper tape, expressed as the ordered pair $({\mathcal S}, {\mathcal S})$ . The transition to each state writes a column on the paper tape, using ${\mathcal N}$ to stand in for a nucleotide to be selected in the next phase. The available states and their transition outputs are: homology, $( {\mathcal N}, {\mathcal N})$ ; deletion, $( {\mathcal N}, -)$ ; insertion, $(- , {\mathcal N})$ ; and termination, $({\mathcal E}, {\mathcal E})$ . A nonhomology event, including mismatch, is deletion followed by insertion.

An evolutionary history is a sequence of transitions beginning at start and ending at termination. Define the fate of a nonblank symbol in the ${\bf A}$ sequence (top or first member of pair) as the sequence of columns beginning with that symbol and ending just before the column with the next nonblank symbol in the ${\bf A}$ sequence. The fate of ${\mathcal S}$ is start followed by some number of insertions. The fate of ${\mathcal N}$ is either homology or deletion followed by some number of insertions. The fate of ${\mathcal E}$ is termination.

The parameters for determining the probability of an evolutionary history according to a ground process are the insertion rate $\lambda$ expressed per base per time, the deletion rate $\mu$ , and the evolutionary distance . We compute the probabilities using the limit for large of a discrete process with insertion rate $\lambda t/K$ per base, deletion rate $\mu t/K$ per base. The resulting probabilities are expressions in $l = \lambda t$ and $m = \mu t$ .

Let

$\begin{displaymath}B = \frac{l - le^{l-m} } {m - l e^{l - m} }.\end{displaymath}$

Let

$\begin{displaymath}\alpha = l/m.\end{displaymath}$

(This

is $\lambda \beta(t)$ in [TKF], and $\alpha$ is $\lambda \beta(\infty)$ .)

The probability for ${\mathcal S}$ to be part of an evolutionary history with insertions is

$\begin{displaymath}(1-B)B^k.\end{displaymath}$

The probability to extend ${\bf A}$ by another letter is $\alpha$ . The probability to transition to end is $1 - \alpha$ .

Given that ${\bf A}$ is extended by another letter, we have the following probabilities. The probability for homology followed by zero insertions is

$\begin{displaymath}H = e^{-m} (1-B).\end{displaymath}$

The probability for homology followed by

insertions is

$\begin{displaymath}HB^k.\end{displaymath}$

The probability for deletion followed by zero insertions is

$\begin{displaymath}E = B / \alpha .\end{displaymath}$

The probability for deletion followed by exactly one insertion is

$\begin{displaymath}N = (1 - e^{-m} - E)(1-B).\end{displaymath}$

The probability for deletion followed by

insertions is $N B^{k-1}.$

For the start and homology events, every successive insertion is with probability , but for a deletion event, the probability of the first insertion is different. The ground process in [TKF] is presented as a machine with a deletion state that has different transition probabilities from the other states for this possible first insertion.

The insight of [HWKMW], extended in [LMSH], is the description of the ground process using one main state and multiple transitions, including a ``forbidden transition'' with negative transition factor to accomplish the same result. The forbidden transition is from a deletion event with zero insertions to an insertion event.

The evolutionary history is completed to a sequence alignment by writing a letter from the alphabet in place of each ${\mathcal N}$ .

The ground process has parameters $\pi_A, \pi_C, \pi_G, \pi_T$ which define a distribution of letters. The parameter $\sigma$ determines the subsitution rate per base per time. As above, the equations can be expressed in terms of $s = \sigma t$ .

The probability for an insertion or deletion process to produce a letter is $\pi_X$ . For a homology event, the probability for the ground process to produce the pair is $\pi_X f(X,Y)$ , where

$\begin{displaymath}f(X,Y) = (1 - e^{s}) \pi_Y + \delta_{X,Y} e^s,\end{displaymath}$

and $\delta_{X,Y}$ is one or zero depending on whether

and

are the same.

The computation of the array for sequences ${\bf A}$ and ${\bf B}$ is recursive. The base cases are with either index negative is zero and

$\begin{displaymath}P(0,0) = 1 - B.\end{displaymath}$

Thereafter,

$\begin{eqnarray*} P(i,j) & = & P(i-1, j) \cdot B \pi_{{\bf A}[i]} + P(i, j - 1)... ... \\ & & P(i-1, j-1) \cdot B^2 \pi_{{\bf A}[i]}\pi_{{\bf B}[j]}. \end{eqnarray*}$

The probability to observe ${\bf A}$ and ${\bf B}$ related by some evolutionary history is $(1 - \alpha ) P(l_{\bf A}, l_{\bf B}).$

Next: Modifying the one state Up: Algorithm construction Previous: Algorithm construction

Lawren Smithline 2003-11-13