We compare two sequences
, of lengths
and
using a two dimensional array
.
The indices of
range from zero to the length of each sequence.
Denote by the letter in position
of sequence
,
and by
the subsequence of
from positions
to
.
Our metaphor for comparing and
is writing them on
a long paper tape, like CLUSTAL output without line breaks.
Number the columns from zero,
with the zeroeth position in each row containing a special start symbol,
. The last column on the tape is filled with a special end symbol,
. Otherwise, each position may be a letter, A, C, G, T for
nucleotide sequences, or a blank, and no column is entirely blank.
We model the machine which writes the tape column by column as a Markov
process.
A list of transitions for the machine is an evolutionary history and has a probability which is the product of the probability of each transition.
We define the arrays ,
, and
, with the
same indices as
. Let
be the probability that the machine writes a tape which can
be cut off at some column so that
the first row contains
and
and the second row contains
and
, interspersed with
any combination of blanks.
This array produces the [TKF] sum approach when applied with
their family of Markov models. Their model gives a sensible way to think of Needleman-Wunsch dynamic
programming summed over all paths, rather than the classic Viterbi
trace-back path.
Let be the probability that the machine writes a tape
which can be cut off at some column so that the first row contains
followed by
, and the second row
contains
followed by
, with any combination
of blanks.
We call
a back fill array and
a front fill array.
Let be the probability for the machine to write a tape with
followed by
followed
by
interspersed with blanks in the first row,
and
be the analog for
in the second row.
Let
The combination of front fill and back fill
makes the values in array comparable with each
other as log likelihoods for the evolutionary process
to pass through intermediate points given by coordinates
. The normalization by
and
makes
arrays for different sequences comparable.
That is, we can answer whether it is
more likely
and
are related
than
and
are related.
In the detailed algorithm construction, we modify the
TKF process to produce an array . The concept for
the algorithm is the same, but the array
weights likelihoods for alignments near position
in
and
in
more heavily than positions far away.