Current solution strategies for sequence alignment begin by postulating some model of evolution in order to score proposed alignments. This transforms the task into development of good score functions and the search for alignments with high score.
The BLAST family of tools uses a hash-and-extend method to find regions of the test sequences which have high similarity, measured by its score model. A hash-and-extend method searces for identical short subsequences by a very fast hash table search and extends these short exact matches to longer not quite exact matches.
Dynamic programming tools, including CLUSTAL, use a score array to find the best scoring global alignment. Needleman-Wunsh and Smith-Waterman are classic algorithms for the two sequence global alignment problem.
Both of these approaches are useful for aligning sequences when the true picture is a one-to-one correspondence, possibly with simple deletions or insertions.
BLAST picks out one-to-many similarities. CLUSTAL finds a one-to-one alignment of whole sequences or alignment profiles. Neither is optimal for the other task.
Both approaches are fooled by sequences which have internal
near repeats. This can happen when there are duplications
from part of an ancestor sequence in each of its descendants,
When the duplications are not identical for each descendant or
duplications have independently mutated, there are many regions in
one sequence similar to many regions in another.
CLUSTAL chooses one path, which maximizes global score, but
may have a nonsense agglomeration of gaps. BLAST typically
reports every pair of similar subsequences. It is an easy in silico
demonstration that BLAST can miss similiarities entirely
if mutations are peppered just frequently enough.
Our goal is to develop an alignment method which finds good global alignments and does not cut out those alignments which score well but not highest.
Merits of our score function are: it is meaningful as a probability; it describes locally the value of an alignment at a point; and it is comparable within an alignment and across alignments.
These features permit detection of repeats and duplications.
We concentrate on developing technique for pairwise alignment in a way which is extensible to small numbers of sequences.
A larger goal, not yet realized, is multiple genomic alignments.
Toward this end, we suggest a conceptually easy way
to avoid computing on regions which score so poorly,
that they are unlikely to contribute to any of the good alignments.
It is mentioned after the algorithm description for
clarity of the exposition.
Here is the structure of the paper. Section 2 is a technical summary of the algorithm. Section 3 is the detailed construction. Section 4 presents example applications.