Next: Algorithm Summary Up: Probabilistic pairwise sequence alignment Previous: Probabilistic pairwise sequence alignment

Introduction

The problems of inferring common ancestors and determining the evolution of observed biological sequences have received much attention. The challenge becomes greater with the interest in mining genome scale sequences for meaning. The most popular tools produce mountains of data which are hard to assimilate.

Current solution strategies for sequence alignment begin by postulating some model of evolution in order to score proposed alignments. This transforms the task into development of good score functions and the search for alignments with high score.

The BLAST family of tools uses a hash-and-extend method to find regions of the test sequences which have high similarity, measured by its score model. A hash-and-extend method searces for identical short subsequences by a very fast hash table search and extends these short exact matches to longer not quite exact matches.

Dynamic programming tools, including CLUSTAL, use a score array to find the best scoring global alignment. Needleman-Wunsh and Smith-Waterman are classic algorithms for the two sequence global alignment problem.

Both of these approaches are useful for aligning sequences when the true picture is a one-to-one correspondence, possibly with simple deletions or insertions.

BLAST picks out one-to-many similarities. CLUSTAL finds a one-to-one alignment of whole sequences or alignment profiles. Neither is optimal for the other task.

Both approaches are fooled by sequences which have internal near repeats. This can happen when there are duplications from part of an ancestor sequence in each of its descendants, When the duplications are not identical for each descendant or duplications have independently mutated, there are many regions in one sequence similar to many regions in another. CLUSTAL chooses one path, which maximizes global score, but may have a nonsense agglomeration of gaps. BLAST typically reports every pair of similar subsequences. It is an easy in silico demonstration that BLAST can miss similiarities entirely if mutations are peppered just frequently enough.

Our goal is to develop an alignment method which finds good global alignments and does not cut out those alignments which score well but not highest.

Merits of our score function are: it is meaningful as a probability; it describes locally the value of an alignment at a point; and it is comparable within an alignment and across alignments.

These features permit detection of repeats and duplications.

We concentrate on developing technique for pairwise alignment in a way which is extensible to small numbers of sequences.

A larger goal, not yet realized, is multiple genomic alignments. Toward this end, we suggest a conceptually easy way to avoid computing on regions which score so poorly, that they are unlikely to contribute to any of the good alignments. It is mentioned after the algorithm description for clarity of the exposition.

Here is the structure of the paper. Section 2 is a technical summary of the algorithm. Section 3 is the detailed construction. Section 4 presents example applications.

Next: Algorithm Summary Up: Probabilistic pairwise sequence alignment Previous: Probabilistic pairwise sequence alignment

Lawren Smithline 2003-11-13