Journal Articles

The A* Search And Applications To Sequence Alignment



K. Kelly, P. Labute
Chemical Computing Group Inc.
1010 Sherbrooke Street W, Suite 910; Montreal, Quebec; Canada H3A 2R7.


February 14, 1996


INTRODUCTION



In this article we present an augmentation of the traditional dynamic programming method of sequence alignment capable of correctly scoring gap initiation and extension penalties when aligning two pairs of alignments. The algorithm is based upon the A* search. We present the A* search algorithm along with a proof of correctness and a description of its application to multiple sequence alignment.

THE A* SEARCH ALGORITHM



The A* Search algorithm is a procedure to find the best path in a directed acyclic graph with associated edge scores. It is a best-first search algorithm that makes use of a heuristic to evaluate the score of unsearched portions of the graph. It has been extensively studied in the context of tree searches; however, all of its graph search properties are not generally known. Therefore, in this section, we will outline the algorithm and offer proofs of its correctness and properties.

Let G* be a directed acyclic graph specified by a collection of nodes, or vertices, V, and a collection of edges each specified by an ordered pair of nodes (m,n) and a positive score s (we say that the edge from m to n has score s). This definition does not exclude the possibility that more than one score can be associated between nodes m and n. A path in G* is a sequence of edges with the property that each edge (except the first) has its "from" node equal to the "to" node of the preceding edge. The score of a path is the product of the edge scores of the edges in the path. (Note that by taking logarithms in all subsequent formulae, the A* search can be applied to summation-based scoring mechanisms). A path from m to n is called optimal if its score is greater than or equal to all other paths from m to n.

We distinguish two distinct nodes in V, the source node, r, and the destination node, f. We define a complete path to be a path that originates at r and terminates at f. The primary objective of the search is to find an optimal complete path. As we shall see, the A* search algorithm solves this problem and has the useful property of being able to produce all best paths.

For each node n in G*, define the forward score, A(n) to be 1 if n = r and the score of an optimal path from r to n otherwise. Similarly, we define the backward score, B(n) to be 1 if n = f and the score of an optimal path from n to f otherwise. Note that the quantity A(n)B(n) is the score of an optimal complete path through n.

The A* search algorithm assumes the existence of a heuristic backward score; that is, for each node n an estimateb(n) of B(n) with the property that b(n) >= B(n). In the context of a graph search, the A* algorithm assumes the existence of a monotone heuristic. A monotone heuristic has the property that for every edge (m,n,s) from node m to node n, b(m) >= sb(n).

We now describe the A* search algorithm in detail. The algorithm constructs subgraph, G, of G* with the property that G will contain all of the "best" paths of G* (the notion of best will be discussed below). The algorithm maintains two sets of nodes: the set O of open nodes (nodes yet to be searched), and the set C of closed nodes (nodes that have been searched). All nodes in G are either open or closed. During the course of the search each node n in G will contain an estimate a(n) of A(n). The A* search algorithm proceeds as follows.

  1. [Initialization] Initially, G contains r and no edges. Set C = {}, O = {r}, a(r) = 1. Fix a constant t > 0 (this controls the termination criteria).
  2. [Selection] Let m be a node in O with maximal heuristic score a(m)b(m) over all other nodes in O. Set C=C+{m} and O=O­{m}.
  3. [Termination] If f is in C and a(m)b(m) < ta(f) then terminate the algorithm.
  4. [Expansion] For each edge (m,n,s) in G* (all edges originating from m) add n and the edge (m,n,s) to G. If n is in O set a(n)=max{a(n),sa(m)}; otherwise if n is not in C then set a(n)=sa(m) and O=O+{n}. Go to Step 2.

The construction in the Expansion step of the algorithm guarantees that a(n) is the score of some path from r to n; however, the path is not necessarily optimal in G* or even in G. Therefore, in general, a(n) <= A(n).

The proof of correctness will require a number of lemmas each of which is interesting on its own and provide insight into the behavior of the A* algorithm.

Lemma 1. Let p be a node in G*. If m and n are two nodes on an optimal path in G* from r to p in which m precedes n then A(m)b(m) >= A(n) b(n).

Proof. It is enough to prove the case where m is the immediate predecessor of n on an optimal path.

Consider such an optimal path from r to p and let s be the score of the edge in the path joining m to n. By truncating the given path to p at n one obtains an optimal path from r to n. If this were false, then one could find a better path from r to p than the given one contradicting the fact that the given path is optimal. The same remark applies to m and so A(n) = A(m)s; hence, using the monotone property

A(n)b(n) = A(m) sb(n) <= A(m)b(m)

as required.

This lemma states that along an optimal path, the monotone heuristic property guarantees that the upper bound on the score of an optimal complete path decreases. The second lemma concerns the forward score estimate a(n) of A(n). Recall that, in general, a(n) <= A(n); however, equality is assured at specific points in the algorithm.

Lemma 2. On every iteration of the A* search algorithm, a node n in the open set O with maximal a(n)b(n) (in particular, the node selected for expansion) has the property that a(n) = A(n).

Proof. The proof is by induction on the iteration number of the algorithm. On the first iteration, O = {r} and since, by definition, A(r) = 1 = a(r) the property holds.

Suppose that there exists a k such that after each iteration less that or equal to k the property held and consider iteration + 1.

Suppose that prior to expansion node n in O has maximal a(n)b(n). Since k > 1 it must be that n <> r. Consider an optimal path in G* from r to n. Let p be the first node in O on this path and let m be its immediate predecessor (m is necessarily in C). Since m was selected for expansion on some prior iteration, it must have had a maximal heuristic score and hence the induction hypothesis implies that for m, a(m) = A(m).

During the expansion of m, the node p was either closed or added to O. If p was closed then it was expanded on some prior iteration and, like m, a(p) = A(p). If p was not closed, then it was added to O with a(p) = a(m)s; hence

a(p) = a(m)s =  A(m)s = A(p)

since m is the immediate predecessor of p on an optimal path from r to p (by the argument in Lemma 1). Since both a(m) = A(m) and a(p) = A(p), we have that

a(p)b(p) = A(p) b(p) >= A(n)b(n) >=  a(n)b(n)

where the first inequality follows from Lemma 1 and the second from the fact that, in general, a(n) <= A(n). Since both m and n are in O, and n has maximal heuristic score, we have a(n)b(n) >= a(p) b(p); therefore

a(p)b(p) >= A(n) b(n) >= a(n)b(n) >=  a(p)b(p)

which implies that a(n) = A(n) as required.

The preceding lemma states, in particular, that when the algorithm terminates, a(f) = A(f); i.e., that the score of the optimal complete path is, in fact, computed by the algorithm. The correctness of the A* search algorithm is given by the following

Theorem. When the A* algorithm terminates, every complete path with a score greater than or equal to tA(f) is contained in G, the subgraph built by the algorithm.

Proof. On termination f is in C, and every node, p, in O is such that a(p)b(p) < tA(f). (By Lemma 2, since f is in C, it must be that a(f) = A(f); hence a(p)b(p) < tA(f).)

Consider a complete path in G* with a score S >= tA(f). Suppose, in order to produce a contradiction, that this path does not exist in G upon termination. It must, then, be that one of the nodes on the path is open. Let n be such an open node. Necessarily, n <> r since r was expanded on the first iteration. Let p be the first open node on an optimal path from r to n and let m be its immediate predecessor on that optimal path. (m is necessarily closed). Let s be the score of the highest scoring edge connecting m and p.

Using Lemma 2, the update formula of the expansion of node m and the fact that m and p are on an optimal path it must be that

A(p) >= a(p) >=  sa(m) = sA(m) = A( p);

hence A(p) = a(p). Since p is open, tA(f) = ta(f) >  a(p)b(p) = A(p)b( p). From Lemma 1 and the fact that the backward estimate is an overestimate, we obtain

A(p)b(p) >= A(n) b(n) >= A(n)B(n).

But the score, S, of the particular path through n under consideration is less than the score of the optimal complete path through n; therefore

S >= tA(f ) > A(n)B(n) >=  S

which is absurd.

APPLICATION TO MULTIPLE SEQUENCE ALIGNMENT



In comparison of protein sequences, a matrix defining relatedness between the different amino acids is used along with a scheme for penalizing the introduction of gaps. It is recognized that non-proportional gap penalizing functions - such as linear functions of the form

penalty = C + E * n

where C is a constant "gap opening" penalty, and E is a proportional gap extension penalty, produce much better alignments than a simple proportional function (Gotoh, 1990).

Given such a scoring scheme, the optimal alignment between two sequences can be calculated by the application of the dynamic programming principle. (Needleman and Wunsh, 1970). This is easily seen if we represent the space of possible alignments between two sequences of lengths M and N as a graph of nodes (i,j) (0 <= i <= M, 0 <= j <= M) where the source node is (0,0) and the goal node is (M,N). The successors of a given node (i,j) are the nodes (i+1,j), (i+1,j+1) and (i,j+1) (if they exist).

In the case of only two sequences, we can derive the optimal score and path to a node (I,J) if we have already computed and stored at each of the predecessor nodes the optimal scores from each of their predecessor nodes. In principle we can similarly work in a K-dimensional space to solve the K-way alignment problem. However, the computational complexity of the recursion step grows too rapidly - faster than K factorial (Altschul, 1989). In practice, therefore, 3-way alignments tend to be the limit.

Consequently the multiple alignment problem is generally attacked through iterative use of two-way alignment of alignments. However, when either of the alignment groups contains internal gaps, then the simple recursion described above is no longer applicable. This can happen whenever an internal gap in one group is matched against a new or pre-existing gap in the other group, and is followed then by a gap being matched against a character. In this case, the information required to assign the gap penalty is no longer guaranteed to be represented in the any of the predecessor nodes.

This means that the problem of aligning two groups of alignments cannot be reduced to the MxN matrix description, and one must instead deal with the (very much) larger state space populated by the nodes (I,J) where each of I and J are bit-vectors of length M and N, where the successors of given node are (I<<1, J<<1), (I<<1 + 1, J<<1), and (I<<1 + 1, J<<1 + 1). Let [x] denote the number of 1 bits in a bit vector x. The source node is (0,0) and the goal nodes are all nodes where [I] == N, and [J] == M).

We have observed that in matrix of partial path scores (constructed by the dynamic programming scheme), if we take the gap penalty whenever it might be applicable, then we have an admissible and monotone heuristic for an A* search. Each cell (i,j) of the matrix serves as a heuristic upper bound for all nodes (I,J) where [I] == i and [J] == j.

For this to be true it must be that that actual cost from a goal node to a node X, plus the edge cost from X to a predecessor node Y is always less than or equal to the H(Y), the heuristic overestimate for the source node (0,0) to node X. This turns out to be true if we take the gap constant penalty only those edges that close gaps.