The A* Search And Applications To Sequence Alignment
Chemical Computing Group Inc.
1010 Sherbrooke Street W, Suite 910; Montreal, Quebec; Canada H3A 2R7.
In this article we present an augmentation of the traditional dynamic programming method of sequence alignment capable of correctly scoring gap initiation and extension penalties when aligning two pairs of alignments. The algorithm is based upon the A* search. We present the A* search algorithm along with a proof of correctness and a description of its application to multiple sequence alignment.
THE A* SEARCH ALGORITHM
The A* Search algorithm is a procedure to find the best path in a directed acyclic graph with associated edge scores. It is a best-first search algorithm that makes use of a heuristic to evaluate the score of unsearched portions of the graph. It has been extensively studied in the context of tree searches; however, all of its graph search properties are not generally known. Therefore, in this section, we will outline the algorithm and offer proofs of its correctness and properties.
Let G* be a directed acyclic graph specified by a collection of nodes, or vertices, V, and a collection of edges each specified by an ordered pair of nodes (m,n) and a positive score s (we say that the edge from m to n has score s). This definition does not exclude the possibility that more than one score can be associated between nodes m and n. A path in G* is a sequence of edges with the property that each edge (except the first) has its "from" node equal to the "to" node of the preceding edge. The score of a path is the product of the edge scores of the edges in the path. (Note that by taking logarithms in all subsequent formulae, the A* search can be applied to summation-based scoring mechanisms). A path from m to n is called optimal if its score is greater than or equal to all other paths from m to n.
We distinguish two distinct nodes in V, the source node, r, and the destination node, f. We define a complete path to be a path that originates at r and terminates at f. The primary objective of the search is to find an optimal complete path. As we shall see, the A* search algorithm solves this problem and has the useful property of being able to produce all best paths.
For each node n in G*, define the forward score, A(n) to be 1 if n = r and the score of an optimal path from r to n otherwise. Similarly, we define the backward score, B(n) to be 1 if n = f and the score of an optimal path from n to f otherwise. Note that the quantity A(n)B(n) is the score of an optimal complete path through n.
The A* search algorithm assumes the existence of a heuristic backward score; that is, for each node n an estimateb(n) of B(n) with the property that b(n) >= B(n). In the context of a graph search, the A* algorithm assumes the existence of a monotone heuristic. A monotone heuristic has the property that for every edge (m,n,s) from node m to node n, b(m) >= sb(n).
We now describe the A* search algorithm in detail. The algorithm constructs subgraph, G, of G* with the property that G will contain all of the "best" paths of G* (the notion of best will be discussed below). The algorithm maintains two sets of nodes: the set O of open nodes (nodes yet to be searched), and the set C of closed nodes (nodes that have been searched). All nodes in G are either open or closed. During the course of the search each node n in G will contain an estimate a(n) of A(n). The A* search algorithm proceeds as follows.
The construction in the Expansion step of the algorithm guarantees that a(n) is the score of some path from r to n; however, the path is not necessarily optimal in G* or even in G. Therefore, in general, a(n) <= A(n).
The proof of correctness will require a number of lemmas each of which is interesting on its own and provide insight into the behavior of the A* algorithm.
Lemma 1. Let p be a node in G*. If m and n are two nodes on an optimal path in G* from r to p in which m precedes n then A(m)b(m) >= A(n) b(n).
This lemma states that along an optimal path, the monotone heuristic property guarantees that the upper bound on the score of an optimal complete path decreases. The second lemma concerns the forward score estimate a(n) of A(n). Recall that, in general, a(n) <= A(n); however, equality is assured at specific points in the algorithm.
Lemma 2. On every iteration of the A* search algorithm, a node n in the open set O with maximal a(n)b(n) (in particular, the node selected for expansion) has the property that a(n) = A(n).
The preceding lemma states, in particular, that when the algorithm terminates, a(f) = A(f); i.e., that the score of the optimal complete path is, in fact, computed by the algorithm. The correctness of the A* search algorithm is given by the following
Theorem. When the A* algorithm terminates, every complete path with a score greater than or equal to tA(f) is contained in G, the subgraph built by the algorithm.
APPLICATION TO MULTIPLE SEQUENCE ALIGNMENT
In comparison of protein sequences, a matrix defining relatedness between the different amino acids is used along with a scheme for penalizing the introduction of gaps. It is recognized that non-proportional gap penalizing functions - such as linear functions of the form
where C is a constant "gap opening" penalty, and E is a proportional gap extension penalty, produce much better alignments than a simple proportional function (Gotoh, 1990).
Given such a scoring scheme, the optimal alignment between two sequences can be calculated by the application of the dynamic programming principle. (Needleman and Wunsh, 1970). This is easily seen if we represent the space of possible alignments between two sequences of lengths M and N as a graph of nodes (i,j) (0 <= i <= M, 0 <= j <= M) where the source node is (0,0) and the goal node is (M,N). The successors of a given node (i,j) are the nodes (i+1,j), (i+1,j+1) and (i,j+1) (if they exist).
In the case of only two sequences, we can derive the optimal score and path to a node (I,J) if we have already computed and stored at each of the predecessor nodes the optimal scores from each of their predecessor nodes. In principle we can similarly work in a K-dimensional space to solve the K-way alignment problem. However, the computational complexity of the recursion step grows too rapidly - faster than K factorial (Altschul, 1989). In practice, therefore, 3-way alignments tend to be the limit.
Consequently the multiple alignment problem is generally attacked through iterative use of two-way alignment of alignments. However, when either of the alignment groups contains internal gaps, then the simple recursion described above is no longer applicable. This can happen whenever an internal gap in one group is matched against a new or pre-existing gap in the other group, and is followed then by a gap being matched against a character. In this case, the information required to assign the gap penalty is no longer guaranteed to be represented in the any of the predecessor nodes.
This means that the problem of aligning two groups of alignments cannot be reduced to the MxN matrix description, and one must instead deal with the (very much) larger state space populated by the nodes (I,J) where each of I and J are bit-vectors of length M and N, where the successors of given node are (I<<1, J<<1), (I<<1 + 1, J<<1), and (I<<1 + 1, J<<1 + 1). Let [x] denote the number of 1 bits in a bit vector x. The source node is (0,0) and the goal nodes are all nodes where [I] == N, and [J] == M).
We have observed that in matrix of partial path scores (constructed by the dynamic programming scheme), if we take the gap penalty whenever it might be applicable, then we have an admissible and monotone heuristic for an A* search. Each cell (i,j) of the matrix serves as a heuristic upper bound for all nodes (I,J) where [I] == i and [J] == j.
For this to be true it must be that that actual cost from a goal node to a node X, plus the edge cost from X to a predecessor node Y is always less than or equal to the H(Y), the heuristic overestimate for the source node (0,0) to node X. This turns out to be true if we take the gap constant penalty only those edges that close gaps.