Multiple Sequence and Structure Alignment in MOE


K. Kelly
Chemical Computing Group Inc.


Introduction

MOE possesses a powerful and flexible facility for multiple sequence and multiple structure alignment of protein chains. A unique feature of MOE's protein alignment tool, MOE-Align, is that it accepts a mixture of structured and unstructured data. The foundations of the alignment procedure are:

  • Sequence Alignment. A general implementation of the Needleman-Wunsch alignment which accurately calculates the non-proportional gap penalty even when aligning alignments. This alignment procedure is based upon the A* search algorithm (described in "The A* Search and Applications to Sequence Alignment" in this issue of the Journal of the Chemical Computing Group).

  • Structure Superposition. The superposition steps in the alignment procedures use the SVL function Superpose, a general superposition procedure that operates on multiple 3-dimensional point sets. Flexible weights govern both the contribution of the i'th point in each point set and the weight given to the interaction between any two point sets; for example, individual corresponding points can be weighted by importance.

In this article we describe MOE's multiple sequence and multiple structure alignment procedure, its output, and the user interface of the application.

Alignment Protocol

The input to the alignment process is a collection of protein chains. Typically, these chains are the results of a sequence search of an unstructured database (e.g., PIR) and the results of a search of a structured database (e.g., PDB). The objective of MOE-Align is to produce a single multiple sequence alignment and a simultaneous superposition of all structured chains in the input collection.

Multiple sequence alignment is implemented with a flexible 3-stage protocol based upon pairwise alignment of alignments comprising the following steps:

  1. Pile Up. Each sequence is added to an accumulated alignment one at a time in order to create an initial alignment.

  2. Round Robin. The initial alignment is refined as follows. Each sequence in turn is removed from the collection and re-aligned to the remaining sequences, which are treated as a single block; that is, we perform a 1-many alignment for each of the sequences.

  3. Random Refinement. A random iterative refinement process is conducted in an effort to improve the alignment. On each iteration, the collection of sequences is randomly partitioned into two subsets. This pair of subsets is treated as two blocks and re-aligned.
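
The three stages above can be sketched in a few lines of Python. This is an illustrative sketch only, not MOE's implementation: it assumes a simple match/mismatch score and a linear per-residue gap penalty (MOE-Align uses an A*-based method with non-proportional gap penalties), and every name in it (`pair_score`, `align_blocks`, `realign`, `pile_up`, `round_robin`, `random_refinement`) is hypothetical.

```python
import random

GAP = "-"

def pair_score(a, b, match=2, mismatch=-1, gap=-2):
    """Score one residue pair (assumed scheme); gap opposite gap costs nothing."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def column_score(col_a, col_b):
    """Sum pair scores between two columns (cross-block pairs only)."""
    return sum(pair_score(a, b) for a in col_a for b in col_b)

def sum_of_pairs(aln):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    total = 0
    for col in zip(*aln):
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                total += pair_score(col[i], col[j])
    return total

def strip_gap_columns(block):
    """Drop columns that contain only gaps."""
    keep = [c for c in zip(*block) if any(x != GAP for x in c)]
    return ["".join(row) for row in zip(*keep)]

def align_blocks(block_a, block_b):
    """Needleman-Wunsch over whole columns of two blocks (alignment of alignments)."""
    cols_a, cols_b = list(zip(*block_a)), list(zip(*block_b))
    gap_a, gap_b = (GAP,) * len(block_a), (GAP,) * len(block_b)
    m, n = len(cols_a), len(cols_b)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        best[i][0] = best[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, n + 1):
        best[0][j] = best[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                best[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                best[i][j - 1] + column_score(gap_a, cols_b[j - 1]))
    out, i, j = [], m, n
    while i > 0 or j > 0:  # trace back from the bottom-right corner
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + column_score(
                cols_a[i - 1], cols_b[j - 1]):
            out.append(cols_a[i - 1] + cols_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] + column_score(
                cols_a[i - 1], gap_b)):
            out.append(cols_a[i - 1] + gap_b); i -= 1
        else:
            out.append(gap_a + cols_b[j - 1]); j -= 1
    return ["".join(row) for row in zip(*reversed(out))]

def realign(aln, left, right):
    """Re-align the rows indexed by `left` against the rows indexed by `right`."""
    merged = align_blocks(strip_gap_columns([aln[k] for k in left]),
                          strip_gap_columns([aln[k] for k in right]))
    out = [None] * len(aln)
    for row, k in zip(merged, left + right):
        out[k] = row
    return out

def pile_up(seqs):
    """Stage 1: add each sequence to the accumulated alignment, one at a time."""
    aln = [seqs[0]]
    for s in seqs[1:]:
        aln = align_blocks(aln, [s])
    return aln

def round_robin(aln):
    """Stage 2: re-align each sequence against all the others as one block."""
    for k in range(len(aln)):
        aln = realign(aln, [r for r in range(len(aln)) if r != k], [k])
    return aln

def random_refinement(aln, iterations=10, seed=0):
    """Stage 3: repeatedly split the rows into two random blocks and re-align."""
    rng = random.Random(seed)
    for _ in range(iterations):
        order = list(range(len(aln)))
        rng.shuffle(order)
        cut = rng.randint(1, len(aln) - 1)
        aln = realign(aln, sorted(order[:cut]), sorted(order[cut:]))
    return aln
```

Under this scoring scheme, each block re-alignment is optimal for the cross-block score while leaving the within-block scores unchanged, so the sum-of-pairs score cannot decrease from one stage to the next. Calling `random_refinement(round_robin(pile_up(seqs)))` on a list of at least two ungapped sequences chains the three stages.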

The Round Robin and Random Refinement stages significantly improve the multiple sequence alignment in almost all cases. However, each stage is optional, which allows, in particular, the refinement of existing alignments. The entire process is controlled from a single graphical control panel.

The fundamental sequence alignment algorithm of MOE is capable of producing a pairwise alignment of two sequences given an arbitrary similarity matrix. For sequence data, MOE-Align uses residue-identity-based matrices (roughly 20x20). For structured data, MOE-Align uses an MxN similarity matrix whose entries are derived from the spatial displacement of residue pairs in a 3D superposition of the structures. This raises a subtle point: a residue with missing coordinate data cannot be compared to structured residues in the other chain.
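
To illustrate what "alignment given an arbitrary similarity matrix" means, here is a minimal textbook Needleman-Wunsch sketch in Python that takes any MxN matrix of scores. It is an assumption-laden stand-in, not MOE's algorithm: it uses dynamic programming with a linear gap penalty rather than the A* search and non-proportional gap penalties described in this article, and the function name `needleman_wunsch` is hypothetical.

```python
def needleman_wunsch(sim, gap=-2.0):
    """Globally align the M rows against the N columns of a similarity matrix.

    sim[i][j] is the score for pairing residue i of the first chain with
    residue j of the second. Returns (score, pairs), where pairs is a list
    of (i, j) tuples and None marks a gap on that side.
    """
    m, n = len(sim), len(sim[0])
    # best[i][j] = best score aligning the first i rows with the first j columns
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        best[i][0] = best[i - 1][0] + gap
    for j in range(1, n + 1):
        best[0][j] = best[0][j - 1] + gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(best[i - 1][j - 1] + sim[i - 1][j - 1],
                             best[i - 1][j] + gap,
                             best[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] + gap):
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    pairs.reverse()
    return best[m][n], pairs


def identity_matrix(seq_a, seq_b, match=2.0, mismatch=-1.0):
    """Build an M x N matrix from residue identity (assumed scores)."""
    return [[match if a == b else mismatch for b in seq_b] for a in seq_a]
```

The same `needleman_wunsch` routine works unchanged whether the matrix comes from residue identities, as in `identity_matrix`, or from distances between superposed coordinates; only the way the matrix is populated differs.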

In light of this problem, MOE-Align classifies each input chain into one of three classes:

  1. Complete Structured - a chain with a complete alpha carbon trace.
  2. Incomplete Structured - a chain with a partial alpha carbon trace.
  3. Unstructured - all other chains (e.g., primary sequence only).

Given such a partition of the input, it is necessary to use separate metrics (depending on the classes) to align each pair of chains. MOE-Align uses a 3D structure similarity matrix only if both chains have been classified as Complete Structured, and a residue identity similarity matrix in all other cases.
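
The classification and the choice of metric can be written down directly. The sketch below is illustrative only; it assumes each chain is represented as a list with one entry per residue, holding either an alpha carbon coordinate or `None` when the coordinate is missing. The names `ChainClass`, `classify_chain`, and `choose_metric` are hypothetical.

```python
from enum import Enum


class ChainClass(Enum):
    COMPLETE = "Complete Structured"
    INCOMPLETE = "Incomplete Structured"
    UNSTRUCTURED = "Unstructured"


def classify_chain(ca_coords):
    """Classify a chain from its per-residue alpha carbon coordinates.

    ca_coords holds one (x, y, z) tuple per residue, or None where the
    coordinate is missing (assumed representation).
    """
    known = sum(1 for c in ca_coords if c is not None)
    if ca_coords and known == len(ca_coords):
        return ChainClass.COMPLETE
    if known > 0:
        return ChainClass.INCOMPLETE
    return ChainClass.UNSTRUCTURED


def choose_metric(class_a, class_b):
    """3D structure similarity only when both chains are Complete Structured."""
    if class_a is ChainClass.COMPLETE and class_b is ChainClass.COMPLETE:
        return "structure"
    return "identity"
```

For example, a chain with one missing alpha carbon is classified as Incomplete Structured, so any pair involving it falls back to the residue identity matrix.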

If more than two Complete Structured chains are present, MOE-Align performs a complete structural subalignment before proceeding with the alignment of the other chains. In the subsequent alignments, the structured data is always treated as a single block, effectively reducing the problem to one in which there is at most one Complete Structured chain (or alignment). Thus, we obtain the following alignment protocol:

  1. Sequence Pile Up. Each sequence is added to an accumulated alignment one at a time in order to create an initial alignment.

  2. Sequence Round Robin. The initial alignment is refined as follows. Each sequence in turn is removed from the collection and re-aligned to the remaining sequences, which are treated as a single block; that is, we perform a 1-many alignment for each of the sequences.

  3. Sequence Random Refinement. A random iterative refinement process is conducted in an effort to improve the alignment. On each iteration, the collection of sequences is randomly partitioned into two subsets. This pair of subsets is treated as two blocks and re-aligned.

  4. Structural Refinement. If there are more than two Complete Structured chains in the collection then MOE-Align proceeds with the Structural Refinement stage. In this stage all of the Complete Structured chains are refined separately (see below).

  5. Random Refinement. A random iterative refinement process is conducted in an effort to improve the alignment. On each iteration, the collection of sequences is randomly partitioned into two subsets (always treating the Complete Structured chains as a single collective). This pair of subsets is treated as two blocks and re-aligned.

The Structural Refinement stage starts from an initial alignment (the one produced from sequence data alone by the sequence stages of the above procedure). It is a Round Robin and Random Refinement protocol, except that an MxN 3D similarity matrix is used. In each of these stages, the Complete Structured chains are partitioned into two sets in preparation for a pairwise alignment of alignments. This pairwise alignment proceeds as follows.

  1. Correspondence Definition. The current alignment is used to define a set of alpha carbon correspondences. Each column of the alignment in which there are no gaps defines one correspondence set of alpha carbons.

  2. Structure Superposition. The correspondences and the input coordinates are passed to an optimal n-body superposition. The optimal translations and rotations are calculated and the coordinates of all the chains are updated.

  3. Matrix Population. The entire collection of superposed 3D coordinates is now used to populate an MxN 3D similarity matrix based upon the distances between alpha carbon coordinates.

  4. Re-alignment. The 3D similarity matrix is then used as input to the Sequence Alignment algorithm which produces a new alignment for the chains.

This procedure is iterated, and MOE-Align successively refines a (possibly) non-optimal initial alignment (e.g., one generated by a multiple sequence alignment) for the Complete Structured chains.
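
Two of the steps in this loop, Correspondence Definition and Matrix Population, are easy to sketch in Python. The article does not give the formula that turns a distance into a similarity, so the decay function and its `scale` parameter below are assumptions chosen purely for illustration; the superposition step itself is omitted, and the function names are hypothetical.

```python
import math


def gap_free_columns(alignment):
    """Return one tuple of residue indices per gap-free alignment column.

    alignment is a list of equal-length gapped strings. Each returned tuple
    holds, for every chain, the index of its residue in that column; these
    are the correspondence sets used for superposition.
    """
    counters = [0] * len(alignment)
    correspondences = []
    for column in zip(*alignment):
        if all(c != "-" for c in column):
            correspondences.append(tuple(counters))
        for k, c in enumerate(column):
            if c != "-":
                counters[k] += 1
    return correspondences


def similarity_matrix(coords_a, coords_b, scale=3.0):
    """Populate an M x N matrix from alpha carbon distances.

    Assumed form: 1 / (1 + (d / scale)^2), which is 1 for coincident points
    and decays toward 0 as the distance d grows. The coordinates are taken
    to be already superposed.
    """
    return [[1.0 / (1.0 + (math.dist(p, q) / scale) ** 2) for q in coords_b]
            for p in coords_a]
```

The matrix returned by `similarity_matrix` plays the same role as a residue identity matrix in the re-alignment step: it is simply a table of pairwise scores handed to the sequence alignment algorithm.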

The procedures so far are sufficient to produce a multiple sequence alignment of mixed structured and unstructured protein chains. In addition, the Complete Structured chains will have been superposed. However, one detail remains: the superposition of the Incomplete Structured coordinates. For each Incomplete Structured chain in the input collection, MOE-Align performs the following:

  1. The final output alignment is used to define a set of alpha carbon correspondences between the Incomplete Structured chain and the collection of superposed Complete Structured chains.

  2. For each column of the Incomplete Structured chain in which the Complete Structured chains are not all gaps, a 3D coordinate is determined by averaging the alpha carbon coordinates of the Complete Structured chains.

  3. The corresponding alpha carbons of the Incomplete Structured chain and the averages from the Complete Structured chains are used as input to a 2-body superposition. The results of this superposition are used to update the coordinates of the Incomplete Structured chain.
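
The first two of these steps amount to building two matched point lists for the final 2-body superposition. A minimal Python sketch follows, assuming that rows are gapped strings, that coordinates are listed per residue, and that missing alpha carbons are stored as `None`; the function name `target_points` is hypothetical and the superposition itself is not shown.

```python
def target_points(complete_rows, complete_coords, incomplete_row, incomplete_coords):
    """Pair structured residues of an incomplete chain with averaged targets.

    complete_rows: gapped alignment rows of the Complete Structured chains.
    complete_coords: per chain, a list of (x, y, z) alpha carbon coordinates.
    incomplete_row: gapped alignment row of the Incomplete Structured chain.
    incomplete_coords: per residue, an (x, y, z) tuple or None if missing.
    Returns (mobile, target): equal-length lists of corresponding points.
    """
    counters = [0] * len(complete_rows)  # next residue index in each complete chain
    next_residue = 0                     # next residue index in the incomplete chain
    mobile, target = [], []
    for col, residue in enumerate(incomplete_row):
        present = []
        for c, row in enumerate(complete_rows):
            if row[col] != "-":
                present.append(complete_coords[c][counters[c]])
                counters[c] += 1
        if residue == "-":
            continue
        xyz = incomplete_coords[next_residue]
        next_residue += 1
        if xyz is not None and present:
            # Average the alpha carbons of the complete chains in this column.
            mean = tuple(sum(p[d] for p in present) / len(present) for d in range(3))
            mobile.append(xyz)
            target.append(mean)
    return mobile, target
```

Columns in which the incomplete chain has a gap or a missing coordinate, or in which every complete chain has a gap, contribute no correspondence.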

In this way a collection of mixed sequence and structural data can be simultaneously aligned and superposed in such a way that the structural data affects the sequence alignment and vice versa.

The output of the entire procedure is written to the MOE terminal window. The report consists of the global RMSD of the superposed chains and a pairwise assessment of similarity and alignment quality. As an example, we take structures from the Cytochrome C family with PDB codes

256B.A 1RCP.A 2CCY.A 1BBH.A

To form a basis for comparison, we disable the structured alignment components of MOE-Align and apply the procedure (Pile-Up, Round Robin, and Random Refinement) using residue identities only. This produces the output

pro_Align score (sum of pairs): 5632.0 (pileup)
pro_Align score (sum of pairs): 5680.0 (round robin)
pro_Align score (sum of pairs): 5681.0 (shuffle #8)
pro_Align score (sum of pairs): 5694.0 (shuffle #9)

pro_Align: pairwise percentage residue identity
         256B.A  1RCP.A  2CCY.A  1BBH.A
256B.A:   100.0    28.3    22.6    31.1
1RCP.A:    23.3   100.0    25.6    24.8
2CCY.A:    18.9    26.0   100.0    24.4
1BBH.A:    25.2    24.4    23.7   100.0

256B.A : ADL--EDNME T----L-ND- -NL-KV---I EKADNAAQVK DA-LTKMRAA
1RCP.A : ADT--KEVLE ARE-AY-FK- -SLGGS---M KAMTGVAKAF DAEAAKVEAA
2CCY.A : QSK-PEDLLK LRQ-GL-MQ- -TL-KSQW-V PIAGFAAGKA DL-PADAAQR
1BBH.A : AGLSPEEQIE TRQAGYEFMG WNMGKIKANL EGEYNAAQV- EA-AANVIAA

256B.A : ALDAQKAT-- -P---PK-LE --D-K-SPDS PE------MK DFRHGFDILV
1RCP.A : KLEKILATDV AP-LFPAGTS STDLP-GQTE AKAAIWANMD DFGAKGKAMH
2CCY.A : AENMAMVAKL APIGWAKGTE --ALPNGETK PE-AFGSKSA EFLEGWKALA
1BBH.A : IANSGMGALY GP-GTDKNVG --DVK-TRVK PE--FFQNME DVGKIAREFV

256B.A : GQIDDALKLA NEGKVKEAQA AAEQLKTTRN AYHQKYR---
1RCP.A : EAGGAVIAAA NAGDGAAFGA ALQKLGGTCK ACHDDYREED
2CCY.A : TESTKLAAAA KAGP-DALKA QAAATGKVCK ACHEEFK-QD
1BBH.A : GAANTLAEVA ATGEAEAVKT AFGDVGAACK SCHEKYR-AK
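
The identity table in this output is not symmetric (for example, 28.3 for 256B.A against 1RCP.A but 23.3 the other way round). That pattern would be consistent with each row being normalized by the length of that row's chain; this is an inference from the numbers, not something stated in the output. A minimal Python sketch under that assumption, with a hypothetical function name:

```python
def percent_identity(alignment):
    """Pairwise percentage residue identity from a multiple alignment.

    alignment is a list of equal-length gapped strings. Entry [i][j] is the
    number of columns where chains i and j hold the same residue, as a
    percentage of the ungapped length of chain i (assumed normalization).
    """
    n = len(alignment)
    lengths = [sum(1 for c in row if c != "-") for row in alignment]
    table = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            same = sum(1 for a, b in zip(alignment[i], alignment[j])
                       if a == b and a != "-")
            table[i][j] = 100.0 * same / lengths[i]
    return table
```

With this normalization the diagonal is always 100.0, and two chains of different lengths give different values above and below the diagonal even though they share the same number of identical positions.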

When the structural alignment components are enabled in order to refine the above alignment, MOE-Align produces the following output.

pro_Align: pairwise percentage residue identity
         256B.A  1RCP.A  2CCY.A  1BBH.A
256B.A:   100.0    20.8    16.0    21.7
1RCP.A:    17.1   100.0    19.4    22.5
2CCY.A:    13.4    19.7   100.0    18.1
1BBH.A:    17.6    22.1    17.6   100.0

pro_Align global RMSD: 6.463
pro_Align global RMSD: 3.092
pro_Align global RMSD: 2.936
pro_Align global RMSD: 2.738
pro_Align global RMSD: 2.685

Pairwise RMSD
    - upper triangle under optimal pairwise superposition
    - lower triangle under optimal global superposition
         256B.A  1RCP.A  2CCY.A  1BBH.A
256B.A:   0.000   3.497   3.218   3.255
1RCP.A:   3.506   0.000   2.555   2.111
2CCY.A:   3.219   2.570   0.000   2.309
1BBH.A:   3.258   2.190   2.335   0.000

256B.A : ---------- ADLEDNMETL NDNLKVIEKA -----DNAAQ VKDALTKMRA
1RCP.A : --ADTK-EVL EAREAYFKSL GGSMKAMTGV AK--AFDAEA AKVEAAKLEK
2CCY.A : --QSKPEDLL KLRQGLMQTL KSQWVPIAGF AAGKADLPAD AAQRAENMAM
1BBH.A : AGLSPE-EQI ETRQAGYEFM GWNMGKIKAN L-EGEYNAAQ VEAAANVIAA

256B.A : AALDAQK-AT PPKLED---- ------KSPD SPEMKDFRHG FDILVGQIDD
1RCP.A : ILATDVAPLF PAGTSSTDLP GQTEA-KAAI WANMDDFGAK GKAMHEAGGA
2CCY.A : VAKLAPIGWA KGTEA----L PNGETKPEAF GSKSAEFLEG WKALATESTK
1BBH.A : IANSGMGALY GPGTDKNVGD VKTRVKPEFF Q-NMEDVGKI AREFVGAANT

256B.A : ALKLANEGKV KEAQAAAEQL KTTRNAYHQK YR---
1RCP.A : VIAAANAGDG AAFGAALQKL GGTCKACHDD YREED
2CCY.A : LAAAAKA-GP DALKAQAAAT GKVCKACHEE FKQD-
1BBH.A : LAEVAATGEA EAVKTAFGDV GAACKSCHEK YRAK-

Notice that the global RMSD of all of the structures falls from 6.463 (when the pre-structural alignment is used) to 2.685 after structural refinement. Also notice that the gaps at positions 20 and 50 have been cleaned up.
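
For reference, the RMSD between two chains over a set of corresponding alpha carbons can be computed as in the Python sketch below. It assumes the coordinates have already been superposed and that the two lists are in correspondence order; how MOE-Align combines the chains into a single global RMSD is not specified in this article, so only the pairwise case is shown. The function name `rmsd` is hypothetical.

```python
import math


def rmsd(points_a, points_b):
    """Root-mean-square deviation between two equal-length lists of 3D points.

    The points are compared as given; no superposition is performed here.
    """
    if not points_a or len(points_a) != len(points_b):
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(points_a, points_b))
    return math.sqrt(total / len(points_a))
```

Applied to the same pair of chains before and after refinement, a function of this kind gives the sort of comparison reported in the pairwise RMSD table.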