3D Bioinformatics and Comparative Protein Modeling in MOE
by K. Kelly
MOE features an advanced, comprehensive set of protein modeling applications that range from sequence analysis and alignment, through multiple-structure alignment and superposition and remote homologue identification, to 3D model building. As built-in components of MOE's computing and visualization environment, they present a uniform, intuitive graphical interface which is fully accessible to the occasional user, while at the same time offering unparalleled flexibility and power to the specialist. This article describes each of the applications that make up MOE's sequence-to-structure suite and demonstrates their use in the creation of the homology models submitted by CCG to the Third Community Wide Experiment on the Critical Assessment of Protein Structure Prediction.
The genome project is creating a wealth of data which promises to radically transform biological knowledge and yield a host of new therapeutic proteins and drug targets. Fulfilling this promise will require functional characterization of the gene products as well as detailed analyses of the underlying biochemical processes in which they participate. Protein structure data is rapidly becoming indispensable to this venture - despite the fact that experimentally determined protein models remain, by comparison to the vast output of sequencing projects, a relatively scarce commodity. This is explained by the emergence of computational techniques which directly exploit knowledge of protein structure to improve the accuracy and reliability of automated sequence-to-structure and structure-to-structure alignments. These techniques are steadily extending the limits of automated, reliable homology identification and comparative modeling to increasingly remote levels of sequence similarity.
At the heart of MOE's protein modeling suite is an advanced protein alignment methodology which incorporates multiple-structure alignment as well as a threading methodology based on the adjustment of similarity scores by Bayesian-based secondary structure prediction. Homology searching is performed over a unique database of high-resolution protein structures, clustered into a highly non-redundant set of structure-based alignments. The application set is embedded into a visualization environment which seamlessly links structure and sequence, and a computational environment including a comprehensive collection of simulation and optimization features. The following picture situates these applications within the sequence-to-structure work flow.
In the fall of 1998, CCG used MOE to create homology models in the third Critical Assessment of Structure Prediction (CASP3). For purposes of this competition, CASP organizers solicited, from experimental protein structure modelers, amino acid sequences for which protein structures were expected to be released in a timely fashion. Researchers into homology, threading and ab initio structure prediction were then invited to apply their methods to the available targets.
Using MOE's homology detection and modeling facilities, models were submitted for targets 55, 62, 69 and 82. Of these, only the model for target 69 remains unassessed, as the experimental model has not yet been released. Targets 55, 62 and 82 were particularly interesting because the sequence identities of the targets to the best template models were quite low: 20% or less.
In the remainder of this article, we will introduce and describe the components of MOE's homology modeling suite, and show - in a step-by-step fashion - how the model for CASP3 target 62 was built. We also include a summary of MOE's performance compared to other research methodologies on targets 55, 62 and 82.
MOE-SearchPDB searches for protein structures that are homologous to a query amino acid sequence. The search tool uses MOE's protein alignment capability to test the query sequence against each entry in the MOE homology databank. This databank is a library of alignments generated by clustering the unique, high-resolution chains in the Protein Data Bank, augmented by data from public domain sequence-only databases, into families of related structures. The key technology used in this process was MOE-Align (described below). For more information on the entire process for building the family database, see Exhaustive and Iterative Clustering of the PDB.
Aligning a query sequence against a relatively small set of reliable and diverse pre-built alignments rather than against all possible candidates individually greatly increases the likelihood of uncovering remote homologous structures. This style of search features better sensitivity as well as better selectivity. The latter is manifested in higher observed Z-scores - the measure of significance of an alignment score - when remote homologs are tested against a diverse family of structures as opposed to individuals.
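The role of the Z-score can be illustrated with a minimal sketch. The article does not state how MOE derives its background distribution, so the function below simply assumes that a list of background alignment scores (for example, scores of the query against unrelated or shuffled sequences) is already available; the name `z_score` and its arguments are illustrative, not part of MOE.

```python
import statistics


def z_score(score, background_scores):
    """Return how many standard deviations `score` lies above the background mean.

    `background_scores` is an assumed input: alignment scores that represent
    chance similarity. How MOE builds this background is not specified here.
    """
    mean = statistics.mean(background_scores)
    spread = statistics.pstdev(background_scores)  # population standard deviation
    if spread == 0:
        raise ValueError("background scores have zero variance")
    return (score - mean) / spread


if __name__ == "__main__":
    # Toy numbers only: a hit scoring 10 against a background centred on 3.
    print(round(z_score(10, [1, 2, 3, 4, 5]), 2))
```

Under this definition, a hit whose score sits far above the background produces a large Z-score, which is the sense in which a high value marks a likely true positive.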
Shown below is the graphical interface to MOE-SearchPDB. The sequence of CASP3 target 62 is shown in the query window, and the result of the search is displayed in the lower part of the panel. As will be seen in the next section, the hit displayed in the results window was actually a family of proteins, of which PDB entry 1CNF was presented as the most similar to the query. The extremely high Z-score guaranteed, in practical terms, that the hit was a true positive. The "hydrophobic fitness score" (Huang et al, JMB, 252:709-720) is a pseudo-energy which measures the degree to which a particular sequence/fold combination results in buried hydrophobic cores. In this instance, the calculated value strongly confirmed not only the validity of the hit, but also the accuracy of the global alignment of the query to the target family.
MOE-Align is a powerful and flexible application for multiple sequence and structure alignment of protein chains. Among its distinguishing features is its ability to calculate affine gap penalties accurately when aligning alignments. (For more detail on this point, see The A* Search and Applications to Sequence Alignment.) MOE-Align also possesses a structure-based alignment facility, which can be employed even when only some of the sequences being aligned have associated 3D data available. The following snapshot illustrates the MOE-Align panel set over MOE's Sequence Editor window showing the alignment loaded from the previous search.
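To make the term "affine gap penalty" concrete, here is a minimal sketch of the standard three-state dynamic program (often attributed to Gotoh) for the score of a pairwise global alignment. It is not MOE-Align's algorithm: it handles only two plain sequences rather than two alignments, and the scoring values (`match`, `mismatch`, `gap_open`, `gap_extend`) are arbitrary illustrative choices. A gap of length k costs `gap_open + (k - 1) * gap_extend`.

```python
def affine_align_score(a, b, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best global alignment score of strings a and b under affine gap penalties."""
    neg = float("-inf")
    n, m = len(a), len(b)
    # M: alignment ends with a[i-1] paired to b[j-1]
    # X: alignment ends with a[i-1] paired to a gap
    # Y: alignment ends with b[j-1] paired to a gap
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + s
            # Opening a gap costs gap_open; continuing the same gap costs gap_extend.
            X[i][j] = max(M[i - 1][j] + gap_open,
                          X[i - 1][j] + gap_extend,
                          Y[i - 1][j] + gap_open)
            Y[i][j] = max(M[i][j - 1] + gap_open,
                          Y[i][j - 1] + gap_extend,
                          X[i][j - 1] + gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # One gap of length 2 is cheaper than two separate gaps of length 1.
    print(affine_align_score("ACGT", "AT"))
```

The pairwise case is exact because each cell knows whether the previous column was a gap. When aligning alignments, whether a column opens or extends a gap can differ from row to row, which is why exact affine scoring becomes a distinguishing feature there.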
The Sequence Editor allows amino acid chains to be displayed in their current alignments, with optional display of the dynamically updated, color-coded DSSP secondary structure assignments and plots of residue properties of interest (e.g., hydrophobicity, charge, degree of solvent exposure). Ramachandran plots, chi angle plots and alpha-carbon distance plots are created and displayed using MOE plot functions. Residues can be selected according to class (alcoholic, hydrophobic, etc) or structural characteristics (secondary structure, buried vs. accessible, conserved region, etc). Finally, the interaction between the Sequence Editor and the 3D rendering window allows one to isolate regions of interest.
The simplest homology modeling scenario occurs when building a model from a single template with relatively high similarity to the target protein. In such cases, models can be built more or less automatically using MOE-Align and MOE-Homology. On the other hand, when the sequence identity is relatively low, or when MOE-SearchPDB identifies a family of relatively diverse structures as homologous to a query, further analysis of the alignment may be fruitful. MOE contains a number of analysis tools which take advantage of the powerful interplay between the 3D view of protein structures and the sequence level view presented in the Sequence Editor. For example, MOE-Consensus is an interactive application for identifying regions of an alignment in which structural and/or sequence variation is below a user-specified cutoff. Shown below is a picture of the Protein Consensus panel:
MOE-Consensus calculates the RMSD values of the mainchain atoms at each position of the alignment and of all the atoms at positions which conserve residue identity. As one moves the sliders in the panel, the set of atoms identified as belonging to the conserved core of the family is isolated accordingly in the 3D rendering window (shown below), while the residues to which they belong are highlighted in the Sequence Editor (this is shown above).
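The per-position calculation can be sketched as follows. This is a simplified illustration, not MOE-Consensus itself: it assumes the structures are already superposed, uses a single representative atom per alignment position (for instance the alpha carbon) instead of all mainchain atoms, and measures RMSD as the deviation from the mean position. The function names and the input layout are hypothetical.

```python
import math


def position_rmsd(coords):
    """RMSD of one atom's (x, y, z) positions across structures, about their mean."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    mean_sq = sum(
        sum((c[k] - centroid[k]) ** 2 for k in range(3)) for c in coords
    ) / n
    return math.sqrt(mean_sq)


def conserved_core(alignment_coords, cutoff=1.0):
    """Indices of alignment positions whose RMSD is at or below `cutoff` (Angstroms).

    `alignment_coords[i]` is the list of coordinates observed at position i,
    one entry per structure that has a residue there (assumed input layout).
    """
    return [
        i for i, coords in enumerate(alignment_coords)
        if position_rmsd(coords) <= cutoff
    ]


if __name__ == "__main__":
    positions = [
        [(0.0, 0.0, 0.0), (0.2, 0.0, 0.0)],  # tightly conserved position
        [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)],  # structurally variable position
    ]
    print(conserved_core(positions, cutoff=1.0))
```

Lowering `cutoff` plays the role of moving the slider in the panel: fewer positions qualify, and the conserved core shrinks.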
Returning to our example using the CASP3 target 62, we find that the sequence identity of the query to the family of homologous structures ranged from approximately 16% to less than 10%. The pairwise percentage identities within the family itself showed similar values. In this case, MOE-Consensus was used to identify the conserved core at a conservative criterion of 1.0 Angstroms RMSD; those atoms belonging to the core were then fixed in place during the first phase of the energy refinement stage of the homology modeling procedure.
MOE-Homology creates full-atom, energy-minimized 3D models of amino acid sequences from one or more template structures. The underlying methodology is based on a combination of the segment-matching procedure of Levitt (JMB 226:507-533) and an approach to the modeling of indels based on that of Fechteler et al (JMB 253:114-131).
By default, MOE-Homology creates 10 models, each of which is generated by making a series of Boltzmann-weighted choices of sidechain rotamers and loop conformations from a set of protein fragments selected from the built-in library of high-resolution protein structures. Each of the candidate models can be saved in a molecular database for further analysis, while an average model is created and then submitted to a user-controlled level of potential energy minimization. One can pick any of MOE's standard molecular mechanics forcefields which include two variants of the AMBER forcefield as well as MMFF94.
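A Boltzmann-weighted choice can be sketched in a few lines. The article does not say which score or temperature MOE-Homology uses, so the example below assumes each candidate fragment or rotamer carries an energy-like score in kcal/mol and uses kT of about 0.593 kcal/mol (roughly room temperature) purely as an illustrative default. The function names are hypothetical.

```python
import math
import random


def boltzmann_probabilities(energies, kT=0.593):
    """Normalized Boltzmann weights exp(-E / kT) for a list of candidate energies."""
    lowest = min(energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(e - lowest) / kT) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]


def boltzmann_choice(energies, kT=0.593, rng=random):
    """Pick the index of one candidate, favouring low energies."""
    probs = boltzmann_probabilities(energies, kT)
    return rng.choices(range(len(energies)), weights=probs, k=1)[0]


if __name__ == "__main__":
    candidate_energies = [0.0, 0.5, 2.0]  # made-up scores for three candidates
    rng = random.Random(0)  # seeded so repeated runs pick the same sequence
    picks = [boltzmann_choice(candidate_energies, rng=rng) for _ in range(10)]
    print(boltzmann_probabilities(candidate_energies))
    print(picks)
```

Because the choice is probabilistic rather than greedy, repeating the procedure yields a set of distinct candidate models, which is consistent with the multiple models described above.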
Structure verification programs play an integral role in the creation of protein structure models whether the models be determined from experimental data or homology modeling. Inspections of high-resolution entries in the Protein Data Bank (e.g., Morris et al, Proteins, 12:345-364, 1992) have resulted in the emergence of a number of objective criteria that assess the stereochemical quality of a putative model.
MOE-ProEval is an implementation of the PROCHECK suite of stereochemical measurements that computes and prints out a detailed list of protein stereochemical properties. A sample output is shown below:
Statistical outliers are detected as follows:
Additionally, the MOE menus include a wide variety of plotting and measurement features including, for example, an alpha-Carbon distance matrix plot, residue hydrophobicity and solvent-accessibility plots and sequence-based predictions of 3-state secondary structure propensities. A graphic of one such example, the Ramachandran plot, is shown here:
This section includes tables of results which were extracted from data available at the CASP3 web site. For each research group, the results for the model designated as number one are presented. Each table is sorted using the crn field, which is defined as the alpha-carbon RMSD between model and target, divided by the number of modelled alpha carbons. This field is the default primary sort key on the CASP web site. The %Atoms field contains the percentage of target atoms which were present in the submitted model. Models for which 100% of the atoms were modelled are printed in boldface. The %Align field refers to an estimate of the accuracy of the alignment used to build the model.
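As a worked illustration of the crn definition given above, the following sketch computes the alpha-carbon RMSD over paired coordinates and divides it by the number of modelled alpha carbons. It assumes the model has already been superposed on the target and that the two coordinate lists are in matching residue order; how the CASP3 assessors performed the superposition is not described here, and the function names are illustrative.

```python
import math


def ca_rmsd(model_ca, target_ca):
    """RMSD in Angstroms between paired, pre-superposed alpha-carbon coordinates."""
    if len(model_ca) != len(target_ca) or not model_ca:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (mx - tx) ** 2 + (my - ty) ** 2 + (mz - tz) ** 2
        for (mx, my, mz), (tx, ty, tz) in zip(model_ca, target_ca)
    )
    return math.sqrt(total / len(model_ca))


def crn(model_ca, target_ca):
    """Alpha-carbon RMSD divided by the number of modelled alpha carbons."""
    return ca_rmsd(model_ca, target_ca) / len(model_ca)


if __name__ == "__main__":
    model = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    target = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(crn(model, target))  # RMSD of 1.0 over 2 modelled atoms
```

Dividing by the number of modelled atoms rewards completeness: two models with the same RMSD are ranked in favour of the one that covers more of the target.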
The purpose of this article has been to present MOE's comprehensive protein modeling suite. Using the various tools in MOE, one can perform the complete sequence-to-structure workflow, from remote homology detection through model building and refinement to 3D visualization. The suite of applications combines ease of use with advanced, demonstrably effective methodologies. To illustrate these points, we have used CCG's results in the CASP3 competition as an example. Using MOE, the homology models were built from templates sharing approximately 20% or less sequence identity with the target sequences. All the models assessed by the CASP3 organizers compared well to the best models submitted by the various participating research groups.