Protein Analysis in MOE: The Serine Proteases
Chemical Computing Group Inc.
The purpose of this article is to demonstrate the ease with which fundamental insights into protein structure and function can be gained using MOE's suite of protein investigation tools. The specific tools discussed in this study are MOE-Align, MOE-Superpose, MOE-Consensus and MOE-ContactAnalyzer.
Used in cooperation with MOE's main visualizing areas, these tools make interactive analysis of protein sequence and structure data fast and painless.
For this demonstration, the X-ray structures of fifteen related serine proteases were chosen from the Brookhaven Protein Data Bank (PDB). The proteins were selected by running MOE's PDBSearch application against a non-redundant subset of high-resolution chains containing no chain breaks and a minimal number of missing sidechain atoms. A file containing the PDB structures of the serine proteases used in this demonstration is shipped with MOE ($MOE/sample/serine_prot.moe).
The serine proteases are an extensively studied family of related endopeptidases, characterized by their so-called catalytic triad, Asp...His...Ser. The mechanism of action, believed to be similar for all members of this family, involves nucleophilic attack on a peptide carbonyl carbon by the hydroxyl group of the serine. The imidazole ring of the histidine functions as a base, enhancing the nucleophilicity of the serine. The aspartic acid residue, although not directly involved in the active site, can be seen in X-ray structures to form a hydrogen bond with the histidine, suggesting that the carboxylate group of Asp is involved in a "protonation relay" (or "charge relay"), which shuttles protons back and forth among the members of the triad.
For a family of similar proteins, the regions of conserved primary, secondary and tertiary structure tend to include the residues involved in the active site(s), as well as other residues important to activity. For example, the members of the catalytic triad are far apart in the primary structures (amino acid sequences) of the serine proteases, but these residues are brought to within bond-forming distance by the tertiary structure, or folding, of these proteins. Thus, investigating a group of related proteins involves analyzing the primary, secondary and tertiary structure data for the entire family. The remainder of this article demonstrates how MOE's protein analysis tools can be used interactively to accomplish these tasks.
Aligning and Superposing the Serine Proteases
The first step in analyzing the serine proteases using MOE is to align their amino acid sequences with MOE's protein alignment facility, MOE-Align. At the heart of MOE-Align is a powerful and completely general group-to-group implementation of the Needleman-Wunsch dynamic programming paradigm for protein sequence alignment. Among its unique features are its abilities to:
Built on top of the pairwise group-to-group facility is a flexible protocol for creating multiple alignments, comprising a progressive (or "pileup") component followed by round-robin and randomized iterative refinement stages. The protocol also supports alignment or "freezing" of arbitrary subsets.
When we apply MOE-Align to the sample set of serine proteases, an alignment based purely on sequence data is produced. Shown below is a portion of the pairwise percentage identity table printed out by MOE-Align. Each entry is the percentage of that row's residues which are aligned against identical residues in the chain named in the corresponding column. Because the percentage is taken relative to the row chain, the table is not symmetric.
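To make the dynamic programming idea concrete, the following is a minimal sketch of the classic pairwise Needleman-Wunsch recurrence. It is not MOE-Align's group-to-group implementation: it aligns only two plain sequences, and the scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions rather than a real substitution matrix or MOE's gap model.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b.

    Returns (score, aligned_a, aligned_b). Scoring values are
    illustrative assumptions, not those used by any particular tool.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] against b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

A group-to-group variant would replace the single-residue comparison in `sub` with a score between two alignment columns, which is what allows previously built alignments to be aligned against each other.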
pro_Align: pairwise percentage residue identity

            1MCT.A   1BIT 2CGA.A   1ELG 1FUJ.A   1TRY   1DST 1ETR.H
 1 1MCT.A :  100.0   65.3   41.2   35.8   38.9   36.2   37.7   33.2
 2 1BIT   :   65.0  100.0   38.0   32.5   36.2   38.8   38.2   30.9
 3 2CGA.A :   45.3   41.9  100.0   37.5   33.9   36.2   33.8   31.7
 4 1ELG   :   38.6   35.1   36.7  100.0   33.5   29.0   34.2   27.8
 5 1FUJ.A :   38.6   36.0   30.6   30.8  100.0   23.2   36.4   23.2
 6 1TRY   :   36.3   39.2   33.1   27.1   23.5  100.0   32.5   30.9
 7 1DST   :   38.6   39.2   31.4   32.5   37.6   33.0  100.0   28.2
 8 1ETR.H :   38.6   36.0   33.5   30.0   27.1   35.7   32.0  100.0
 9 1ELT   :   35.9   35.6   36.7   66.2   34.8   29.0   31.6   26.6
10 1LMW.B :   36.3   33.3   32.2   30.0   30.8   32.6   31.6   26.6
11 1HNE.E :   33.6   32.9   27.3   34.6   53.4   26.8   32.9   23.6
12 1PPF.E :   34.1   32.9   28.2   35.0   54.3   26.8   32.9   23.2
13 1SGT   :   31.4   31.5   28.6   29.2   25.8   40.6   25.0   25.9
14 3RP2.A :   32.7   29.7   29.8   30.4   37.6   22.3   35.1   25.1
15 1HYL.A :   33.2   31.1   29.0   28.3   27.1   29.5   28.9   24.7
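The row-relative definition of percentage identity can be sketched as follows. This is a hedged illustration, not MOE-Align's code: the function name `percent_identity` and the use of `-` as the gap character are assumptions made for the example.

```python
def percent_identity(row_seq, col_seq, gap="-"):
    """Percentage of residues in row_seq aligned against an identical
    residue in col_seq. Both arguments are rows of one alignment, so
    they must have the same length (gaps included).
    """
    if len(row_seq) != len(col_seq):
        raise ValueError("aligned sequences must have equal length")

    # Normalize by the number of real residues in the ROW chain only.
    row_residues = sum(1 for r in row_seq if r != gap)
    if row_residues == 0:
        return 0.0

    identical = sum(
        1 for r, c in zip(row_seq, col_seq) if r != gap and r == c
    )
    return 100.0 * identical / row_residues


if __name__ == "__main__":
    long_chain = "ACDEFG"
    short_chain = "AC--FG"
    # The two directions differ because the chains differ in length.
    print(round(percent_identity(long_chain, short_chain), 1))
    print(round(percent_identity(short_chain, long_chain), 1))
```

Normalizing by the row chain's residue count is what makes an entry such as row 1MCT.A against column 1BIT differ slightly from row 1BIT against column 1MCT.A when the two chains have different lengths.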
The sequence-only alignment produced by MOE-Align is then automatically used to superpose the alpha-carbon traces of the 3-D protein structures. A global RMSD value of 3.0675 is initially produced. The structure-based re-alignment stage reduces this value to 1.4340.
MOE-Superpose calculates the globally optimal superposition (not a superposition to an average, nor to a privileged chain). It can also calculate a pairwise table which allows the user to estimate the validity of the alignment. The superposition can also be restricted to a user-controllable subset of residues.
Below is a portion of the all-pairs table of root mean square distance (RMSD) values calculated by MOE-Superpose on the serine proteases. The upper triangle contains the optimal pairwise RMSD values between chains under the given alignment. The lower triangle contains the difference between each optimal pairwise RMSD and the actual pairwise RMSD imposed on the chains by the optimal global superposition. Note that, to three decimal places of precision, the two sets of values are the same, so every difference is reported as 0.000.
            1MCT.A   1BIT 2CGA.A   1ELG 1FUJ.A   1TRY   1DST
 1 1MCT.A :      -  0.817  1.737  0.900  1.075  1.216  1.331
 2 1BIT   :  0.000      -  1.897  1.268  1.336  1.472  1.498
 3 2CGA.A :  0.000  0.000      -  1.839  1.898  2.078  1.960
 4 1ELG   :  0.000  0.000  0.000      -  1.034  1.116  1.091
 5 1FUJ.A :  0.000  0.000  0.000  0.000      -  1.266  1.355
 6 1TRY   :  0.000  0.000  0.000  0.000  0.000      -  1.417
 7 1DST   :  0.000  0.000  0.000  0.000  0.000  0.000      -
 8 1ETR.H :  0.000  0.000  0.000  0.000  0.000  0.000  0.000
 9 1ELT   :  0.000  0.000  0.000  0.000  0.000  0.000  0.000
10 1LMW.B :  0.000  0.000  0.000  0.000  0.000  0.000  0.000
11 1HNE.E :  0.000  0.000  0.000  0.000  0.000  0.000  0.000
12 1PPF.E :  0.000  0.000  0.000  0.000  0.000  0.000  0.000
13 1SGT   :  0.000  0.000  0.000  0.000  0.000  0.000  0.000
14 3RP2.A :  0.000  0.000  0.000  0.000  0.000  0.000  0.000
15 1HYL.A :  0.000  0.000  0.000  0.000  0.000  0.000  0.000

pro_Superpose: global RMSD = 1.434
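For readers who want to see what a pairwise RMSD entry measures, here is a minimal sketch. It assumes the coordinates have already been superposed and matched position by position through the alignment; the rotation and translation fit that MOE-Superpose performs is deliberately omitted, and the function names and sample coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between two equal-length lists of
    (x, y, z) points that correspond position by position.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def pairwise_rmsd_table(chains):
    """Return {(name_p, name_q): rmsd} for every unordered pair of chains.

    `chains` maps a chain name to its list of already-superposed
    coordinates (an assumption of this sketch).
    """
    names = list(chains)
    table = {}
    for i, p in enumerate(names):
        for q in names[i + 1:]:
            table[(p, q)] = rmsd(chains[p], chains[q])
    return table


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    demo = {
        "chainA": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        "chainB": [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)],
    }
    for pair, value in pairwise_rmsd_table(demo).items():
        print(pair, round(value, 3))
```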
Identifying the Structurally Conserved Core
When we have the luxury of multiple structures, as we do here, we will most certainly wish to identify the conserved structural core of the family. MOE provides the MOE-Consensus tool for this task. Given an alignment, the Consensus tool calculates the pairwise RMSD among all the heavy atoms in every fully populated alignment column. Using a slider in the MOE-Consensus panel, the user can control which columns in the alignment are selected, and immediately see which pieces of the family meet the specified criteria. The Sequence Editor and the MOE window are automatically updated to show these results.
Shown below is a picture of the Sequence Editor showing the alignment columns selected by MOE-Consensus at an RMSD cutoff of 0.7 Angstroms.
The following image shows an isolated view of the corresponding backbone fragments.
MOE's 3D molecule rendering window
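The column selection performed with the RMSD slider can be sketched as below. This is one plausible reading of the description, not MOE-Consensus itself: it uses a single representative point per residue (for example the alpha carbon) instead of all heavy atoms, it takes the root mean square of all pairwise distances within a column as that column's RMSD, and the names `column_rmsd` and `conserved_columns` are hypothetical.

```python
import math
from itertools import combinations


def column_rmsd(points):
    """Root mean square of all pairwise distances among the points
    occupying one alignment column (one point per chain).
    """
    pairs = list(combinations(points, 2))
    if not pairs:
        raise ValueError("need at least two points")
    squared = sum(
        sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs
    )
    return math.sqrt(squared / len(pairs))


def conserved_columns(columns, cutoff):
    """Return the indices of fully populated columns whose RMSD is at
    or below `cutoff`. A gap is represented by None (an assumption).
    """
    selected = []
    for index, column in enumerate(columns):
        if len(column) < 2 or any(point is None for point in column):
            continue  # only fully populated columns are considered
        if column_rmsd(column) <= cutoff:
            selected.append(index)
    return selected


if __name__ == "__main__":
    # Three chains, three alignment columns, made-up superposed coordinates.
    columns = [
        [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)],  # tight
        [(0.0, 0.0, 0.0), None, (1.0, 1.0, 1.0)],              # has a gap
        [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)],  # spread out
    ]
    print(conserved_columns(columns, cutoff=0.7))
```

Moving the slider corresponds to calling `conserved_columns` again with a different `cutoff`; since the per-column values do not change, an interactive tool would typically compute them once and only redo the comparison.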
MOE-ContactAnalyzer is an interactive tool which searches for inter-residue hydrogen bonds, ionic contacts (salt bridges), disulfide bonds and "hydrophobic contacts", and presents the contact list in its window. The user can then browse and isolate contacts in both the Sequence Editor and the main MOE window. Contacts between residues that are distant in sequence, when conserved throughout a family of proteins, often reflect functionally important regions of conserved tertiary structure, such as active sites.
The MOE-ContactAnalyzer can be used to identify sidechain-sidechain hydrogen bonds within each of the protein chains. To facilitate the identification of conserved bonds in the structural core, we specify that only the contacts within the selected regions identified by MOE-Consensus are to be reported. Furthermore, the lists in the ContactAnalyzer are sorted by alignment order.
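The idea of restricting contacts to the structural core and then looking for those shared by every chain can be sketched as follows. This is a hypothetical illustration, not the ContactAnalyzer's logic: contacts are represented as pairs of alignment column indices, and the function name, data layout and sample numbers are all assumptions.

```python
def conserved_contacts(contacts_by_chain, core_columns=None):
    """Return the contacts present in every chain, sorted by alignment order.

    contacts_by_chain maps a chain name to an iterable of
    (column_i, column_j) pairs, where each index is the alignment
    column of a residue taking part in the contact.
    core_columns, if given, is a set of column indices; contacts with
    either residue outside it are ignored.
    """
    common = None
    for contacts in contacts_by_chain.values():
        # Store each pair in a canonical order so (i, j) equals (j, i).
        normalized = {tuple(sorted(pair)) for pair in contacts}
        if core_columns is not None:
            normalized = {
                (i, j) for (i, j) in normalized
                if i in core_columns and j in core_columns
            }
        common = normalized if common is None else common & normalized
    return sorted(common or [])


if __name__ == "__main__":
    # Made-up column indices for three chains.
    contacts = {
        "chainA": [(40, 85), (40, 170), (10, 20)],
        "chainB": [(85, 40), (170, 40)],
        "chainC": [(40, 85), (40, 170), (30, 60)],
    }
    print(conserved_contacts(contacts, core_columns={40, 85, 170, 10, 20}))
```

Expressing contacts in alignment columns rather than in each chain's own residue numbering is what makes the lists directly comparable across the family.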
Having performed these steps we notice an obvious group of conserved contacts:
The ContactAnalyzer automatically isolates selected contacts in the Sequence Editor and the molecule rendering window. Shown below is the isolated catalytic triad from all 15 chains.
ContactAnalyzer implements the hydrogen bond test published by Stickle and Rose (JMB 1992). Their test is strict enough to isolate genuine hydrogen bonds, while loose enough to be usable on protein models, in which hydrogen positions are generally not published because the resolution is insufficient to locate them. The test comprises distance checks between the relevant heavy atoms, as well as angle tests, both out-of-plane (for sp2 hybridized donors/acceptors) and scalar angles (with varying thresholds according to the atoms involved).
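The general shape of a heavy-atom-only hydrogen bond test, a distance check followed by an angle check, can be sketched as below. This is not the published test: the thresholds `max_distance` and `min_angle` are placeholder assumptions rather than the published values, the out-of-plane test is omitted, and a single angle threshold is used instead of atom-type-dependent ones.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def angle_deg(a, b, c):
    """Angle at vertex b, in degrees, formed by the points a-b-c."""
    v1 = [x - y for x, y in zip(a, b)]
    v2 = [x - y for x, y in zip(c, b)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("coincident points; angle undefined")
    cosine = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding error
    return math.degrees(math.acos(cosine))


def is_hydrogen_bond(donor, acceptor, acceptor_antecedent,
                     max_distance=3.5, min_angle=90.0):
    """Heavy-atom geometric test (thresholds are illustrative only).

    donor, acceptor: positions of the donor and acceptor heavy atoms.
    acceptor_antecedent: the heavy atom bonded to the acceptor.
    """
    if distance(donor, acceptor) > max_distance:
        return False
    # Angle donor...acceptor-antecedent, measured at the acceptor.
    return angle_deg(donor, acceptor, acceptor_antecedent) >= min_angle


if __name__ == "__main__":
    donor = (0.0, 0.0, 0.0)
    acceptor = (3.0, 0.0, 0.0)
    print(is_hydrogen_bond(donor, acceptor, (4.0, 1.0, 0.0)))
    print(is_hydrogen_bond(donor, acceptor, (2.0, 1.0, 0.0)))
```

Working from heavy atoms alone is what lets a test of this kind run on structures without hydrogen coordinates, at the cost of needing angle criteria that stand in for the missing hydrogen position.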
Having collected a number of homologs via PDBSearch or other tools, we used MOE-Align to automatically create a multiple alignment based on both structural and sequential criteria.
MOE-Consensus allowed us to quickly identify and isolate the structurally conserved regions of the alignment. The results were displayed in the Sequence Editor and the MOE rendering window.
We used MOE-ContactAnalyzer to identify sidechain-sidechain hydrogen bonds within each of the chains. We organized the lists in the ContactAnalyzer by alignment order, in order to facilitate the identification of conserved bonds.
The process of multiple sequence and structure alignment, structural core identification, and feature analysis (e.g., contact analysis) is at the heart of the study of proteins. MOE's analysis tools offer access to powerful underlying computations in a framework which enhances ease of use and supports visual inspection in both the 3D molecule rendering window and the Sequence Editor. The tools are all written in MOE's built-in programming language, SVL, and are shipped as source code with MOE.