Analysis of the Structurally Conserved Core of a Set of Proteins
Chemical Computing Group Inc.
With today's large and ever-expanding databanks of solved protein structures, structural protein analysis is becoming increasingly powerful as a tool for making functional inferences about structures. By analyzing related protein structures and homologous sequences for conserved features and differences, important geometric and contact information can be inferred which might otherwise remain undetected when examining only sequence alignments.
In this article, we show how fundamental insights into protein structure and function can easily be gained using two of the interactive protein investigation tools of Chemical Computing Group's Molecular Operating Environment (MOE): Protein Consensus and Protein Contacts.
For purposes of demonstration, we have chosen the X-ray structures of thirty-five related kinases from the Brookhaven Protein Data Bank (PDB). The proteins were selected by running MOE's PDB Search application against a non-redundant subset of high-resolution chains containing no chain breaks and at most a small number of missing sidechain atoms.
To analyze a group of related proteins, the primary, secondary and tertiary structure data for the entire family should be investigated. In this article, we will demonstrate how to find the conserved core of the above set of kinases and to locate their stabilizing contacts.
The first step in the structural analysis of a set of proteins is to align their amino acid sequences. In MOE, the protein alignment facility features a powerful and completely general group-to-group implementation of the Needleman-Wunsch dynamic programming paradigm for protein sequence alignment, coupled with a structural realignment engine.
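The dynamic programming idea behind Needleman-Wunsch can be illustrated with a minimal pairwise sketch. It is not MOE's implementation: MOE's facility is group-to-group and coupled to structural realignment, whereas this sketch aligns two sequences only. The function name and the match, mismatch and linear gap scores are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    Illustrative scoring only: identity match/mismatch and a linear gap
    penalty, not a substitution matrix or affine gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```

A production aligner would typically replace the identity scores with a residue substitution matrix and use affine gap penalties; the table-filling and traceback structure stays the same.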
Next, the conserved core is derived. With the conserved core in hand, the structural regions most likely to be essential to the function of the proteins can be examined both visually and computationally.
In MOE, the Protein Consensus tool is used to calculate the conserved core and to aid in visualization. The conserved core, defined on a given alignment of the protein structures, comprises the common structural regions of the protein, and can be specified using a number of criteria:
The Protein Consensus panel provides interactive control over the definition of the consensus set. For example, in the picture below, the core is defined to be the regions of the proteins where, in each alignment column:
Visualization options allow the consensus set to be dynamically displayed in both the MOE Window and the Sequence Editor. In the images below, the backbones of the chains in the conserved core of the kinases are colored by RMSD (blue is low, red is high) on the left, and by sequence variability on the right. The residues belonging to the core are also colored in the Sequence Editor according to sequence variability.
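As a rough illustration of per-column measures such as RMSD and sequence variability, the sketch below computes both for each alignment column and selects core columns. The data layout, the function names, the use of Shannon entropy as the variability measure, and the `min_occupancy` and `max_rmsd` thresholds are all assumptions made for this example; they are not the criteria used by Protein Consensus.

```python
import math
from collections import Counter


def column_rmsd(points):
    """RMS distance of superposed atom positions from their centroid."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    squared = sum(
        sum((p[k] - centroid[k]) ** 2 for k in range(3)) for p in points
    )
    return math.sqrt(squared / n)


def column_entropy(residues):
    """Shannon entropy in bits of the residue letters; 0 means fully conserved."""
    total = len(residues)
    counts = Counter(residues)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def core_columns(columns, min_occupancy=1.0, max_rmsd=2.0):
    """Return indices of columns that pass the (hypothetical) core criteria.

    Each column is a list with one entry per chain: either None for a gap
    or a tuple (residue_letter, (x, y, z)) taken after superposition.
    """
    core = []
    for index, column in enumerate(columns):
        present = [entry for entry in column if entry is not None]
        if not present or len(present) / len(column) < min_occupancy:
            continue  # too many gaps in this column
        if column_rmsd([xyz for _, xyz in present]) <= max_rmsd:
            core.append(index)
    return core


# Toy alignment of three chains over three columns.
columns = [
    [("K", (0.0, 0.0, 0.0)), ("K", (0.1, 0.0, 0.0)), ("R", (0.0, 0.1, 0.0))],
    [("G", (5.0, 0.0, 0.0)), None, ("G", (5.0, 0.1, 0.0))],
    [("A", (0.0, 0.0, 0.0)), ("S", (10.0, 0.0, 0.0)), ("T", (0.0, 10.0, 0.0))],
]
print(core_columns(columns))  # [0]
```

Lowering `min_occupancy` admits columns that contain gaps, which is one way such a definition could be relaxed when a single chain is missing a residue.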
The conserved core can be used to improve the alignment of the proteins. By weighting the conserved regions more heavily, a better superposition and a correspondingly better alignment can be obtained. In this way, the alignment and the conserved core can be iteratively refined.
In addition to looking at regions of consensus, it is useful to investigate the regions of difference, in particular the structural outliers of a set of proteins. Protein Consensus will automatically construct dendrograms of the system, mapping structure and sequence proximity relationships among the proteins. Below is the dendrogram for the set of kinases, using atomic RMSD in the conserved core as the proximity metric. We see that the conserved core of chain 35 is distant from the rest of the system; in subsequent analysis we may wish to exclude it.
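To show how a dendrogram can expose an outlier, here is a minimal sketch of agglomerative clustering on a pairwise distance matrix. The source does not state which clustering method Protein Consensus uses; average linkage is assumed here, and the function name and toy distances are hypothetical.

```python
def average_linkage(dist, labels):
    """Cluster items bottom-up; return merges as (cluster_a, cluster_b, distance).

    dist is a symmetric matrix of pairwise distances (for example, core RMSD)
    and labels names the rows. Average linkage is an assumed choice.
    """
    index = {label: i for i, label in enumerate(labels)}
    clusters = {i: (label,) for i, label in enumerate(labels)}

    def distance(a, b):
        # Mean distance over all member pairs of the two clusters.
        pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
        return sum(dist[index[x]][index[y]] for x, y in pairs) / len(pairs)

    merges = []
    next_id = len(labels)
    while len(clusters) > 1:
        ids = sorted(clusters)
        a, b = min(
            ((p, q) for i, p in enumerate(ids) for q in ids[i + 1:]),
            key=lambda pq: distance(*pq),
        )
        merges.append((clusters[a], clusters[b], distance(a, b)))
        clusters[next_id] = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        next_id += 1
    return merges


# Toy distances: A, B and C are close to one another, D is far from all.
labels = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.5, 1.0, 5.0],
    [0.5, 0.0, 1.0, 5.0],
    [1.0, 1.0, 0.0, 5.0],
    [5.0, 5.0, 5.0, 0.0],
]
for merge in average_linkage(dist, labels):
    print(merge)
```

In this toy input, chain D joins the tree last and at a much larger distance than the other merges, which is the pattern that marks a structural outlier in a dendrogram.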
Contact analysis is an important step in understanding protein structure. Distant contacts conserved across a family of proteins often reflect functionally important regions of conserved tertiary structure, such as active sites or areas that contribute to the stability of the structure.
Protein Contacts in MOE is an interactive tool for finding inter-residue hydrogen bonds, ionic contacts (salt bridges), disulfide bonds and "hydrophobic contacts". Contacts within protein chains, between chains, or between proteins and ligands can be detected. Contacts can also be inferred between residues of sequence-only chains, based on their positions in the alignment relative to contacting residues of homologues. Once found, contacts can be browsed and isolated in both the Sequence Editor and the main MOE window.
To facilitate the identification of conserved contacts in the structural core, we can restrict contact analysis to only the structurally conserved regions as identified by Protein Consensus. In our example, we will further examine only those contacts which are conserved in at least twenty-five of the thirty-five kinases. (That is, if a contact is reported between residues A and B, then at least twenty-four of the other chains also have a similar contact between their two residues in the same columns of the alignment.)
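The conservation filter described above can be sketched as a simple count over alignment columns. The sketch assumes each chain's contacts have already been mapped to pairs of alignment column indices; the function name and data layout are hypothetical, and deciding whether two contacts are "similar" (for example, of the same type) is left out.

```python
from collections import Counter


def conserved_contacts(contacts_by_chain, min_chains):
    """Return {column_pair: chain_count} for contacts seen in >= min_chains chains.

    contacts_by_chain maps a chain name to an iterable of (column_i, column_j)
    pairs, where the columns are positions in the multiple alignment.
    """
    counts = Counter()
    for contacts in contacts_by_chain.values():
        # Sort each pair so (i, j) and (j, i) are the same contact, and use a
        # set so a chain contributes at most once per column pair.
        counts.update({tuple(sorted(pair)) for pair in contacts})
    return {pair: n for pair, n in counts.items() if n >= min_chains}


# Toy input with four chains; the article's setting would use min_chains=25
# over thirty-five chains.
contacts_by_chain = {
    "chain1": [(10, 42), (15, 80)],
    "chain2": [(42, 10), (15, 80)],
    "chain3": [(10, 42)],
    "chain4": [(10, 42), (15, 80), (3, 7)],
}
print(conserved_contacts(contacts_by_chain, min_chains=3))
```

Counting by column pair rather than by residue number is what lets contacts from different chains be compared, since residue numbering differs between homologues while alignment columns are shared.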
The image above shows how the Protein Contacts panel displays chain and residue information for the detected contacts. The rightmost column of the window details the contact networks. A contact network is a connected group of contacts, for example, where residue A contacts residue B, B contacts C, and C contacts both D and E. Such a network forms a set of stabilizing contacts in a spatial region of the protein. The two largest hydrophobic networks in the system were examined further.
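A contact network as defined above is a connected component of the graph whose nodes are residues and whose edges are contacts. The following sketch finds such components with a depth-first search; the function name and residue labels are hypothetical, and this is not a description of how Protein Contacts computes its networks.

```python
def contact_networks(contacts):
    """Group residues into connected networks, largest first.

    contacts is an iterable of (residue_a, residue_b) pairs.
    """
    adjacency = {}
    for a, b in contacts:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = set()
    networks = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first search collects every residue reachable from start.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        networks.append(component)
    return sorted(networks, key=len, reverse=True)


# The example from the text (A-B, B-C, C-D, C-E) plus an unrelated pair.
contacts = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E"), ("X", "Y")]
networks = contact_networks(contacts)
print([sorted(network) for network in networks])
# [['A', 'B', 'C', 'D', 'E'], ['X', 'Y']]
```

Sorting the components by size makes it straightforward to pick out the largest networks for closer inspection.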
We demonstrated the use of two of MOE's protein analysis tools for the structural analysis of a set of thirty-five kinase proteins.
The Protein Consensus tool allows for the quick identification and isolation of the structurally conserved regions of a set of aligned proteins. The results can be visualized in both the Sequence Editor and the MOE window. A dendrogram provides further information about the protein structural relationships.
Protein Contacts identifies various types of contacts within (and among) the protein chains. We examined the larger hydrophobic networks to see how the contacts stabilize the conserved core.
The process of multiple sequence and structure alignment, structural core identification, and feature analysis (e.g., contact analysis) is at the heart of protein structure analysis. MOE's analysis tools offer quick, easy access to powerful underlying computations. The tools are all written in MOE's built-in programming language, SVL, and are shipped as source code with MOE.