
Analysis of the Structurally Conserved Core of a Set of Proteins

Mike Soss
Chemical Computing Group Inc.



Introduction

With today's large and ever-expanding databanks of solved protein structures, structural analysis is becoming an increasingly powerful tool for making functional inferences about proteins. By analyzing related protein structures and homologous sequences for conserved features and differences, one can infer important geometric and contact information that might otherwise remain undetected when examining sequence alignments alone.

In this article, we show how fundamental insights into protein structure and function can easily be gained using two of the interactive protein investigation tools of Chemical Computing Group's Molecular Operating Environment:

  • Protein Consensus: MOE's structural core analysis tool, and
  • Protein Contacts: MOE's interactive tool for measuring and browsing structurally important protein contacts, such as hydrogen bonds or salt bridges.

For purposes of demonstration, we have chosen the X-ray structures of thirty-five related kinases from the Brookhaven Protein Data Bank (PDB). The proteins were selected by running MOE's PDB Search application against a non-redundant subset of high-resolution chains containing no chain breaks and at most a small number of missing sidechain atoms.

To analyze a group of related proteins, the primary, secondary and tertiary structure data for the entire family should be investigated. In this article, we will demonstrate how to find the conserved core of the above set of kinases and to locate their stabilizing contacts.

Identifying the Structurally Conserved Core

The first step in the structural analysis of a set of proteins is to align their amino acid sequences. In MOE, the protein alignment facility features a powerful and completely general group-to-group implementation of the Needleman-Wunsch dynamic programming paradigm for protein sequence alignment, coupled with a structural realignment engine.
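
To make the dynamic programming idea concrete, here is a minimal pairwise Needleman-Wunsch sketch in Python. It is not MOE's group-to-group implementation. The function name and the scoring values (match, mismatch, linear gap penalty) are illustrative assumptions; a realistic protein alignment would use a substitution matrix and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

A group-to-group alignment applies the same recurrence to columns of two existing alignments rather than to single residues, scoring each pair of columns instead of each pair of characters.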

Next, the conserved core is derived. With the conserved core in hand, the structural regions most likely to be essential to the function of the proteins can be examined both visually and computationally.

In MOE, the Protein Consensus tool is used to calculate the conserved core and to aid in visualization. The conserved core, defined on a given alignment of the protein structures, comprises the common structural regions of the protein, and can be specified using a number of criteria:

  • root mean squared distance (RMSD) of aligned atomic positions (mainchain or sidechain),
  • the similarity of residues in an alignment column,
  • the number of gaps in an alignment column.
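
The three per-column quantities listed above can be sketched in Python as follows. This is an illustration, not the Protein Consensus code: the function name and data layout are hypothetical, RMSD is taken here to be the root mean square distance of the superposed atoms from their centroid, and identity is taken to be the fraction of non-gap residues that match the most common residue. MOE's exact definitions may differ.

```python
from collections import Counter
from math import sqrt

GAP = "-"

def column_metrics(alignment, coords):
    """Compute gap count, identity and RMSD for each alignment column.

    alignment: list of equal-length gapped sequences.
    coords: coords[chain][col] is the (x, y, z) position of one superposed
            atom (e.g. C-alpha), or None where that chain has a gap.
    """
    metrics = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment if seq[col] != GAP]
        points = [chain[col] for chain in coords if chain[col] is not None]
        gaps = len(alignment) - len(residues)

        # Identity: share of the most common residue among non-gap residues.
        if residues:
            identity = Counter(residues).most_common(1)[0][1] / len(residues)
        else:
            identity = 0.0

        # RMSD: spread of the aligned atoms around their centroid.
        if points:
            n = len(points)
            centroid = [sum(p[k] for p in points) / n for k in range(3)]
            sq = sum((p[k] - centroid[k]) ** 2 for p in points for k in range(3))
            rmsd = sqrt(sq / n)
        else:
            rmsd = None

        metrics.append({"gaps": gaps, "identity": identity, "rmsd": rmsd})
    return metrics
```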

The Protein Consensus panel provides interactive control over the definition of the consensus set. For example, in the picture below, the core is defined to be the regions of the proteins where, in each alignment column:

  • There are no more than 5 gaps in the column.
  • The RMSD of the mainchain atoms is between 0 and 1 angstrom.
  • At least 20% of the residues in the column are identical.
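
Expressed as a filter, the three criteria above might look like the following Python sketch. The function names and the dictionary keys (`gaps`, `rmsd`, `identity`) are hypothetical; only the threshold values come from the example in the text.

```python
def core_columns(metrics, max_gaps=5, max_rmsd=1.0, min_identity=0.20):
    """Return indices of alignment columns that satisfy all three criteria.

    metrics: one dict per column with keys 'gaps', 'rmsd' and 'identity'.
    """
    return [i for i, m in enumerate(metrics)
            if m["gaps"] <= max_gaps
            and m["rmsd"] is not None and 0.0 <= m["rmsd"] <= max_rmsd
            and m["identity"] >= min_identity]

def as_regions(columns):
    """Group sorted column indices into inclusive (start, end) runs."""
    regions = []
    for c in columns:
        if regions and c == regions[-1][1] + 1:
            regions[-1] = (regions[-1][0], c)  # extend the current run
        else:
            regions.append((c, c))             # start a new run
    return regions
```

Grouping the selected columns into runs gives the contiguous regions that make up the core, which is usually a more convenient unit for display than individual columns.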

Visualization options allow the consensus set to be dynamically displayed in both the MOE Window and the Sequence Editor. In the images below, the backbones of the chains in the conserved core of the kinases are colored by RMSD (blue is low, red is high) on the left, and by sequence variability on the right. The residues belonging to the core are also colored in the Sequence Editor according to sequence variability.

Left: Coloring by RMSD. Right: Coloring by Sequence Variability.

The conserved core can be used to improve the alignment of the proteins. By weighting the conserved regions more heavily, a better superposition and a correspondingly better alignment can be obtained. In this way, the alignment and the conserved core can be iteratively refined.

In addition to looking at regions of consensus, it is useful to investigate the regions of difference, in particular, the structural outliers of a set of proteins. Protein Consensus will automatically construct dendrograms of the system, mapping structure and sequence proximity relationships among the proteins. Below is the dendrogram for the set of kinases, using atomic RMSD in the conserved core as the proximity metric. We see that the conserved core of chain 35 is distant from the rest of the system; in subsequent analysis we may wish to exclude it.
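
A dendrogram of this kind can be built by agglomerative clustering of a pairwise distance matrix. The Python sketch below uses average linkage, which is an assumption made for illustration; the article does not state which linkage rule Protein Consensus uses, and the function name is hypothetical.

```python
def average_linkage(dist, labels):
    """Build a merge tree from a symmetric distance matrix.

    dist[i][j] is the distance (e.g. core RMSD) between chains i and j.
    Returns nested tuples (left, right, height); leaves are the labels.
    """
    # Each cluster is (subtree, list of member indices).
    clusters = [(labels[i], [i]) for i in range(len(labels))]

    def mean_distance(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters under average linkage.
        height, i, j = min(
            (mean_distance(clusters[i][1], clusters[j][1]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters)))
        merged = ((clusters[i][0], clusters[j][0], height),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

A structural outlier shows up as a single leaf that joins the tree only at the final, highest merge.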

Contact Analysis

Contact analysis is an important step in understanding protein structure. Distant contacts conserved across a family of proteins often reflect functionally important regions of conserved tertiary structure, such as active sites or areas that contribute to the stability of the structure.

Protein Contacts in MOE is an interactive tool for finding inter-residue hydrogen bonds, ionic contacts (salt bridges), disulfide bonds and "hydrophobic contacts". Contacts within protein chains, between chains, or between proteins and ligands can be detected. Contacts can also be inferred between residues of sequence-only chains based on their positions in the alignment with respect to contacting residues of homologues. Once found, contacts can be browsed and isolated in both the Sequence Editor and the main MOE window.

To facilitate the identification of conserved contacts in the structural core, we can restrict contact analysis to only the structurally conserved regions as identified by Protein Consensus. In our example, we will further examine only those contacts which are conserved in at least twenty-five of the thirty-five kinases. (That is, if a contact is reported between residues A and B, then at least twenty-four of the other chains also have a similar contact between their two residues in the same columns of the alignment.)
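
The conservation filter described above can be sketched in Python by identifying each contact with the pair of alignment columns it connects and counting how many chains share that pair. The function name and input layout are hypothetical, and treating two contacts as "similar" whenever they join the same two columns is a simplifying assumption.

```python
from collections import Counter

def conserved_contacts(contacts_by_chain, min_chains):
    """Keep contacts found in at least min_chains chains.

    contacts_by_chain: {chain_id: iterable of (col_a, col_b) column pairs}.
    Returns {(col_a, col_b): number of chains}, with col_a <= col_b.
    """
    counts = Counter()
    for pairs in contacts_by_chain.values():
        # Normalise the order and de-duplicate so that each chain
        # contributes at most one vote per column pair.
        counts.update({tuple(sorted(p)) for p in pairs})
    return {pair: n for pair, n in counts.items() if n >= min_chains}
```

For the example in the text, this would be called with `min_chains=25` on the contacts of all thirty-five kinases.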

The image above shows how the Protein Contacts panel displays chain and residue information for the detected contacts. The rightmost column of the window details the contact networks. A contact network is a connected group of contacts, for example, where residue A contacts residue B, B contacts C, and C contacts both D and E. Such a network forms a set of stabilizing contacts in a spatial region of the protein. The two largest hydrophobic networks in the system were examined further.
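
A contact network as defined here is a connected component of the graph whose nodes are residues and whose edges are contacts. The following Python sketch (hypothetical function name, not the Protein Contacts source) finds these components with a simple graph traversal.

```python
from collections import defaultdict

def contact_networks(contacts):
    """Group residues into networks (connected components) of contacts.

    contacts: iterable of (residue_a, residue_b) pairs.
    Returns a list of residue sets, largest network first.
    """
    neighbours = defaultdict(set)
    for a, b in contacts:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, networks = set(), []
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first walk collects every residue reachable from start.
        stack, component = [start], set()
        while stack:
            residue = stack.pop()
            if residue in component:
                continue
            component.add(residue)
            stack.extend(neighbours[residue] - component)
        seen |= component
        networks.append(component)
    return sorted(networks, key=len, reverse=True)
```

With the contacts A-B, B-C, C-D and C-E from the example above, all five residues fall into a single network.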

Contacts belonging to the largest hydrophobic network in the system selected for visualization.

In the image below, a single kinase has been isolated for display, along with the atoms belonging to the two largest hydrophobic networks, colored orange and pink. The network comprising the orange atoms appears to contribute to the stability of the helices, while the network of pink atoms appears to stabilize the beta-sheets in the background. For clarity, more of the backbone than just that of the original conserved core is shown; for purposes of display, the definition of the conserved core was expanded to include mainchain atoms with RMSD less than 1.5 angstroms.


Summary

We have demonstrated the use of two of MOE's protein analysis tools for the structural analysis of a set of thirty-five kinase proteins.

The Protein Consensus tool allows for the quick identification and isolation of the structurally conserved regions of a set of aligned proteins. The results can be visualized in both the Sequence Editor and the MOE window. A dendrogram provides further information about the protein structural relationships.

Protein Contacts identifies various types of contacts within (and among) the protein chains. We examined the larger hydrophobic networks to see how the contacts stabilize the conserved core.

The process of multiple sequence and structure alignment, structural core identification, and feature analysis (e.g., contact analysis) is at the heart of protein structure analysis. MOE's analysis tools offer quick, easy access to powerful underlying computations. The tools are all written in MOE's built-in programming language, SVL, and are shipped as source code with MOE.