


Protein Structure Validation and Analysis


K. Kelly
Chemical Computing Group Inc.


The stereochemical validation of model structures of proteins is an important part of the comparative molecular modeling process. Firstly, the selection of high quality structures for inclusion in loop dictionaries is important for the simple reason that these coordinate sets will be used to build future models. Secondly, the structural evaluation of comparative modeling output must be used to identify possible problematic regions.

Morris et al. (PROTEINS: Structure, Function, and Genetics 12:345-364, 1992) pose the following problem:

"Given only the coordinates, are there automated methods which can be used to assess the overall stereochemical quality of the structure? Such methods may prove useful for the identification of incorrectly interpreted structures, either during cycles of refinement, or when refinement reaches a premature standstill. Although most PDB files do contain some author comments about crystallographic refinement, these are usually brief, qualitative, unable to be machine-read, and provide no quantitative measure of the likely reliability of the submitted structure".

As there are now a few thousand high-resolution (2 Angstroms or better) protein coordinate sets derived by X-ray crystallography in the Brookhaven Protein Data Bank, statistical analysis of high quality structures can be used to evaluate a candidate model for stereochemical quality. By comparing structural measures of a model to the distribution of those quantities in the PDB one can obtain a measure of stereochemical quality.

Mosimann et al. (PROTEINS: Structure, Function, and Genetics 23:301-317, 1995), in their assessment of various comparative protein modeling submissions, conclude that

"If comparative molecular models are to be used to examine detailed differences in substrate, ligand, or receptor association, future comparative molecular modelers will have to improve the stereochemical quality of their models. This can easily be achieved by the standardized usage of some form of structure verification program ... Structure verification programs can aid the comparative molecular modeler by evaluating the stereochemical quality of proposed conformations in their models. In the submitted comparative molecular models, the "splice" points or segments containing large insertions or deletions are the regions that most often have residues with unlikely stereochemical parameters."

In fact, Mosimann et al. go on to say that

"We feel that any coordinates of comparative molecular models submitted for possible publication should be accompanied by the output from a structure verification program that will allow for an objective assessment of the stereochemical quality of the model."

There is some agreement about which measurements are good indicators of stereochemical quality; these include

  • planarity;
  • chirality;
  • phi/psi preferences;
  • chi angles;
  • non-bonded contact distances;
  • unsatisfied donors and acceptors.

The Protein Tools of MOE contain an SVL program that evaluates the stereochemical quality of a given protein coordinate set by measuring certain stereochemical parameters and comparing them to the distributions of these parameters found in the high-resolution entries of the PDB. The program can be used to evaluate PDB submissions as well as homology modeling output. When launched, the program displays a prompt window that is used to control the content of the output report.

The ASCII output report contains a number of tables and a summary of stereochemical quality. Depending on the options chosen, either a comprehensive report, a listing of statistical anomalies, or only the summary will be written to the report file.

When the report file is edited with a MOE text editor, the statistical outliers will be colored red.

The protein report that is generated is a compilation of the following stereochemical measurements.

Bond Lengths. In a good protein structure, the internal bond lengths should conform to known stereochemical values. We take this to mean that model bond lengths should lie within a few standard deviations of the mean bond length observed in the PDB. The report lists the lengths of the residue bonds N-CA, CA-CB, CA-C, C-O and C-N and flags statistical outliers.
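The "within a few standard deviations" test can be illustrated with a short Python sketch. It is not the SVL program itself. The function names, the cutoff of 3 standard deviations, and the reference mean and standard deviation in the usage lines are assumptions made for illustration; they are not PDB statistics.

```python
import math

def bond_length(a, b):
    """Distance in Angstroms between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)

def z_score(value, mean, sd):
    """Number of standard deviations separating value from the reference mean."""
    return (value - mean) / sd

def is_outlier(value, mean, sd, cutoff=3.0):
    """True if value lies more than `cutoff` standard deviations from the mean.
    The default cutoff of 3.0 is an assumption, not a value from the article."""
    return abs(z_score(value, mean, sd)) > cutoff

# Usage with made-up coordinates and placeholder reference statistics.
n_atom = (0.0, 0.0, 0.0)
ca_atom = (1.46, 0.0, 0.0)
length = bond_length(n_atom, ca_atom)
print(length, is_outlier(length, mean=1.46, sd=0.02))  # mean and sd are placeholders
```

The same z-score test applies to any quantity assumed to be normally distributed about a reference mean.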

Bond Angles. The generated protein report will list internal bond angles and detect statistical outliers under the assumption that the angles are normally distributed about the mean angles observed in the PDB. The report contains measurements of the angles C-N-CA, N-CA-C, N-CA-CB, CB-CA-C, CA-C-N, CA-C-O, and O-C-N.
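As an illustration of the measurement itself, a bond angle such as N-CA-C can be computed from three atomic positions with the dot product. The following Python sketch uses a hypothetical function name and is not taken from the MOE source.

```python
import math

def bond_angle(a, b, c):
    """Angle in degrees at atom b formed by the bonds b-a and b-c.
    Each atom is an (x, y, z) tuple."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))

# Usage with made-up coordinates: a right angle at the origin.
print(bond_angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)))
```

The resulting angle can then be compared with a reference mean and standard deviation in the same way as a bond length.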

Dihedral Angles. Perhaps the most revealing section of the protein report concerns itself with the internal dihedral angles. The phi, omega, psi, and chi1 angles are reported along with the more recent OC-CO dihedral. Furthermore, the Kabsch & Sander secondary structure assignment for the residue is listed.
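For readers who want to reproduce these quantities, a dihedral angle can be computed from four atomic positions. The Python sketch below uses the usual atan2 formulation and assumes the conventional backbone definitions (phi: C(i-1)-N-CA-C, psi: N-CA-C-N(i+1), omega: CA-C-N(i+1)-CA(i+1), chi1: N-CA-CB-XG). The function names are hypothetical, and nothing here is taken from the SVL program.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees, in (-180, 180], about the p1-p2 axis."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)   # normal to the plane p0, p1, p2
    n2 = _cross(b2, b3)   # normal to the plane p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

# Usage with made-up coordinates: a +90 degree twist about the z axis.
print(dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)))
```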

Statistical outliers are detected as follows:

  • Omega and Chi1. Deviations from perfect cis or trans conformations are highlighted, under the assumption of a Gaussian mixture for the omega angle.

  • Phi and Psi. A 10-degree grid was placed on the 2D square [-pi,pi] X [-pi,pi]. Each grid square was marked with one of four annotations: CORE, ALLOWED, GENEROUS and OUTSIDE, according to observed frequencies in high-resolution structures in the PDB. This annotated grid is then used to evaluate the phi and psi angles of a proposed model.
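The omega check in the first item above can be approximated in a few lines of Python. This is a simplification of a Gaussian mixture, not the method used by the program: each angle is assigned to the nearer of the two planar conformations (0 degrees for cis, 180 degrees for trans) and its deviation is compared with a multiple of a standard deviation. The function names and the cutoff of 3 are assumptions. The value 5.8 in the usage line is the expected standard deviation for trans omega quoted in the statistical summary of this article.

```python
def nearest_planar(omega_deg):
    """Return (label, deviation in degrees) for the planar conformation
    closest to omega: 'cis' (0 degrees) or 'trans' (180 degrees)."""
    w = (omega_deg + 180.0) % 360.0 - 180.0   # wrap into [-180, 180)
    dev_cis = abs(w)
    dev_trans = 180.0 - abs(w)
    if dev_cis < dev_trans:
        return ("cis", dev_cis)
    return ("trans", dev_trans)

def omega_flagged(omega_deg, sd, cutoff=3.0):
    """True if omega deviates from planarity by more than cutoff * sd.
    The cutoff of 3.0 is an assumption made for this sketch."""
    _, deviation = nearest_planar(omega_deg)
    return deviation > cutoff * sd

# Usage: 150 degrees is 30 degrees away from trans.
print(nearest_planar(150.0), omega_flagged(150.0, sd=5.8))
```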

The following is a color-coded depiction of the phi-psi map used to evaluate proposed models. The coding is 3: CORE (dark green), 2: ALLOWED (light green), 1: GENEROUS (yellow), 0: OUTSIDE (red).
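Looking up a (phi, psi) pair in such an annotated grid is a matter of binning both angles. The Python sketch below assumes a 36 x 36 grid of 10-degree bins covering [-180, 180) and the numeric codes listed above. The grid contents are a placeholder: every cell is OUTSIDE except one cell marked CORE for demonstration, whereas the real annotations come from PDB frequencies and are not reproduced here.

```python
CODES = {3: "CORE", 2: "ALLOWED", 1: "GENEROUS", 0: "OUTSIDE"}
BIN_DEG = 10
N_BINS = 360 // BIN_DEG   # 36 bins per axis

def bin_index(angle_deg):
    """Map an angle in degrees to a bin index 0..35 over [-180, 180)."""
    wrapped = (angle_deg + 180.0) % 360.0
    # The final modulo guards against a rounded value of exactly 360.0.
    return int(wrapped // BIN_DEG) % N_BINS

def classify(grid, phi_deg, psi_deg):
    """Return the annotation of the grid square containing (phi, psi)."""
    return CODES[grid[bin_index(phi_deg)][bin_index(psi_deg)]]

# Placeholder grid: all OUTSIDE, with a single demonstration cell set to CORE.
grid = [[0] * N_BINS for _ in range(N_BINS)]
grid[bin_index(-65.0)][bin_index(-40.0)] = 3   # assumption, for demonstration only

print(classify(grid, -62.0, -35.0))   # falls in the demonstration cell
print(classify(grid, 60.0, 60.0))     # falls in an unmarked cell
```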

Nonbonded Contacts. A complete analysis of all interatomic distances is conducted, and all contacts between atoms that are neither bonded nor hydrogen-bonded, and whose van der Waals radii overlap by more than a prescribed value, are identified and listed in the generated report.
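A minimal Python sketch of such a contact search follows. It is a brute-force loop over all atom pairs and makes several assumptions that are not in the article: the radii are the commonly tabulated Bondi values (the radii used by MOE may differ), the overlap threshold of 0.4 Angstroms is a placeholder for the "prescribed value", and pairs that are bonded or hydrogen-bonded are supplied by the caller in `excluded`.

```python
import math
from itertools import combinations

# Bondi van der Waals radii in Angstroms (assumption: MOE may use other values).
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def bad_contacts(atoms, excluded=frozenset(), max_overlap=0.4):
    """List atom pairs whose van der Waals spheres overlap by more than max_overlap.

    atoms    : list of (name, element, (x, y, z)) tuples
    excluded : set of frozenset({i, j}) index pairs to skip (bonded or
               hydrogen-bonded atoms), supplied by the caller
    Returns a list of (name_i, name_j, distance, overlap) tuples.
    """
    hits = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        if frozenset((i, j)) in excluded:
            continue
        distance = math.dist(a[2], b[2])
        overlap = VDW_RADIUS[a[1]] + VDW_RADIUS[b[1]] - distance
        if overlap > max_overlap:
            hits.append((a[0], b[0], round(distance, 2), round(overlap, 2)))
    return hits

# Usage with made-up coordinates.
atoms = [("CA", "C", (0.0, 0.0, 0.0)),
         ("O", "O", (2.0, 0.0, 0.0)),
         ("N", "N", (10.0, 0.0, 0.0))]
print(bad_contacts(atoms))
```

The all-pairs loop scales quadratically with the number of atoms; a spatial grid or cell list would be the usual refinement for whole proteins.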

Statistical Summary. At the end of the report a summary of several measures and their statistical significance is presented. The values in the proposed model are compared to expected values taken from high resolution structures in the PDB. The report is of the form

Parameter                Observed Mean  Observed S.D.  Expected Mean  Expected S.D.
trans omega                      179.5            2.8          180.0            5.8
C-alpha chirality                 33.1            3.4           33.8            4.2
chi 1 - gauche minus             -63.3           15.5          -66.7           15.0
chi 1 - gauche plus               61.8           16.6           64.1           15.7
chi 1 - trans                    184.0           20.5          183.6           16.8
helix phi                        -67.5           12.0          -65.3           11.9
helix psi                        -33.4           12.7          -39.4           11.3
chi 1 - pooled s.d.                  -           18.8              -           15.7
proline phi                      -67.1           10.6          -65.4           11.2
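One way to produce and read a row of such a summary is sketched below in Python. The sketch makes assumptions that the article does not state: angles are unwrapped to lie within 180 degrees of the expected mean before averaging (which would explain a trans chi 1 mean reported as 184.0 rather than -176.0), the sample standard deviation is used, and the shift of the observed mean is expressed in units of the expected standard deviation. The function names are hypothetical.

```python
import statistics

def wrap_near(angle_deg, center_deg):
    """Shift angle by multiples of 360 so it lies within 180 degrees of center."""
    return center_deg + ((angle_deg - center_deg + 180.0) % 360.0 - 180.0)

def summarize(values, expected_mean, expected_sd):
    """Return (observed mean, observed s.d., shift of the mean in expected s.d. units).
    Requires at least two values."""
    unwrapped = [wrap_near(v, expected_mean) for v in values]
    mean = statistics.fmean(unwrapped)
    sd = statistics.stdev(unwrapped)   # sample standard deviation (assumption)
    shift = (mean - expected_mean) / expected_sd
    return mean, sd, shift

# Usage with made-up omega angles around the trans conformation.
print(summarize([170.0, -170.0, 180.0], expected_mean=180.0, expected_sd=5.8))
```

Read this way, the helix psi row above shows an observed mean of -33.4 against an expected mean of -39.4 with an expected standard deviation of 11.3, a shift of roughly half a standard deviation.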

The Protein Stereochemical Report is an important tool for evaluating protein coordinate sets. In particular, it is an important part of the comparative protein modeling process. Because the program is written in SVL, the built-in programming language of MOE, it can be modified and enhanced as needed by MOE users. The integration into the MOE system makes the program concise and easily accessible.

For more information, please see