Journal Articles

An Integrated Application in MOE for the Visualization and Analysis of Protein Active Sites with Molecular Surfaces, Contact Statistics and Electrostatic Maps

Paul Labute
October 2006
Chemical Computing Group Inc.


The combination of widespread availability of x-ray crystal structure coordinates of proteins, possibly with bound ligands, and affordable 3D computer graphics hardware has made it possible for scientists to routinely visualize such structures in their efforts to design better inhibitors and drugs. Not so long ago, such visualizations used to be available only to computational chemistry specialists whose services would be required to operate the complex software that, more often than not, was difficult to use and ran only on expensive computer hardware. Moreover, such software was typically designed for molecular simulations and the primary intent of the visualization capabilities was to view the results of the calculations.

Increasingly, non-computational chemists are using visualization software to help in their scientific efforts. To such users, the primary use of visualization software is to "see what is going on" or "get ideas" rather than examine the output of a simulation or similar calculation. This is not to say that no calculations are performed for visualization purposes, but that such calculations are expected to be "under the hood" and not the focus of attention. Consequently, simple rotation of proteins and ligands as line/stick figures must be augmented with visualization capabilities that help with the interpretation of the atomic coordinates. A good example of this idea is the re-entrant molecular surface originally proposed by [Lee 1971] and further developed in [Richards; 1977]. The surface consists of the boundary of the volume of space defined by removing all points inside the van der Waals surface of a spherical probe (that represents the solvent) located at all positions for which the probe does not contact the van der Waals volume of a molecule. This means that there is "fill-in" of space between the van der Waals volume representing solvent inaccessible regions:

The van der Waals surface of two Argon atoms separated by 4.5 Angstroms. Note the small space between the atoms that is inaccessible to a sphere of radius 1.4 Angstroms (representing water). The re-entrant surface of two Argon atoms separated by 4.5 Angstroms. Note that the space between the atoms is now interior to the surface indicating that it is inaccessible to a sphere of radius 1.4 Angstroms.

It is an unfortunate fact that the definition of the re-entrant surface is such that it is computationally demanding to calculate. Since the first implementation described in [Connolly 1983] there have been ever more efficient methods that have appeared over the years (e.g., [Sanner 1996], [Liang 1998] and [Rocchia 2002]). What is important is that as computing power becomes cheaper what was once a complex calculation requiring special hardware now can be done in a few seconds on just about any modern cheap computer. Such a surface is a great help in visualizing ligands in protein active sites. Compare the two images below, one uses an atomic stick figure representation and the other uses the re-entrant surface:

A stick representation of a ligand bound to Factor Xa (PDB 1FAX).
Note the difficulty of seeing how the active site is filled.
A surface used to visualze a ligand bound to Factor Xa.
Note that it is much easer to see how the ligand fills the active site.

One can certainly see the advantages of using these surfaces and, indeed, such surface visualizations are commonly used to interpret crystal structcure coordinates. One disadvantage is that a surface obscures the underlying protein atoms; however, surface transparency and coloring are often used to convey important interactions possibly obscured by the surface. Another disadvantage of using the re-entrant surfaces is that it provides little volumetric information - it can be difficult to estimate what modifications to a ligand would fit the active site.

The figure on the right shows the same Factor Xa complex with an additional surface rendered using lines. This surface is the Interaction surface defined to be the 0 kcal/mol van der Waals energy contour of a carbon atom interacting with the active site. In other words, the surface contains all points at which the van der Waals energy becomes positive (repulsive). This surface is much closer to the nucleii of possible ligands and better illustrates how much space is available for ligand modifications.

Generally speaking, the Interaction surface is more crowded in appearance than the re-entrant surface. However, when used in combination, these two surfaces provide a quick way to see the extent and shape of an active site as well as an estimate of the steric limits to which substituents will be subject. What is still not clear are the interactions favored by the active site - what modifications would lead to better interactions.

In this article we will examine a number of techniques that can be used to determine and display favorable interactions with active sites. These methods are a part of a single application in the Molecular Operating Environment (MOE) [CCG 1996]. By displaying interaction sites, scientists may be better able to optimize ligand features to either avoid unfavorable interactions or introduce new favorable interactions. Specifically, we will present techniques to:

  1. Predict favorable locations of electrically positive, neutral and negative atoms by displaying the results of an electrostatic field calculation that mimics mobile atoms.

  2. Predict favorable locations of hydrophobic and hydrophilic ligand atoms in an active site by displaying statistics gathered from crystal structures.

  3. Color code molecular surfaces to reflect the underlying chemical interactions of the active site.

The methodology to be described is sophisticated; however, in each case the calculation can be performed quickly, without user intervention, and the results are easily understood. This means that users can derive benefit without getting slowed down by arcane details and parameterizations.

The remainder of this article is divided into sections that describe the methdology of each technique as well as an example of its use. We conclude with several observations in the final section.

Electrostatic Maps

Electrostatic Maps in MOE are based on solutions to the Poisson-Boltzmann Equation (PBE). The PBE is a partial differential equation often used in the study of electrolites. Although there is a rich history of using the PBE for studying proteins (e.g., [Gilson 1987], [Sharp 1990] and [Grant 2001]) and calculating free energies of solvation we shall review its underlying theory and contrast it with the GRID methodology [Goodford 1985] which is also based upon electrostatic fields.

Poisson-Boltzmann Equation Theory. The starting point for the PBE is the Poisson equation. This is a fundamental equation of physics that relates the electrostatic potential φ and a charge density ρ at equilibrium:

Given an electrostatic potential φ one can calculate the underlying charge density ρ that gives rise to φ with the Laplacian operator d2/dx2 + d2/dy2 + d2/dz2 applied to φ. Given a charge density ρ one can calculate the electrostatic field φ by solving a partial differential equation. In the special case of a collection of point charges, φ is the familiar Coulomb potential.

If the charge density ρ is embedded in a polar medium then the electrostatic potential will be damped, or screened, in accordance with the polar character of the medium. Such strength is typically specified by a dielectric constant ε. This strength may be a function of spatial coordinates (e.g., ε = 1 in the interior of a protein and ε = 80 in the exterior to model a protein in water). A modification to the Poisson equation accounts for this dielectric "field":

In the special case of point charges and a constant ε then the Poisson equation produces Coulomb's law divided by the dielectric constant. What is important about the Poisson equation is that an electrostatic potential φ completely and uniquely determines the underlying charge density ρ through the Laplacian operator. In the case of a dielectric ε we can write the modified Poisson equation as

which shows that a solution to the modified Poisson equation can be thought of as the solution to the original Poisson equation but with a modified charge density. The difference between the modified charge density and the original density ρ is often referred to as the induced charge or the screening charge. This way of looking at the relationship between induced charges and the original charges is the key to understanding how the Poisson-Boltzmann equation arises.

Suppose that a charge density ρ is embedded in a polar medium specified by ε. Suppose, further, that in the medium there are a number of mobile ions each carring a formal charge qI. At thermodynamic equilibrium these mobile ions will be distributed about ρ according to a charge density which we shall denote by ρI. The electrostatic potential, φ, from the combined charge densities must satisfy the Poisson equation:

For multiple types of mobile ions, the foregoing equation is easily extended by replacing the lone ρI with a sum of similar densities (one for each type). For the purposes of exposition, only one mobile ion type will be considered. If we imagine that the mobile ions are spatially restricted (e.g., the ions cannot be inside a protein and must remain in solvent) then such a restriction can be specified with a spatial potential uI that has high energy in the forbidden regions of space and low energy in the allowed regions of space. Taken with the electrostatic potential φ the potential energy of a mobile ion at a particular point in space will be, at equilibrium, qI φ + uI. We now assume that the ion location density ρI follows a Boltzmann distribution and so the ion charge density is its formal charge multiplied by the probability it will be located at a particular point in space.

where c is normalization constant, k is Boltzmann's constant and T is the temperature of the system. Assuming that φ and uI are 0 sufficiently far from ρ then at such distances the exponential term is equal to 1 which means that c may be taken as the bulk concentration of the mobile ions (say in mol/L). The combination of the previous two equations results in the Poisson-Boltzmann Equation; namely that,

which specifies not only the resulting electrostatic potential, but also the ionic charge densities both of which result from an original charge density ρ in a polar medium ε. The PBE can be solved analytically only in special cases (e.g., in Debye theory for electrolites). For more complex cases, such as proteins, the PBE must be solved numerically. This is most commonly done by approximating each spatial quantity (such as ε, φ or ρ) by a discrete lattice (3D grid) and using finite differences to approximate the differentiation operators.

Calculation of Free Energies of Solvation. The calculation of the free energy of solvation (in water) of small molecules can be performed by solving the PBE. Such calculations are used to calibrate the parameters used in a PBE solution in the hope that the results are transferrable to proteins. Typically, the non-linear Boltzmann term is not used; rather, the dielectric region ε is used alone to model the polar solvent. In the solute interior (the molecule in question) a small value of ε is used (1 or 2) while in the solute exterior a high value of ε is used (e.g., 80 for water). The electrostatic energy of the system is calculated and the gas-phase energy is subtracted (as calculated with ε = 1 everywhere) to obtain the free energy of solvation. Commonly, an exposed surface area term is added to model non-classical effects (such as the hydrophobicity of halogens).

The entire calculation is dependent on how ε is discretized and modeled in the calculation. Not only are the results sensitive to the choice of partial charges, they are extremely senstive to the curvature of the transition region in ε between high dielectric and low dielectric. Although, much effort has been spent to define this region appropriately [Grant 2001] little attention has been paid to the non-linear Boltzmann term upon which MOE Electrostatic Maps are fundamentally based.

It is not commonly known that free energies of solvation can be calculated without the definition or use of the ε dielectric region at all! The correlation plot to the right shows the results of calculating the free energy of solvation of 495 small molecules using the PBE and AM1-BCC charges [Jakalian 2002] without a dielectric region. The r2 is 0.88. In this calculation, the non-linear PBE was solved with two pseudo-ion classes: an oxygen "ion" and a hydrogen "ion" along with the commonly used surface area term. TIP3P charges were for the {qI} formal ion charges at 1 mol/L concentrations and AMBER [Weiner 1986] van der Waals parameters were used to determine the accessible region of these "ions". Such a definition is not as arbitrary as the ε regions commonly used - it was not fitted to the solvation data - and has a more direct physical interpretation.

As can be seen readily, the implicit ion Boltzman term can effectively model free energies of solvation of small molecules - arguably to within the same accuracy as ε-based calculations. This evidence suggests that the resuting Boltzmann densities of the implicit ions (resulting from the solution of the PBE equation) would carry meaning. Moreover, these densities are 3D quantities and are amenable to vizualization via energy contour levels.

Electrostatic Maps Methodology. We will use the same ideas as presented above in the calculation of free energies of solvation using the non-linear Boltzmann term of the PBE. That is, we will calculate the Boltzmann probability density of two classes of particle - an oxygen particle, O and a hydrogen particle, H when given the 3D coordinates of a protein along with their partial charges. Specifically, we solve the following equation for the electrostatic potential φ:

which is the PBE with constant ε = 1, 1 mol/L ionic concentrations and two ion classes; ρ is the protein charge density approximated using a sum-of-Gaussians definitions (the width proprotional to the OPLS-AA [Jorgensen 1996] van der Waals radii); the spatial potentials uO and uH are van der Waals potentials of an O atom and the protein and an H atom and the protein using OPLS-AA parameters. The formal charge paramteres qO and qH are TIP3P water partial charges. MOE users a Preconditioned Conjugate Residual solver with a Multi-Grid preconditioner along with Newton's Method to solve the non-linear PBE on a lattice. The calculation requires several seconds to converge on a modern PC.

The result of the calculation is a lattice with values of φ, the equilibrium electrostatic potential of the protein charge density φ and the Boltzmann distributed ion densities for O and H species. Commonly, φ is displayed graphically as a molecular surface or as iso-contours in space. However, we are interested in the resulting O and H densities in the non-linear Boltzmann terms of the PBE. These are the quantities that we will plot. One could simply plot the non-linear terms directly which would be a plot of the ionic charge densities at equilibrium. However, it is more convenient to plot the potential to which the ionic species are subject. Specifically, we plot qO φ + uO and qH φ + uH which is a sum of the van der Waals and electrostatic potentials. Note the similarity to molecular field type methods; the difference, however, is that these methods use a Coulomb-type electrostatic potential, not an equilibrated PBE electrostatic potential. The hydrophobic map is derived from the O and H maps by calculating and plotting - (qO + qH) φ + uC where uC is the van der Waals potential for a carbon atom. Hydrophobes exists where O and H atoms are rare (in the more neutral regions). This quantity is related to the desolvation energy of a receptor in a localized region of space - hydrophobes prefer easily desolvated regions of space and the solvent is modeled by the O and H ionic densities.

Results and Applications. Fundamentaly, Electrostatic Maps are plots of van der Waals and electrostatic energies and, as such, they are similar in spirit to the GRID methodology [Goodford 1985]. However, there are important differences not only with the original GRID but with subsequent enhancements made to model solvation effects. These differences are related to the non-linearity and self-consistency of the Electrostatic Map fields - only one calculation is done so that each implicit ionic particle interacts (in a mean field sort of way) with the other implicit ionic particles; that is, each particle's final Boltzmann density depends on the density of the others and all depend on the solute charge density.

This non-linearity and self consistency utimately manifests itself in a more localized distribution of implicit ion densities. Approximate, or even exact ε = 80 solvation models can fail to adequately screen chemical groups leading to electrostatic domination of a region. The reason for this is that there is a non-linear screening effect in buffered solution that cannot be modeled by linear ε dielectric models. A simple example will serve to illustrate this point:

Electrostatic with van der Waals contour plots of a small molecule calculated by solving the PBE with ε = 80 solvent screening model. Positive preference is indicated in blue and negative preference is indicated in red. Notice that the carboxylate (indicated with the white arrow) dominates the entire region and no red contour is visible. Electrostatic with van der Waals contour plots of a small molecule calculated wby solving the non-linear PBE with ε = 1 throughout and "O" and "H" pseudo-ionic species. Notice that the screening effects are much stronger and that more localized positive (blue) and negative (red) preferences are visible (white arrows).

In ε = 80 calculations (or approximations thereto), charged groups are not sufficiently screened by solvent effects. This is due to the linear screening of the dielectric region. Charged groups in aqueous environments are neutralized by moblie counter-ions that provide non-linear screening effects and must be modeled using non-linear Boltzmann term of the PBE. Moreover, charged groups in systems of biological importance exist in buffered solution. Consequently, implicit solvent models must take these into account. The two images above show clearly the difference between linear and non-linear screening. The MOE Electrostatic Map methodology (on the right) inherently will neutralize the carboxylate in the upper right by modeling implicit counter-ions thus allowing the negative preference in the lower left to appear (in contrast to the image on the left). Non-linear screening produces more localized Boltzman densities than simple VDW+Coulomb or VDW+Coulomb+ε maps (such as produced by GRID or other traditional molecular field calculations). In addition, traditional calculations are not self-consistent - each contour, whether positive or negative, is calculated separately so that neutralization effects are not taken into account.

The Electrostatic Map methodology can be applied, of course, to structures of biological interest. The following images are contours of Electrostatic Maps of several protein amino acid sidechains using AMBER charges at ca. -2 kcal/mol contour levels for positive and negative preferences.

Aside from detailing the expected polar hydrogen bonding preferences one can see the preference for positive charge above and below aromatic groups (blue surface). This is due to the concentration of negative charges around the heavy atoms that make up the aromatic rings. This negative region will generally be in contact with the mildly positive hydrogen of an otherwise hydrophobic group (such as -CH2-). These images should be compared to the probabilistic potentials described below which support this assertion.

The Electrostatic Maps can be calculated for receptors (without ligands). Consider the Factor Xa complex in PDB entry 1FAX. The following is a MOE calculated Protein-Ligand schematic of key interactions. In the diagram below hydrophilic residues are purple, blue rings indicate basic groups, red rings indicate acidic groups, hydrophobic, residues are green, arrows are hydrogen bonds, blue discs on ligand atoms indicate solvent exposure when bound, light blue shadows indicate contact surface area):

The key interactions are S1 pocket interactions (on the left) consisting of the salt bridge between the ligand amidinium group on the napthalene ring and ASP 189 and the hydrophobic interactions between the napthalene ring and the sandwich of the backbone atoms of TRP 215 / GLY 216 on one side and GLN 192 on the other; and in the P pocket (on the right) consisting of the salt bridge between GLU 97 and the CNN+ structure on the five membered ring of the ligand and the hydrophobic interactions with the sidechains of TYR 99 / TRP 215 / PHE 174 and the said five membered ring.

An Electrostatic Map for 1FAX with positive, negative and neutral contours was calculated without the ligand. AMBER charges were used for all of the atoms. This map is depicted below along with a re-entrant surface and pocket labels.

An Electrostatic Map of Factor Xa calculated from the receptor structure alone.
Notice the localized features especially in the P pocket. The S1 pocket is somewhat buried.

In the above image, positive preference is indicated in blue, negative in red and neutral in white. The contour levels are ca. -2 kcal/mol. The following image is that of the Electrostatic Map superimposed with the ligand of 1FAX and the receptor structure and surface hidden to help visualize the correspondence between the prediction and the ligand features.

The Electrostatic Map for Factor Xa superimposed with the 1FAX ligand.
The top left inset provides an alternative view of the P pocket map and the bottom right inset provides an alternative view of the S1 pocket map.

This image shows the detail provided by the Electrostatic Map and the very good agreement between the map features and the key interactions. Notice in particular that the carboxylates of ASP 189 and GLU 97 are screened by the strong positive implicit ions (the blue contour). This positive ionic density effectively neutralizes the carboxylate groups and makes it possible for large neutral regions to exist in the P and especially the S1 pockets. The connecting linker group has no strong electrostatic interactions with the receptor and is largely exposed to solvent.

The Electrostatic Map can be a useful tool for predicting electrostatic features in a protein active site. The inputs are the 3D atomic coordinates of the receptor (hydrogen atoms included) as well as the partial charges of the protein. These maps are especially useful in strongly ionic environments where screening effects play an important role. The self-consistent ionic density calculation reflects an implied "ideal neutralizing ligand" from the electrostatic perspective. The contour levels are on a transferrable and realistic scale making protein comparisons possible. While the basis is a rather complex electrostatic model, the user need not be concerned with the details. What little parameteriztion there is was independently developed, validated with free energy of solvation experimental data and successfully applied to predicting protein-ligand interactions. There is no need for the user to modify the parameters save for setting the energy levels for the contour plots.

Contact Preferences

The amount of available protein crystallographic data in the Protein Data Bank (PDB) is large enough that it can be subjected to a purely statistical analysis of non-bonded contacts. Such an analysis could then be used in a "predictive" manner by displaying known contact statistics from the PDB in the reference frame of novel structures. In this way, a scientist could visualize what contacts are common in the PDB and what contacts are rare. Thus, the "prediction" is merely a summary of known data rather than the results of a simulation. This is an important distinction from predictions that result from simulations.

MOE contains a unique methodology [Labute 2000] for the determinination and display of the statistics of non-bonded contacts. The method is similar in spirit to the X-Site program [Laskowski 1998], the SuperStar program [Nissink 2000] and the methodology of [Rantanen 2001] all of which are knowledge-based techniques (fitted distributions from crystallographic data). The purpose of the MOE Contact Preferences is to calculate, from the 3D atomic coordinates of a receptor, preferred locations for hydrophobic and hydrophilic ligand atoms. In particular, directional preferences of receptor atoms are taken into account and, moreover, hydrogen atoms and partial charges are not required for the calculation. Fundamentally, Contact Preference maps are contours of probability densities indicating a percentage likelihood of a protein non-bonded contact with a particular ligand atom type.

Suppose that we are given a three dimensional molecular structure of a receptor made up of individual atoms each with a receptor atom type (to be defined later). We are interested in predicting the preferred positions of atoms in hypothetical bound partners in contact with the given structure. We assume that bound partners are made up of atoms each of a ligand atom type (to be defined later). Let S denote a random variable over molecular structures (assumed to be in a common frame of reference), X denote a random point over three dimensional space, T a random variable over the receptor atom types, and L a random variable over the ligand atom types. In this work we will consider methods to determine and apply the joint probability density

Pr (L = l, X = x, S = s) (1)

to the problem of the prediction of preferred ligand type locations. One can think of Eqn.(1) as the likelihood of observing a ligand atom of type l at location x in contact with a structure s. Note that this probability density includes the receptor atom types implicitly (through S). In particular, the conditional probability density

Pr (L = l | X = x, S = s) (2)

is the probability of observing a ligand atom of type l assuming that a ligand atom is located at position x in contact with structure s. If s is a protein structure then Eqn.(2) provides the means to compare the structure's preference for different ligand types at particular point in space; for example, with two ligand atom types "hydrophobic" and "hydrophilic", Eqn.(2) can be used to map the high probability hydrophobic and hydrophilic contact regions in a protein's active site since it is a probability distribution over a discrete collection of types (in this case, two). The values are percentages of likelihood of contact (e.g., 65% of the time there is a hydrophobic atom at that point in space). Since such an assessment is made for every point, it is natural to display the maps in the form of a 3D graphical contour surface at a given probability.

Our intention is to approximate Eqn.(1) and Eqn.(2) for each atom in the receptor using one-dimensional analytical functions whose parameters are estimated from a collection of X-ray crystallographic structures (the PDB). Bayes Theorem is then used to put these one dimensional distributions into the form of Eqn.(2). We will choose atom types and functional forms intended to be useful for predicting hydrophobic and hydrophilic atom types.

Contact Distributions. The first approximation is to assume that each structure s is composed of atoms each having a receptor atom type and that, further, the statistics gathered on each of the receptor atom types can be composed to approximate the distribution over the entire structure. More precisely, we assume that

Pr (X = x, L = l | S = s) = c⋅maxi { Pr (X = x | L = l, Ti = ti, yi, Out(x,l,s)) ⋅ Pr (L = l | Ti = ti) } (3)

where c is a normalization constant, i ranges over the atoms of s, yi are the coordinates of atom i in the structure, and Out(x,l,s) denotes the condition that the atom type l at location x is exterior to the structure s. The Pr(L = l | T = ti) term is estimated straightforwardly using relative frequencies of contacts in a training set. Eqn.(3) is the fundamental approximation that reduces the estimation of densities about entire structures to the estimation of densities about particular primary atom types t. In other words, we compose probability densities over structures s from probability densities of the form Pr (X = x | L = l, T = t, yi) which we approximate by representing x in a local spherical-type coordinate system about the structure atom i located at yi that is invariant to rotations and translations, namely

Pr (X = (r,a,p) | L = l, T = t) (4)

For a given primary atom located at x0 and an interacting atom located at x different from x0, we take


where u and v are local coordinate directions (described below). The quantities can be thought of as the interaction distance r (Å), the lone-pair interaction angle a (degrees) and the out-of-plane angle p (degrees). The vectors u and v in Eqn.(5) depend only on the hybridization (HYB) and heavy atom coordination (HAV) of the primary atom or its neighbors. The following is a schematic show how the u and v directions are defined (the blue arrows); the green atom is the receptor atom for to the directions the atoms apply and the white atoms are nearby bonded atoms.

With sufficient training data, the three-dimensional probability density of Eqn.(4) can be estimated with a histogram; however, for many type combinations there is insufficient structural data available to accurately estimate the required histogram. For this reason, we assume independence of r, a, and p and approximate Eqn.(4) with a product of three one-dimensional densities:

Pr(X = (r,a,p) | L = l, T = t) = Pr(R | l, t) ⋅ Pr(A | l, t) ⋅ Pr(P | l, t) (6)

where R denotes a random variable over the positive real numbers, A denotes a random variable over the interval [0,180] and P denotes a random variable over the interval [0,90]. Inspections of contact statistics revealed that the data were well described by a small number of independent analytical functional forms on r, a and p, the interaction distance, angle and out-of-plane angle, respectively. The collection of functional forms includes the gamma, log normal, Cauchy, beta and Lennard-Jones densities. Each probability density has a small number of parameters which must be estimated from sampled structure data.

The following procedure is used to estimate the parameters used in the approximation of Eqn.(3):

  1. For each combination of receptor atom type t and ligand atom type l, collect from the training set all atom pairs i and j in contact and for which the receptor type of i is t and the ligand type of j is l.

  2. For each such contact pair, use Eqn.(5) to calculate r, a and p (taking atom i as the receptor atom). This calculation produces three sets of samples {rk}, {ak} and {pk} each of size mtl.

  3. Use the three sets of samples {rk}, {ak} and {pk} to estimate the parameters of an analytical probability density chosen for the type combination (t,l).

  4. Estimate Pr(L = l | T = t) from Eqn.(3) as (mtl + 1) / (mt + nl) where nl is the number of ligand atom types and mt is the sum of the mtl determined in Step 2 over all ligand atom types l.

In this way, we have composed the joint density of Eqn.(1) with a collection of one-dimensional probability densities of the form given in the right-hand side of Eqn.(6) each of which has parameters estimated from X-ray crystallographic data.

Contact Criteria. When are two atoms in non-bonded contact? This is a surprisingly subtle question. Broadly speaking, two atoms are in contact if they are "close together" with no other atoms "in between" and are not in bonded contact. It is the quantification of these notions that is nontrivial, especially if no a priori radii are assumed for the atoms. After much experimentation and inspection of X-ray crystal structures, we settled on the following definition of atomic non-bonded contact:

Atom B is said to be in non-bonded contact with atom A if all of the following conditions are satisfied: a) the distance between A and B is less than 4.5 Å; b) the line joining A and B does not intersect any of the bonded neighbors of either A or B using a value of 1.0 for the radius of each neighbor; c) B is closer to A than to any other atom not related to B by a bond, angle or torsion; and d) A and B are not related by a bond, angle or torsion.

The choices of 4.5 as distance cutoff and of 1.0 as radius in the line of sight test tend to work well together in that it is rare for a clear non-bonded contact to be excluded and for cases such as second layer carbonyl carbons to be included. Additionally, the derived distance histograms taper smoothly to zero within the cutoff.

Atom Types. In order to estimate reasonably accurate analytical forms for the distribution of atom interactions, it was necessary to subdivide the atom pair interactions into classes. Both the primary atom and the ligand atom were classified into types, and for each combination of primary atom type and ligand atom type, separate statistics were gathered. In other words, we estimated a collection of probability densities each of the form Pr(R=r, A=a, P=p | T1=I, T2=J), the likelihood of observing (r,a,p) assuming that the primary atom is of type I and the ligand atom is of type J. After initial inspection of contact statistics in high resolution protein structures, we settled upon the collection of primary atom types defined in the following table (u indicates whether the atom has a u direction and v indicates whether the atom has a v direction in the local coordinate system):

Type u v Description
T_CH1 yes no sp3 carbon with 1 hydrogen
T_CH2 yes yes sp3 carbon with 2 hydrogens
T_CH3 yes no sp3 carbon with 3 hydrogens
T_cQ3 no yes sp2 carbon with 3 heavy neighbors
T_cQ2 yes yes sp2 carbon with 2 heavy neighbors
T_NQ1 yes no sp3 nitrogen with 1 heavy neighbor
T_NQ2 yes yes sp3 nitrogen with 2 heavy neighbors
T_NQ3 yes no sp3 nitrogen with 3 heavy neighbors
T_nQ1 yes yes sp2 nitrogen with 1 heavy neighbor
T_nQ2 yes yes sp2 nitrogen with 2 heavy neighbors
T_nQ3 no yes sp2 nitrogen with 3 heavy neighbors
T_o1 yes yes sp2 oxygen with 1 heavy neighbor
T_O1 yes yes sp3 oxygen not bonded to pi system with 1 heavy neighbor
T_O1i yes yes sp3 oxygen bonded to pi system with 1 heavy neighbor
T_O2 yes yes any oxygen with 2 heavy neighbors
T_OW yes yes any oxygen with 0 heavy neighbors (water)
T_SQ1 yes no sulfur with 1 heavy neighbor
T_SQ2 yes yes sulfur with 2 heavy neighbors
T_F1 yes no fluorine with 1 heavy neighbor
T_Cl1 yes no chlorine with 1 heavy neighbor
T_Br1 yes no bromine with 1 heavy neighbor
T_I1 yes no iodine with 1 heavy neighbor
T_TM no no transition metal

Ligand atoms were divided into one of two classes: LPA and HYD. The LPA type consisted of nitrogen and oxygen atoms capable of participating in hydrogen bonds or metal interactions and the HYD type consisted of carbon and sulfur (hydrophobic atoms). The ligand atom types are largely invariant to protonation state and tautomer state.

Training Data and Results. Non-bonded contact pairs of atoms were extracted from the Protein Data Bank as of December 31, 2004. Approximately 4,000 x-ray structures with resolution 2.0 Angstroms or better were used. Non-bonded contact pairs between backbone atoms and beta carbons were omitted since it was felt they would introduce a strong bias in the statistics. The majority of contact pairs involved sidechain to sidechain contacts.

A typical set of statistics for an atom type is depicted below. The diagram shows the results for the type T_nQ1 which is a planar nitrogen with one heavy neighbor. The contact frequecies are displayed at the top in percent (i.e., 44% percent of the time a hydrophobic atom contacted the T_nQ1 atom and 56% of the time an LPA, or hydrophilic atom was in contact).

The three plots show the probability densities for the (r,a,p) coordinates. The dotted lines show the raw histograms and the solid lines show the results of the analyitical fits. On the right there is a statement of the analytical functional forms and sketches of some chemical contexts of the T_nQ1 atom type. Also, there is pictorial representation of the Contact Preference map for the T_nQ1 atom at 95% preference level. Hydrophobic contacts are green (above the plane) and hydrophilic contacts are red (in the plane). Similar contact preferences are given below for a number of receptor atom types:

One can clearly see that the methodology has captured the directional preferences of the atoms. Contact Preference maps of more complex molecules are composed from the individual atomic preferences. By way of example, below are depictions of the Contact Preferences (at ca. 95% probability) for a few protein side chains.

As expected, in the above planar side chains, the hydrophobic contacts are above the pi system plane and the hydrophilic contacts are largely in the plane. Somewhat surprisingly, there is strong out-of-plane hydrophobic character for pi systems even on top of hetero atoms. The guanidinium group of ARG is a particulary nice example of this show the very strong hydrophobic character of the guandidinium group above the plane.

The Contact Preference maps can be calculated for receptors (without ligands). Consider the Rolipram/PDE complex in the PDB 1RO6 entry. The following is a MOE calculated Protein-Ligand schematic of key interactions. In the diagram below hydrophilic residues are purple, blue rings indicate basic groups, red rings indicate acidic groups, hydrophobic, residues are green, arrows are hydrogen bonds, blue discs on ligand atoms indicate solvent exposure when bound, light blue shadows indicate contact surface area):

The following two pictures are of the PDB 1RO6 protein-ligand complex with the Rolipram ligand with Contact Preference maps displayed.

The re-entrant molecular surface and the Contact Preference map for Rolipram in the 1RO6 complex. The green contour denotes hydrophobic preference and the purple denotes hydrophilic preference. The Contact Preference map calculated solely from the 1RO6 receptor displayed on the Rolipram ligand with the receptor and surface hidden. The contour probability levels are both 92%.

In the above right image, one can clearly see the successful prediction (from the receptor structure alone) the hydrophic preferences satisified by the benzene / ILE 410 and the cyclopentyl / PHE 446 interactions. Also, one can clearly see that Rolipram satisifies the important hydrophilic interactions with the carbonyl oxygen and HIS 234 in the upper right and the oxygens in the catechol group with GLN 443 in the lower left.

The Contact Preference distributions are also useful for color coding molecular surfaces. Color coding a surface by underlying nearest receptor atom type often hides the directional preferences of those atoms. We can use the analytic distributions to smoothly color a molecular surface to encode these receptor directional preferences.

On the right, one can see the 1RO6 complex with a re-entrant molecular surface. The color coding of the surface was performed without using the rolipram ligand. Green regions are hydrophobic, purple are hydrophilic and the blue regions are mildly polar in-pi-plane contacts (e.g., with the hydrogens of benzene).

A closer examination reveals the good agreement with the purple region close to the catechol group and the green hydrophobic regions above the plane of the benzene ring and around the cyclopentyl group (all in the lower left). One can also see the agreement with the T_OQ2 contact statistics in that this type has hydrophobic character out of its bonding plane. It is also not difficult to see that both the receptor and ligand's mildly polar regions (the in-pi-plane aromatic) are exposed to solvent and if one were to include the ligand in the surface coloring, the blue regions of both the ligand and receptor would be exposed to the solvent.

It is important to note that since the same contact distributions are used, statistical contact information is consistently displayed whether encoded in surface coloring or rendered as volumetric regions in the Contact Preference maps. This is important since active sites are often crowded and difficult to visualize in 3D. The surface color coding captures many of the interaction types while leaving the active site volume available for further visualizations.

It should be stressed that the Contact Prefence calculations, whether for surface encodings or for volumetric rendering, are fast. It requires only a few seconds to calculate and display the maps. Because of this, they need not be stored - they can be calculated on demand - whether for one receptor or a family of receptors (for selectivity studies). Finally, although the methodology is sophisticated, users need not be concerned with the precise details. Rather, the results can be summarized as a percent likelihood of a particular hydrophobic or hydrophilic region derived from PDB non-bonded contact statistics. There are no tuning parameters to set, protons need not be present and the results are on a consistent scale independent of the receptor.


We have presented a number of methods for predicting favorable locations of ligand features in protein active sites:

  • Electrostatic Maps that predict the electrostatically preferred locations of positive, negative and neutral donor locations.
  • Contact Preference preference maps that predict the preferred locations of hydrophobic and hydrophilic locations.
  • Color coded molecular surfaces that show receptor directional preferences for hydrophilic and hydrophobic contacts.

Each of the calculations can be effected in a few seconds and the results displayed with commonly available 3D graphics capabilities of most computers. Although the underlying methodology is complex, users need not be concerned with the precise details nor need they be concerned with the setting of parameters. In MOE these capabilities are integrated into a single application called Surfaces and Maps with three different pages, or views (allowing users to quickly select the surface mapping method of interest):

After specifying the receptor atoms and extent of the plot, users need only press OK and then manipulate the contour level sliders to visualize the results. Consequently, such maps can be routinely used by all scientists.


[CCG 1996] The Molecular Operating Environment (MOE) available under license from Chemical Computing Group Inc., 1010 Sherbrooke St. W. Suite 910, Montreal, Quebec, Canada H3A 2R7;
[Connolly 1983] Connolly, M.L.; Solvent-accessible surfaces of proteins and nucleic acids; Science 221 (1983) 709-713.
[Gilson 1988] Gilson, M., Sharp, K.A., Honig, B.; Calculating the Electrostatic Potential of Molecules in Solution: Method and Error Assessment; J. Comput. Chem. 9 (1988) 327-335.
[Goodford 1985] Goodford, P.J.; A Computational Procedure for Determining Energetically Favorable Binding Sites on Biologically Important Macromolecules; J. Med. Chem. 28(7) (1985) 849-856
[Grant 2001] Grant, J.A., Pickup, B.T., Nicholls, A.; A Smooth Permittivity Function for Poisson-Boltzmann Solvation Methods; J. Comput. Chem. 22(6) (2001) 608-640.
[Jakalian 2002] Jakalian, A.; Jack, D.B.; Bayly, C.I.; Fast, Efficient Generation of High-Quality Atomic Charges. AM1-BCC Model: II. Parameterization and Validation. J. Comput. Chem. 23 (2002) 1623-1641.
[Jorgensen 1996] Jorgensen W.L., Maxwell, D.S., Tirado-Rives, J.; Development and Testing of the OPLS All-Atom Force Field on Conformational Energetics and Properties of Organic Liquids; J. Am. Chem. Soc. 117 (1996) 11225-11236.
[Labute 2000] Labute, P.; Analytic Fitting of Non-bonded Contact Statistics of the PDB; The Molecular Operating Environment (2000)
[Laskowski 1998] Laskowski, R.A., Thornton, J.M., Humblet, C., Singh, J.; X-SITE: Use of Empirically Derived Atomic Packing Preferences to Identify Favorable Interaction Regions in the Binding Sites of Proteins; J. Mol. Biol. 259 (1998) 175-201.
[Lee 1971] Lee, B., Richards, F.M.; The Interpretation of Protein Structures: Estimation of Static Accessibility; J. Mol. Biol. 55 (1971) 379-400.
[Liang 1998] Liang, J., Edelsbrunner, H., Fu, P., Sudhakar, P., Subramanian, S.; Analytical Shape Computation of Macromolecules: I. Molecular Area and Volume Through Alpha Shape; Proteins 33 (1998) 1-17.
[Nissink 2000] Nissink, J.W.M., Verdonk, M.L., Klebe, G.; Simple Knowledge-Based Descriptors to Predict Protein-Ligand Interactions. Methodology and Validation; J. Comput. Aided Mol. Des. 14 (2000), 787-803.
[Rantanen 1996] Rantanen V., Denessiouk, K.A., Gyllenberg, M., Koski, T., Johnson, M.S.; A Fragment Library Based on Gaussian Mixtures Predicting Favorable Molecular Interactions; J. Mol. Biol. 313 (2001) 197-214.
[Richards 1977] Richards, F.M.; Areas, volumes, packing, and protein structures; Annu. Rev. Biophys. Bioeng. 6 (1977) 151-176.
[Rocchia 2002] Rocchia, W., Sridharan, S., Nicholls, A., Alexov, E., Chiabrera, A., Honig, B.; Rapid Grid-Based Construction of the Molecular Surface and the Use of Induced Surface Charge to Calculate Reaction Field Energies: Applications to the Molecular Systems and Geometric Objects; J. Comput. Chem. 23 (2002) 128-137.
[Sanner 1996] Sanner, M., Olson, A., Spehner, J.; Reduced Surface: an Efficient Way to Compute Molecular Surfaces; Biopolymers 38 (1996) 305-320.
[Sharp 1990] Sharp, K.A., Honig, B.; Electrostatic Interactions in Macromolecules: Theory and Applications; Annu. Rev. Biophys. Biophys. Chem. 19 (1990) 301-332.
[Weiner 1987] Weiner, S.J., Kollman, P.A., Nguyen, D.T., Case, D.A; An All Atom Force Field for Simulations of Proteins and Nucleic Acids; J. Comp. Chem. 7 (1986) 230.