Probabilistic Receptor PotentialsPaul Labute
Chemical Computing Group Inc.
Given the three-dimensional structure of a receptor, it is often of interest to determine the preferred locations of potential ligand atoms. For example, knowing the locations of hydrogen bonding ligand atoms or of hydrophobic regions in a ligand can aid in the design of novel compounds that bind to a particular receptor. Forcefield-based approaches such as [Goodford 1985] can provide a good indication of regions of strong van der Waals attraction; however, it is more difficult to use electrostatics terms to classify regions as hydrophilic or hydrophobic. For example, carboxylate groups are hydrophobic above and below the sp2 plane and hydrophilic in the plane. Although one could argue that the hydrophobic preference is represented by the lack of an electrostatic potential in the region perpendicular to the dipole, such distinctions are not readily visible in potential energy contour plots. Statistics-based approaches such as [Mugge 1999] or [Bruno 1997] are attempts at determining preferred locations of bound atoms from experimental data. In principle, the (representative) training set provides the preference information without the need for physical functional forms: histograms and other raw-data methods are used to map preferences. Unfortunately, these methods are not amenable to optimization since (in many cases) the statistics of individual interactions are not smooth functions and many examples are required before continuity can be achieved.
In this article, we describe a methodology for calculating receptor preferences that is both knowledge-based (in the sense that experimental data is used) and fitted (in the sense that analytical functional forms are used to describe the experimental statistics). After describing the methodology, we show examples of the resulting preference maps for amino acid residues and for an X-ray ligand-protein complex. Finally, we suggest a number of future applications of the methodology.
Suppose that we are given a three-dimensional molecular structure made up of individual atoms each with a receptor atom type (to be defined later). We are interested in predicting the preferred positions of atoms that are both in hypothetical bound partners and in contact with the given structure. We assume that bound partners are made up of atoms each of a ligand atom type (to be defined later). Let S denote a random variable over molecular structures (assumed to be in a common frame of reference), X a random point over three-dimensional space, T a random variable over the receptor atom types, and L a random variable over the ligand atom types. In this work, we will consider methods to determine and apply the joint probability density
to the problem of predicting the preferred locations of ligand atoms of different types. One can think of Eqn.(1) as the likelihood of observing a ligand atom of type l at location x in contact with a structure s. Note that this probability density includes the receptor atom types implicitly (through S). In particular, the conditional density
is the probability of observing a ligand atom of type l assuming that a ligand atom is located at position x in contact with structure s. If s is a protein structure then Eqn.(2) provides the means to compare the structure's preference for different ligand types at a particular point in space; for example, with two ligand atom types "hydrophobic" and "hydrophilic", Eqn.(2) can be used to map the high probability hydrophobic and hydrophilic contact regions in a protein's active site.
Our intention is to approximate Eqn.(1) and Eqn.(2) using one-dimensional analytical functions whose parameters are estimated from a collection of X-ray crystallographic structures (the Protein Data Bank (PDB)). We will choose atom types and functional forms intended to be useful for predicting hydrophobic and hydrophilic atom types.
Contact Distributions. The first approximation is to assume that each structure s is composed of atoms each of a receptor atom type and that, further, the statistics gathered on each of the receptor atom types can be composed to approximate the distribution over the entire structure. More precisely, we assume that
where c is a normalization constant, i ranges over the atoms of s, yi are the coordinates of atom i in the structure, and Out(x,l,s) denotes the condition that the atom of type l at location x is exterior to the structure s. The Pr(L = l | T = ti) term is estimated straightforwardly using the relative frequencies of contacts in a training set. Eqn.(3) is the fundamental approximation that reduces the estimation of densities about entire structures to the estimation of densities about particular primary (receptor) atom types t. In other words, we compose probability densities over structures s from probability densities of the form
We approximate Eqn.(4) by representing x in a local spherical-type coordinate system about the structure atom i located at yi that is invariant to rotations and translations, namely
For a given primary atom located at x0 and an interacting atom located at x different from x0, we take
where u and v are local coordinate directions (described below). If x is equal to x0, we set r = a = p = 0. The quantities can be thought of as the interaction distance r (Å), the lone-pair interaction angle a (degrees), and the out-of-plane angle p (degrees). The vectors u and v in Eqn.(6) depend only on the hybridization (HYB) and heavy atom coordination (HAV) of the primary atom or its neighbors and are calculated as follows (atoms participating in conjugated single bonds are considered sp2):
The following diagram depicts the u and v directions for a few example atoms (marked by "A"). The left diagram shows the case of an sp3 atom with HAV = 3. In this case, v is (0,0,0). In the middle diagram, atom A is sp2 with HAV = 2. In the right diagram, A is sp3 with HAV = 1 and is bonded to a pi system.
With sufficient training data, the three-dimensional probability density of Eqn.(5) can be estimated with a histogram; however, for many type combinations there is insufficient available structural data to accurately estimate the required histogram. For this reason, we assume independence of r, a, and p and approximate Eqn.(5) with a product of three one-dimensional densities:
where R denotes a random variable over the positive real numbers, A denotes a random variable over the interval [0,180], and P denotes a random variable over the interval [0,90]. Inspections of contact statistics revealed that the data were well described by a small number of independent analytical functional forms on r, a, and p, the interaction distance, lone-pair interaction angle, and out-of-plane angle, respectively. The collection of functional forms includes the gamma, log normal, Cauchy, beta and Lennard-Jones densities. Each probability density has a small number of parameters which must be estimated from sampled structure data.
The following procedure is used to estimate the parameters used in the approximation given in Eqn.(3):
In this way, we have composed the joint density of Eqn.(1) with a collection of one-dimensional probability densities of the form given in the right-hand side of Eqn.(7) each of which has parameters estimated from X-ray crystallographic data.
Contact Criteria. When are two atoms in non-bonded contact? This is a surprisingly subtle question. Broadly speaking, two atoms are in contact if they are "close together" with no other atoms "in between" and are not in bonded contact. It is the quantification of these notions that is nontrivial, especially if no a priori radii are assumed for the atoms. Atoms related by a bond or a bond angle are clearly in bonded contact and so we can dispense with these cases. Atoms related by a torsion tend to be constrained by bonded forces and so we can dispense with these cases as well. Consider, now, the definition of "close together".
The set of all atoms at a distance less than or equal to 4.5 Å from a given atom a generally leads to the inclusion of the first "layer" of atoms (if any) surrounding a (independent of element type) and the exclusion of the second "layer". Smaller values of the cutoff often exclude clear non-bonded contacts (e.g., some methyl-methyl contacts). Larger values include more second layer atoms which must be filtered by other means. What seems certain is that no single cutoff value can capture the notion of non-bonded contact. Even a cutoff value of 4.5 can include the second layer of atoms: if the atom a is a nitrogen in contact with a carbonyl oxygen at typical hydrogen bond distances then the carbonyl carbon will be within the cutoff even if the a-C-O angle is 180 degrees. It is tempting to use a different cutoff for each element; however, inspection of X-ray crystal structures revealed that these parameters would have to be finely tuned and even then would not eliminate either the exclusion of obvious contacts or the inclusion of second layer atoms.
A "line of sight" test appears to be a good method for eliminating any second layer atoms; that is, eliminate any contact atom pairs in which the line joining the atoms intersects other atoms. However, atomic radii are required to apply this test. The choice of radii is not an obvious one; for example, van der Waals radii tend to be too large and eliminate too many contacts and it is not clear that van der Waals radii (whatever set is used) are appropriate choices. A safe line of sight test can be applied with conservatively small artificial radii as long as not too many second layer atoms are included by the distance cutoff criterion. One way of obviating the need for a line of sight test is to consider only pairs of atoms which are mutually closest pairs; unfortunately, statistics gathered based on such a definition would contain a bias since only the closest of contacts would be retained. After much experimentation and inspection of X-ray crystal structures, we settled on the following definition of atomic non-bonded contact:
Atom B is said to be in non-bonded contact with atom A if all of the following conditions are satisfied: a) the distance between A and B is less than 4.5 Å; b) the line joining A and B does not intersect any of the bonded neighbors of either A or B using a value of 1.0 Å for the radius of each neighbor; c) B is closer to A than to any other atom not related to B by a bond, angle or torsion; and d) A and B are not related by a bond, angle or torsion.
The choices of 4.5 as distance cutoff and of 1.0 as radius in the line of sight test tend to work well together in that it is rare for a clear non-bonded contact to be excluded and for cases such as second layer carbonyl carbons to be included. We do not believe that our results are sensitive to these choices since the derived distance histograms taper smoothly to zero within the cutoff and a radius of 1.0 is very conservative for all elements except hydrogen (which is excluded from our statistics). The nearest neighbor criterion was judged necessary for cases such as the following. Suppose B is in contact with A and C according to the first two tests in the definitions and further suppose that the angle B-A-C is 100 degrees with A equidistant from B and C. In such a case we consider that B is in contact with A and not C (because it is closer to A than to C). We assume that any ties (e.g., equilateral triangles) or noise (e.g., almost equilateral triangles) will not adversely affect the statistics. It is worth noting that although we have introduced a nearest neighbor criterion, it is not the same as using mutually nearest neighbors: the fact that B contacts A does not stop other atoms from being judged in contact with A; hence, the statistics gathered for A using this definition will not have a strong bias introduced by the nearest neighbor criterion.
Receptor Atom Types. In order to estimate reasonably accurate analytical forms for the distribution of atom interactions, it was necessary to subdivide the atom pair interactions into classes. Both the primary (receptor) atom and the ligand atom were classified into types, and for each combination of primary atom type and ligand atom type, separate statistics were gathered. In other words, we estimated a collection of probability densities each of the form Pr(R=r, A=a, P=p | T1=I, T2=J), the likelihood of observing (r,a,p) assuming that the primary atom is of type I and the ligand atom is of type J. After initial inspection of contact statistics in high resolution protein structures, we settled upon the collection of primary atom types defined in the following table (u indicates whether the atom has a u direction and v indicates whether the atom has a v direction in the local coordinate system):
Ligand Atom Types. Ligand atoms were divided into one of two classes: LPA and HYD. The LPA type consisted of nitrogen and oxygen atoms capable of participating in hydrogen bonds or metal interactions, and the HYD type consisted of carbon and sulfur (hydrophobic atoms). The ligand atom types are invariant to protonation state and tautomer state.
The methodology described above was applied to all X-ray structures in the PDB with a resolution of 2.0 angstroms or better. Thus, the training set consisted of over 4,000 protein and nucleic acid structures. The following diagram depicts the typical quality of the fitted functional forms. The top plot contains information regarding the interatomic distance (r), the lone-pair interaction angle (a) and the out-of-plane angle (p) described above. The red curves depict interactions with the LPA ligand type (e.g., a donor or acceptor) and the green curves, interactions with the HYD ligand type. The dotted curves are the histograms of the experimental (training) data while the solid curves are the fitted analytical probability distributions. The receptor atom type is T_nQ1, which is an sp2 nitrogen with one heavy neighbor.
In general, log normal and 12-6 distance distributions describe the distance data well. Beta, exponential and Cauchy distributions describe both the lone-pair interaction angle and the out-of-plane angle data well. The following plots are for the CH2 atom type (sp3 carbon with two heavy neighbors):
If we compose the fitted distributions according to Eqn.(2), we obtain a preference probability map for each receptor atom type. In the following maps, the green surface is a 90% probability isocontour for the HYD (hydrophobic) type, and the red surface, a 90% probability isocontour for the LPA (donor/acceptor) type.
The following images are preference maps (at 95% probability) using Eqn.(2) for amino acid residues containing hydrophilic sp2 atoms. The strong directional preferences are clearly visible.
The following image is of the active site of streptavidin (the biotin receptor). Eqn.(2) was used to produce the preference map (green is HYD at 92% and red is LPA at 92%).
We can see that the natural ligand (biotin) fits the preference maps extremely well:
We have presented a method to calculate the preferred locations of potential ligand atoms when given the 3D coordinates of a receptor. The method is based upon probabilistic potentials using experimental coordinates as a training set. The method presented here has several advantages over related efforts. Firstly, heavy atoms alone are used; this means that tautomer and protonation state issues need not be resolved prior to receptor analysis. Secondly, the preference probabilities are just that: probabilities; the numbers are always in the range [0,1] and are transferable between different receptors (in contrast to energy-based methods).
Future work includes applying the method for:
All of the software was implemented in the SVL programming language in MOE and is available in MOE 2001.01.