
A Widely Applicable Set Of Descriptors


Paul Labute
Chemical Computing Group Inc.
1010 Sherbrooke Street W, Suite 910; Montreal, Quebec; Canada H3A 2R7


Abstract. Three sets of molecular descriptors computable from connection table information are defined. These descriptors are based upon atomic contributions to van der Waals surface area, log P (octanol/water), molar refractivity and partial charge. The descriptors are applied to the construction of QSAR/QSPR models for boiling point, vapor pressure, free energy of solvation in water, solubility in water, thrombin/trypsin/factor Xa activity, blood-brain barrier permeability and compound classification. The wide applicability of these descriptors suggests uses in QSAR/QSPR, combinatorial library design and molecular diversity work.



The pioneering work of Hansch [1] and Leo [2] was an attempt to describe biological phenomena in a "language" consisting of a small set of physical molecular properties, in particular, logP (octanol/water), pKa, and molar refractivity. Early QSAR efforts centered on deriving linear regression [3] relationships between such descriptors and biological activity. Not surprisingly, these models (first order relationships among few descriptors) were limited to analogue series. Subsequent efforts sought to increase the applicability of linear models by introducing more [4-8] (and more) descriptors. As the number of descriptors available increased, methods were required to automatically select the appropriate descriptors from a large pool [9], often several hundred. Unfortunately, as epidemiologists are well aware, as the number of descriptors increases so too does the likelihood of finding chance relationships in the data. This has led to an increased reliance on validation methods to identify spurious models (e.g., leave-one-out or k-fold cross-validation). Of course, it is assumed that such validation methods can indeed identify spurious models in the first place (a fact which cannot be proved). The situation is even more confusing with combined variable selection and model validation methods such as CART [10]: the introduction of validation into the regression determination procedure may destroy the significance (if any) of the validation. Notwithstanding the use of complicated regression methods and descriptor selection procedures, the consistent production of effective QSAR models remains elusive. For this reason, we have chosen to return to the thinking of Hansch and Leo: by fixing a relatively small set of descriptors for use in many (hopefully all) situations we can, perhaps, a) reduce the problems of variable selection, and b) consistently produce meaningful QSAR models.
Moreover, we will attempt to stay true to the Hansch and Leo concepts in order to make direct use of these thought-out descriptors.

The idea of using a fixed collection of descriptors in QSAR is related to the definition of a "chemistry space" for use in molecular diversity studies. In such work, a compound is mapped to a k-dimensional vector which is used as a surrogate when comparing compounds. Validating a chemistry space can be difficult, especially when it is proposed for diversity analysis. Two different chemistry spaces will generally induce two different notions of diversity. Unless one has a reference diversity metric for comparison, it will not be clear which space is better. Often, cluster analysis is used to justify a chemistry space: if compounds with similar biological behavior cluster together in the proposed space then it seems reasonable to conclude that the chemistry space is good. Unfortunately, these results often depend on the choice of clustering algorithm. An alternative to cluster-based justification is QSAR-based justification: if a collection of descriptors can be used to construct reasonable models of many properties of interest, using many modeling techniques, it seems reasonable to conclude that the chemistry space induced by the descriptors is meaningful for diversity analysis. Indeed, this has been the case with other efforts to construct meaningful chemistry spaces. For example, the BCUT [11] or WHIM [12] descriptor collections were designed for one purpose (e.g., diversity) but quickly found application in the other [13,14], (e.g., QSAR) and vice versa. In the present approach, we will adopt the QSAR-based justification of the chemistry space induced by the descriptors defined herein.

This document is divided into several sections. In the Methods section we will define three collections of descriptors. In the subsequent sections we apply these descriptors to the prediction of several physical and biological properties and describe further applications of the descriptors. We draw conclusions in the final section.


Approximate Surface Area. The surface area of an atom in a molecule is the amount of surface area of that atom not contained in any other atom of the molecule (see figure below). If we take the shape of each atom to be a sphere with radius equal to the van der Waals radius we obtain the van der Waals surface area (VSA) for each atom. The sum of the VSA of each atom gives the molecular VSA.

Consider two spheres A and B with radii r and s, respectively, and centers separated by distance d. The amount of surface area of sphere A not contained in sphere B, denoted by VA, is given by

The case of more than two spheres is more complicated since a portion of sphere A may be contained in several other spheres. However, we will neglect this complication (in the hope that the error introduced will not be large). Thus, we approximate the VSA for sphere A with n neighboring spheres Bi with radii si and at distances di as

where the generalized delta function adopts a value of 1 if the condition is satisfied and 0 otherwise. This formula is similar to the pair-wise approximations used in approximate overlap volume calculations [15] and approximate surface area calculations for generalized Born implicit solvent models [16]. Consider, now, a molecule of n atoms each with van der Waals radius Ri and let Bi denote the set of all atoms bonded to atom i. We will neglect the effect of atoms not related by a bond and define the VSA for atom i, denoted by Vi, to be

where bij is the ideal bond length between atoms i and j. Thus, the VSA for each atom can be calculated from connection table information alone assuming a dictionary of van der Waals radii and ideal bond lengths. In the present work the radii are derived from MMFF94 [17] with certain modifications for polar hydrogen atoms. In the results to follow, the ideal bond length bij between atoms i and j was calculated according to the formula bij = sij - oij where sij is a reference bond length derived from MMFF94 parameters that depends on the two elements involved and oij is a correction that depends on the bond order: 0 for single, 0.1 for aromatic, 0.2 for double and 0.3 for triple. Finally, the approximate VSA for an entire molecule is just the sum of the Vi for each atom i in the molecule.
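For reference, the three surface-area expressions described in this passage can be reconstructed from spherical-cap geometry. The block below is a sketch under the stated assumptions (pairwise overlaps only, then bonded neighbors only), written in the notation of the text. The precise condition inside the generalized delta function is our assumption (partial overlap of the two spheres), not a quotation of the original formula.

```latex
% Cap of sphere A (radius r) inside sphere B (radius s), centers d apart:
% cap height h = r - (d^2 + r^2 - s^2)/(2d), cap area = 2 \pi r h.
V_A =
\begin{cases}
0 & \text{if } d \le s - r \quad (\text{A inside B}),\\[4pt]
4\pi r^2 & \text{if } d \ge r + s \ \text{or}\ d \le r - s,\\[4pt]
4\pi r^2 - \dfrac{\pi r \left[ s^2 - (d - r)^2 \right]}{d} & \text{otherwise.}
\end{cases}

% Pairwise approximation for n neighboring spheres B_i (radii s_i, distances d_i):
V_A \approx 4\pi r^2 - \pi r \sum_{i=1}^{n}
  \frac{s_i^2 - (d_i - r)^2}{d_i}\,
  \delta\!\left( |r - s_i| < d_i < r + s_i \right)

% Atom i with van der Waals radius R_i, bonded set B_i, ideal bond lengths b_{ij}:
V_i = 4\pi R_i^2 - \pi R_i \sum_{j \in B_i}
  \frac{R_j^2 - (b_{ij} - R_i)^2}{b_{ij}}\,
  \delta\!\left( |R_i - R_j| < b_{ij} < R_i + R_j \right)
```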

To test the accuracy of the approximate VSA calculation, a database of 1,947 small organic molecules was assembled. The molecular weights fell in the range (300,1600). For each molecule a 3D extended conformation was calculated using the 2D to 3D converter of MOE [18] version 1999.05. The MMFF94 parameter set was used to energy minimize the structures to an RMS gradient less than 0.001. Using the radii from Table 1 the molecular VSA was calculated using a dot-based method: each atom was surrounded with a large number of points on its surface. Points inside any other atom were eliminated and the number of remaining points was used to estimate the exposed surface area. Conformational analysis of several randomly chosen flexible molecules revealed that the dot-based van der Waals surface areas of individual conformations differed by less than 2%. From this observation, we decided that it was reasonable to compare the approximate VSA to the dot-based surface area of a single extended conformation of each molecule. A scatter plot of the results is shown in the figure at right. The squared correlation coefficient was r2 = 0.9666 and the relative error was less than 10%. Most of the errors occurred for the larger molecules and in molecules with many atoms in fused ring systems. No systematic error appeared to be present (other than the increase in error with molecular weight) and it was concluded that the approximate VSA calculation was reasonably accurate. We then made the inference that the individual contributions to the approximate VSA were also reasonably accurate.
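The dot-based reference calculation described above can be sketched as follows. This is a minimal illustration, not the MOE implementation: the point distribution (a golden-spiral lattice), the function names and the default number of dots are all our assumptions.

```python
import math

def sphere_points(n):
    """Return n roughly evenly spaced unit vectors (golden-spiral lattice)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for k in range(n):
        z = 1.0 - (2.0 * k + 1.0) / n
        rho = math.sqrt(max(0.0, 1.0 - z * z))
        pts.append((rho * math.cos(golden * k), rho * math.sin(golden * k), z))
    return pts

def dot_surface_area(atoms, n_dots=2000):
    """atoms: list of (x, y, z, radius). Returns the exposed area of each atom."""
    unit = sphere_points(n_dots)
    areas = []
    for i, (xi, yi, zi, ri) in enumerate(atoms):
        exposed = 0
        for ux, uy, uz in unit:
            # Place the dot on the surface of atom i.
            px, py, pz = xi + ri * ux, yi + ri * uy, zi + ri * uz
            # A dot is eliminated if it lies inside any other atom.
            buried = any(
                (px - xj) ** 2 + (py - yj) ** 2 + (pz - zj) ** 2 < rj * rj
                for j, (xj, yj, zj, rj) in enumerate(atoms)
                if j != i
            )
            if not buried:
                exposed += 1
        # Each surviving dot represents an equal share of the sphere surface.
        areas.append(4.0 * math.pi * ri * ri * exposed / n_dots)
    return areas
```

The molecular value is the sum of the returned list. Unlike the approximate VSA, this calculation needs 3D coordinates and its cost grows with the number of dots per atom.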

Thus, we have defined Vi, the contribution of atom i to the approximate VSA of a molecule. This contribution is reasonably accurate and has the advantage that it can be calculated much more rapidly than a 3D VSA contribution and without a 3D conformation (just connection table information). The approximate molecular VSA is very much a 2.5D descriptor: it is (highly correlated to) a conformation independent 3D property that requires only 2D connection information.
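A minimal sketch of the connection-table calculation of Vi follows. The bond-order corrections are those quoted in the text; the radii and reference bond lengths are placeholder textbook-style values chosen for illustration, not the MMFF94-derived parameters used in this work. The overlap test and the clamp at zero are our assumptions.

```python
import math

# Placeholder parameters (assumptions, NOT the MMFF94-derived values of the paper).
VDW_RADIUS = {"C": 1.70, "H": 1.20, "O": 1.52, "N": 1.55}
REF_BOND = {("C", "C"): 1.54, ("C", "H"): 1.09, ("C", "O"): 1.43,
            ("C", "N"): 1.47, ("H", "O"): 0.96, ("H", "N"): 1.01}
# Bond-order corrections o_ij as given in the text.
ORDER_CORRECTION = {"single": 0.0, "aromatic": 0.1, "double": 0.2, "triple": 0.3}

def ideal_bond_length(e1, e2, order):
    """b_ij = s_ij - o_ij for the two elements and the bond order."""
    key = tuple(sorted((e1, e2)))
    return REF_BOND[key] - ORDER_CORRECTION[order]

def atomic_vsa(elements, bonds):
    """elements: list of element symbols; bonds: list of (i, j, order).

    Returns the approximate van der Waals surface area V_i of every atom.
    """
    neighbors = {i: [] for i in range(len(elements))}
    for i, j, order in bonds:
        b = ideal_bond_length(elements[i], elements[j], order)
        neighbors[i].append((j, b))
        neighbors[j].append((i, b))
    vsa = []
    for i, el in enumerate(elements):
        ri = VDW_RADIUS[el]
        area = 4.0 * math.pi * ri * ri
        for j, b in neighbors[i]:
            rj = VDW_RADIUS[elements[j]]
            # Subtract the spherical cap buried in the bonded neighbor,
            # but only when the two spheres partially overlap.
            if abs(ri - rj) < b < ri + rj:
                area -= math.pi * ri * (rj * rj - (b - ri) ** 2) / b
        # Clamp at zero (our safeguard for heavily substituted atoms).
        vsa.append(max(area, 0.0))
    return vsa

# Usage: methane, one carbon bonded to four hydrogens.
methane = atomic_vsa(["C", "H", "H", "H", "H"],
                     [(0, k, "single") for k in range(1, 5)])
molecular_vsa = sum(methane)
```

No coordinates enter the calculation: only element types, bond orders and the two parameter dictionaries are needed.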

Descriptors. Suppose that for each atom i in a molecule we are given a numeric property Pi. Our fundamental idea is to create a descriptor for a specific range [u,v) of the property values P [19]; this descriptor will be the sum of the atomic VSA contributions of each atom i with Pi in [u,v). More precisely, we define the quantity P_VSA(u,v) to be

where Vi is the atomic contribution of atom i to the VSA of the molecule (defined in the previous section). We now define a set of n descriptors associated with the property P as follows:

where a0 < ak < an are interval boundaries such that [a0,an) bound all values of Pi in any molecule. Each VSA-type descriptor can be characterized as the amount of surface area with P in a certain range. If, for a given set of descriptors, the interval ranges span all values, then the sum of the descriptors will be the VSA of the molecule. Therefore, the VSA-type descriptors correspond to a subdivision of the molecular surface area.
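The two definitions described in words above can be written compactly as follows. This is a reconstruction from the surrounding text (the sum of atomic VSA contributions over atoms whose property falls in a half-open interval); the use of the delta-function notation here is our choice.

```latex
% Surface area carried by atoms whose property value lies in [u, v):
P\_VSA(u, v) = \sum_{i} V_i \, \delta\!\left( P_i \in [u, v) \right)

% The n descriptors associated with property P and boundaries a_0 < a_1 < \dots < a_n:
P\_VSA_k = \sum_{i} V_i \, \delta\!\left( P_i \in [a_{k-1}, a_k) \right),
\qquad k = 1, 2, \ldots, n

% When the intervals cover every value of P_i, the descriptors partition the surface:
\sum_{k=1}^{n} P\_VSA_k = \sum_{i} V_i
```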

Wildman and Crippen's [20] recent methods for calculating logP (octanol/water) and molar refractivity (MR) provide a good basis for VSA analogs of logP and MR because these methods were parameterized with atomic contributions in mind. Both methods assign a numeric contribution to each atom in a molecule. We implemented both methods in the SVL programming language of MOE. To determine the interval boundaries we obtained statistics on a database of 44,795 small organic compounds from the Maybridge [21] catalog (the entire October 1998 HTS database less 2,000 randomly selected compounds used for testing). We chose the interval boundaries so that the resulting intervals were equally populated over the database (resulting in non-uniform boundaries). This led to 10 descriptors for logP and 8 descriptors for MR; the respective interval boundaries for logP were (-inf, -0.4, -0.2, 0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, inf); the interval boundaries for MR were (0, 0.11, 0.26, 0.35, 0.39, 0.44, 0.485, 0.56, inf). The third set of descriptors we will define is based on atomic partial charge: the P property for each atom is the partial charge. We chose the Gasteiger [22] (PEOE) method of calculating partial charges which is based upon the iterative equalization of atomic orbital electronegativities. This method was implemented in MOE using the SVL programming language. Fourteen descriptors resulted from the use of uniform interval boundaries in (-inf, -0.3, -0.25, -0.20, -0.15, -0.10, -0.05, 0, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30, inf).
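The binning step can be sketched with the interval boundaries quoted above. The function name `p_vsa` and the input lists are hypothetical; the per-atom property values (logP or MR contributions, or PEOE charges) and the atomic areas Vi are assumed to have been computed elsewhere.

```python
import bisect
import math

# Interval boundaries as listed in the text.
SLOGP_BOUNDS = [-math.inf, -0.4, -0.2, 0, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, math.inf]
SMR_BOUNDS = [0, 0.11, 0.26, 0.35, 0.39, 0.44, 0.485, 0.56, math.inf]
PEOE_BOUNDS = [-math.inf, -0.3, -0.25, -0.20, -0.15, -0.10, -0.05, 0,
               0.05, 0.10, 0.15, 0.20, 0.25, 0.30, math.inf]

def p_vsa(values, areas, bounds):
    """Sum atomic areas into half-open bins [bounds[k], bounds[k+1]).

    values: per-atom property P_i; areas: per-atom VSA contribution V_i.
    Returns a list with len(bounds) - 1 descriptor values.
    """
    bins = [0.0] * (len(bounds) - 1)
    for p, v in zip(values, areas):
        k = bisect.bisect_right(bounds, p) - 1
        if not 0 <= k < len(bins):
            raise ValueError(f"property value {p} lies outside the boundaries")
        bins[k] += v
    return bins

# Usage with made-up charges and areas for a four-atom fragment.
charges = [-0.35, -0.05, 0.0, 0.31]
areas = [1.0, 2.0, 3.0, 4.0]
peoe_descriptors = p_vsa(charges, areas, PEOE_BOUNDS)
```

Because every atom lands in exactly one bin, the descriptors in each set add up to the molecular VSA.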

We have thus defined three sets of molecular descriptors:

  • SlogP_VSAk (10) intended to capture hydrophobic and hydrophilic effects either in the receptor or on the way to the receptor;

  • SMR_VSAk (8) intended to capture polarizability;

  • PEOE_VSAk (14) intended to capture direct electrostatic interactions.

Each of these descriptor sets is derived from, or related to, the Hansch and Leo descriptors with the expectation that they would be widely applicable. Taken together the VSA descriptors define, nominally, a 10+8+14=32 dimensional chemistry space.


Self-Correlation. A database of 2,000 structures from the Maybridge HTS database (not used in the statistics collection) was selected at random to test the correlation between the descriptors we have defined. A full linear correlation matrix was calculated to estimate the correlation between the PEOE_VSA, SlogP_VSA and SMR_VSA descriptors. Between the SMR_VSA descriptors, the largest r value was 0.6 (r2 = 0.36) which appeared once; the remaining pairs exhibited r values less than 0.27 (r2 = 0.07). Thus, the MR descriptors are, for the most part, weakly correlated with each other. Between the SlogP_VSA descriptors, the largest r value was 0.42 (r2 = 0.18) which appeared once; the remaining pairs exhibited r values less than 0.27 (r2 = 0.07). Thus, the logP descriptors are, for the most part, weakly correlated with each other. Between the PEOE_VSA descriptors, the largest r value was 0.65 (r2 = 0.42). Thus, the PEOE_VSA descriptors are, for the most part, weakly correlated with each other. Each of the three descriptor sets described herein has shown weak intra-set correlation; however, a question remains about inter-set correlation. In the full matrix, the inter-set correlation is generally weak; however, seven r values were larger than 0.7 (r2 = 0.49). At first glance, the sets seem to exhibit higher correlation than in the intra-set cases. However, it must be remembered that for a given molecule each PEOE_VSA, SlogP_VSA and SMR_VSA descriptor collection sums to the VSA of the molecule; hence, there is one less dimension than the nominal 14+10+8=32.
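The correlation screen can be sketched as below: compute the Pearson r for every descriptor pair and report the strongest pairs. The helper names and the column layout (one list of values per descriptor, one entry per compound) are assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient r of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def largest_correlations(columns, names, top=3):
    """Return the `top` descriptor pairs with the largest |r|."""
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = pearson(columns[i], columns[j])
            pairs.append((abs(r), names[i], names[j], r))
    pairs.sort(reverse=True)
    return [(a, b, r) for _, a, b, r in pairs[:top]]

# Usage with three toy descriptor columns over four compounds.
cols = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0], [1.0, -1.0, 1.0, -1.0]]
strongest = largest_correlations(cols, ["d1", "d2", "d3"], top=1)
```

Restricting `columns` to one descriptor set gives the intra-set figures; passing all 32 columns gives the full matrix discussed above.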

Correlation with Other Descriptors. To test the extent to which the VSA descriptors encode other popular descriptors, a database of 1,932 small organic compounds [23] with molecular weights in the range (28,800) was assembled. For each molecule the SlogP_VSA, SMR_VSA and PEOE_VSA descriptors were calculated as well as 64 other descriptors all of which were calculated using MOE version 1999.05. For each of the latter 64 descriptors a principal components regression was calculated to produce a linear model for each descriptor as a function of the 32 VSA descriptors. In all cases 31 of 32 principal components were retained. The results are summarized in the table below. Out of the 64 descriptors tested, 32 showed an r2 of 0.90 or better and 49 had an r2 of 0.80 or better and 61 showed an r2 of 0.5 or better. These results suggest that the 32 VSA descriptors encode much of the information contained in most of the 64 popular descriptors.
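A minimal principal components regression can be sketched with NumPy as follows. This is an illustration of the general technique, not the MOE routine: standardizing the descriptors before the decomposition and the function names are our assumptions.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Principal components regression.

    X: (compounds x descriptors) array; y: target values.
    Returns (coef, intercept) expressed in the original descriptor units.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean = X.mean(axis=0)
    x_scale = X.std(axis=0)
    x_scale[x_scale == 0] = 1.0           # leave constant columns unscaled
    Z = (X - x_mean) / x_scale            # standardized descriptors
    y_mean = y.mean()
    # Rows of Vt are the principal directions, ordered by singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T               # retained loadings
    T = Z @ V                             # component scores
    gamma, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    coef = (V @ gamma) / x_scale          # map back to descriptor space
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def r_squared(y, y_hat):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

Retaining 31 of 32 components, as in the text, would correspond to `n_components=31` with a 32-column descriptor matrix; dropping the last component removes the direction with the least variance.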

Free Energy of Solvation in Water. The free energy of solvation in water is the change in free energy upon transfer from gas phase to water phase. A database of 291 small organic molecules with associated experimental free energies of solvation was created from the literature [24]. Even though the SlogP_VSA descriptors are based upon a transfer free energy it must be pointed out that the descriptor values themselves are surface areas; hence, it is quite reasonable to attempt to create such a model. For each structure the PEOE_VSA, SlogP_VSA and SMR_VSA descriptors were calculated and a principal components regression was calculated using MOE. Descriptors with small (normalized) coefficients were discarded until only the PEOE_VSA2,8,13, SlogP_VSA1,2,3,4,6,8,10 and SMR_VSA2,8 descriptors remained. A principal components regression was calculated using the remaining 12 descriptors and the resulting r2 was 0.90 with an RMSE of 0.78 kcal/mol (see the scatter plot at right). The leave-one-out cross-validated r2 was 0.89 with an RMSE of 0.82 kcal/mol. A single leave-100-out cross-validation test produced a prediction r2 of 0.88 on the 100 randomly chosen compounds left out of the training.
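Leave-one-out cross-validation, as used for the figures above, can be sketched as follows. For brevity the sketch substitutes ordinary least squares for principal components regression, and it reports the cross-validated r2 as 1 - PRESS/SS_tot; both choices are our assumptions, since the text does not state the exact definition used.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def fit_linear(X, y):
    """Least squares with intercept via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    A = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    g = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    return solve(A, g)

def predict(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def loo_cv(X, y):
    """Refit with each compound left out in turn; return (q2, rmse)."""
    preds = []
    for i in range(len(y)):
        w = fit_linear(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        preds.append(predict(w, X[i]))
    press = sum((a - b) ** 2 for a, b in zip(y, preds))
    mean = sum(y) / len(y)
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - press / ss_tot, math.sqrt(press / len(y))
```

The leave-100-out test mentioned above follows the same pattern with a single random split instead of one refit per compound.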

Boiling Point. We assembled a database of 298 small organic molecules [25] with associated experimental boiling points taken from the CRC Handbook. For each molecule the SlogP_VSA and SMR_VSA descriptors were calculated with MOE. A principal components regression was estimated to create a linear model of the boiling point as a function of the 18 descriptors resulting in an r2 of 0.93. A scatter plot of the predicted vs. experimental boiling points showed a quadratic relationship. We then created a linear model of the square of the boiling point as a function of the same descriptors. Upon taking square roots of the predicted squared boiling point the resulting predictions were well correlated with the experimental values with an r2 of 0.96 and RMSE of 15.53 Kelvin. The leave-one-out cross-validated r2 was 0.94 with an RMSE of 21.37 Kelvin. The figure at right is a scatter plot of the predicted and experimental boiling points. A single leave-100-out cross-validation test produced a prediction r2 of 0.94 on the 100 randomly chosen compounds left out of the training.

Blood-Brain Barrier Permeability. Permeability at the blood-brain barrier is an important factor in the design of safer and more effective therapeutic compounds active in the central nervous system. For a given compound, the experimental determination of the ratio of concentration in the brain and concentration in the blood is a time consuming and expensive process requiring appropriate amounts of the pure compound often in radio-labeled form. We assembled a collection of 75 compounds and experimental log BB concentration ratios from the literature [26]. Acids were deprotonated and bases were protonated. The PEOE_VSA, SlogP_VSA, and SMR_VSA descriptors were calculated for each structure. A principal components regression was performed to estimate a linear model of log BB as a function of the descriptors. Descriptors with small (normalized) coefficients were discarded until only the PEOE_VSA4,9,13, SlogP_VSA1,2,3,5,8,9 and SMR_VSA1-5 descriptors remained. A principal components regression was calculated using the remaining 15 descriptors and the resulting r2 was 0.83 with an RMSE of 0.32. The leave-one-out cross-validated r2 was 0.73 with an RMSE of 0.43. The figure at right is a scatter plot of the predicted and experimental log BB concentration ratios.

Solubility in Water. A collection of 1,438 small organic molecules with associated experimental water solubilities was assembled from the Syracuse Research Corporation (SRC) 1999 physical property database [27]. The SRC database contains experimental and estimated solubilities; accordingly, we selected all compounds with experimental values measured at 25 degrees C. The solubility measurements were converted to a log scale. The PEOE_VSA, SlogP_VSA and SMR_VSA descriptors were calculated for each structure (a total of 32 descriptors). A principal components regression model was calculated (31 principal components were retained) and the resulting r2 was 0.75 with an RMSE of 2.4. The leave-one-out cross-validated r2 was 0.74 with an RMSE of 2.5.

Vapor Pressure. A collection of 1,771 small organic molecules with associated experimental vapor pressure was assembled from the Syracuse Research Corporation (SRC) physical property database. The SRC database contains experimental and estimated vapor pressures; accordingly, we selected all compounds with experimental values measured at 25 degrees C. The vapor pressures were converted to a log scale. For each molecule the SlogP_VSA, SMR_VSA, PEOE_VSA descriptors were calculated with MOE. A multiple linear PCA regression model was calculated. 31 principal components were retained and the resulting r2 was 0.88 with an RMSE of 2.1. The leave-one-out cross-validated r2 was 0.87 with an RMSE of 2.2.

Receptor Class Discrimination. We chose the database of 455 compounds, each active against one of 7 receptors, described in Xue et al. [28] (used to develop a set of substructure keys for clustering applications). The database consisted of 7 fairly congeneric classes:

  • Class 1: Serotonin receptor ligands
  • Class 2: Benzodiazepine receptor ligands
  • Class 3: Carbonic anhydrase II inhibitors
  • Class 4: Cyclooxygenase-2 (Cox-2) inhibitors
  • Class 5: H3 antagonists
  • Class 6: HIV protease inhibitors
  • Class 7: Tyrosine kinase inhibitors

We tested the utility of the VSA descriptors for compound discrimination using the Binary QSAR methodology [29] (a Bayesian inference method of classification). A total of seven Binary QSAR models were made as follows. Model i was trained on a data set consisting of "active" molecules (those that were active against receptor i) and "inactive" molecules (those that were not active against receptor i). MOE was used to calculate the SlogP_VSA and SMR_VSA descriptors for each of the molecules in the database. A Binary QSAR model was constructed from these descriptors in an effort to predict membership in class i. The accuracy of prediction and the p value (probability of a chance occurrence) for each model was found to be:

  • Class 1: 98.7% p=0.003
  • Class 2: 96.7% p=0.043
  • Class 3: 96.5% p=0.290
  • Class 4: 98.7% p=0.001
  • Class 5: 98.7% p=0.014
  • Class 6: 98.7% p=0.012
  • Class 7: 99.1% p=0.002

Each of the models exhibited high accuracy and all but one (class 3: carbonic anhydrase II inhibitors) exhibited high significance in the chi-squared significance test. A similar classification model was built using the CART methodology with similar but lower accuracy (results not shown).
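The one-against-the-rest labeling and the two reported statistics can be sketched as below. The classifier itself is not shown (Binary QSAR is not reimplemented here). The chi-squared calculation is the standard Pearson test on the 2x2 table of actual versus predicted labels, which is our assumption about the form of the significance test.

```python
import math

def one_vs_rest(classes, target):
    """Label compounds of class `target` active (True), all others inactive."""
    return [c == target for c in classes]

def confusion(actual, predicted):
    """Return (tp, fp, fn, tn) for two lists of booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def chi_squared_p(tp, fp, fn, tn):
    """Pearson chi-squared statistic and p value for a 2x2 table (1 dof)."""
    n = tp + fp + fn + tn
    row1, row2 = tp + fn, fp + tn      # actual active / inactive
    col1, col2 = tp + fp, fn + tn      # predicted active / inactive
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0                # degenerate table: no evidence
    chi2 = n * (tp * tn - fp * fn) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-squared distribution with 1 dof.
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))
```

Accuracy and significance measure different things: a model that labels nearly every compound inactive scores high accuracy on a small class while the chi-squared test finds little association, which is one way a pattern like that of class 3 can arise.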

Activity Against Thrombin, Trypsin and Factor Xa. To test the VSA descriptors in linear QSAR modeling, we used a set of 72 analogs described in Bohm et al. [30] with pKi data for each of thrombin, trypsin and factor Xa. One such structure is depicted to the right. The PEOE_VSA, SlogP_VSA and SMR_VSA descriptors were calculated for each structure and a principal components regression was calculated for each receptor using MOE. For each activity model, descriptors with small (normalized) coefficients were discarded. Using the remaining descriptors a principal components regression was calculated. For thrombin, a 10 descriptor model using PEOE_VSA1,2,5,8,10,11,12 and SlogP_VSA1,5,9 resulted in an r2 of 0.65 with an RMSE of 0.61 pKi (see figure below); the leave-one-out cross-validated r2 was 0.54 with an RMSE of 0.70 pKi. For trypsin, a 9 descriptor model using PEOE_VSA1,8,11,12, SlogP_VSA0,3,4,8 and SMR_VSA5 resulted in an r2 of 0.72 with an RMSE of 0.47 pKi (see figure below); the leave-one-out cross-validated r2 was 0.62 with an RMSE of 0.54 pKi. For factor Xa, a 15 descriptor model using PEOE_VSA1,2,8,9,12,14, SlogP_VSA5,7,8,10 and SMR_VSA3,4,5,6,8 resulted in an r2 of 0.69 with an RMSE of 0.35 pKi (see figure below); the leave-one-out cross-validated r2 was 0.52 with an RMSE of 0.45 pKi.


Rather than discuss the individual results in detail, we will concentrate the discussion on the notion of the 32 dimensional chemistry space defined by the entire collection. The presented linear correlations are not extraordinarily high, and, perhaps, it is too much to expect that a relatively small set of non-3D descriptors be capable of describing a given molecule in great detail in relation to a specific property. What is interesting is the fact that the same set of descriptors was used throughout. In this respect the results suggest that the collection of VSA descriptors could be put to good use in a cheminformatics context, for example in chemical diversity, combinatorial library design and high throughput screening data analysis.

Pearlman [31] has pointed out three problems with the use of "traditional" descriptors in the definition of a chemistry space:

  1. Orthogonality. Many "traditional" descriptors are highly correlated; a good set of descriptors should be as orthogonal as possible (indicating that each descriptor is encoding different properties from the others).

  2. Relevance. It can be argued that the "traditional" descriptors such as logP, pKa, etc. are more relevant to drug transport or pharmacokinetics rather than receptor affinity.

  3. Wholism. One might fear that the "traditional" descriptors are "whole molecule" properties which cannot distinguish the details of important substructural differences.

The orthogonality of a set of molecular descriptors is a very desirable property. Classification methodologies such as CART (or other decision-tree methods) are not invariant to rotations of the chemistry space. Such methods may encounter difficulties with correlated descriptors (e.g., production of larger decision trees). Often, correlated descriptors necessitate the use of principal components transforms which require a set of reference data for their estimation (at worst the transforms depend only on the data at hand and at best they are trained once from some larger collection of compounds). In probabilistic methodologies, such as Binary QSAR, approximation of statistical independence is simplified when uncorrelated descriptors are used. In addition, descriptor transformations can lead to difficulties in model interpretation. Our results strongly suggest that the VSA descriptors are weakly correlated with each other. As a consequence, we expect methodologies such as Binary QSAR, CART, Principal Components Analysis, Principal Components Clustering, Neural Networks, k-means Clustering, etc. to be more effective (when measured over many problem instances).

The notion of relevance to receptor affinity of a collection of descriptors is difficult to quantify. The assertion that "traditional descriptors (e.g., logP and pKa) are strongly related to drug transport or pharmacokinetics but are very weakly related to receptor affinity or activity..." has to be considered with some care. One says that a descriptor is strongly related to a particular property when effective QSPR models of the property have been made using the descriptor. The failure to produce a QSAR/QSPR model using a descriptor is not, in general, evidence of a lack of relevance (for example, the fault could lie with the mathematical model and not the descriptors). The relevance of descriptors must be evaluated either from theoretical considerations or long-term empirical success. Recent work [31] has indeed suggested that the underlying atomic contributions to partial charge, molar refractivity and logP are relevant to receptor affinity. Our results suggest that the presented VSA descriptors are useful not only for physical property modeling but also in receptor affinity modeling. The fact that the same set of descriptors was used with different methodologies (Binary QSAR, linear regression, CART) suggests that it is the descriptors themselves that are encoding relevant information for both classification (compound distinction) and binding affinity QSAR. Our results seem to bear out the intuition that contact surfaces describing hydrophobicity, refractivity (polarizability) and charge localization are relevant to many molecular properties, including receptor affinity. It is an added advantage that the underlying properties used in the definition of the VSA descriptors are relevant to drug transport or pharmacokinetics.

The idea that descriptor wholism is undesirable is a subtle one. It is difficult to quantify the wholism of a descriptor. A qualitative definition might be that a "whole molecule" property is one in which small bioisosteric modifications to the structure lead to large changes in the descriptor value. It is interesting to note that BCUT values, extensions of Burden numbers [33] derived from graph adjacency or distance matrix eigenvalues, are likely to exhibit far more wholism than more group-additive properties (such as logP and free energy of solvation). Nevertheless, BCUT values have shown utility in QSAR/QSPR studies [34] and diversity work. Descriptors such as HOMO and LUMO energies are very wholistic and even these have been used successfully in QSAR work. Intuitively, it seems that group-additive descriptors should be better for compound classification; however, it is difficult to be sure. Notwithstanding these considerations, the VSA descriptors we have described are derived from what are widely believed to be "whole molecule" properties. It is hoped that the surface area subdivision will effect a reductionist conversion suitable for QSAR work and compound classification. The atomic VSA contributions are sensitive to connectivity and the properties considered (logP, MR, and charge) are sensitive to the chemical context of each atom. Moreover, each of the VSA descriptors is fundamentally additive in nature which suggests a more reductionist than wholist character. The high correlations seen when modeling other descriptors such as number of nitrogens, number of oxygens and number of aromatic atoms support this reductionist assertion.

If we take it as true that the presented VSA descriptors form a (relatively) low-dimensional chemistry space encoding trend information for many properties of interest, we can consider cheminformatics applications. Compound selection methods based upon chemical diversity are likely to benefit from the VSA descriptors: not only are physical properties taken into account, but also properties relevant to binding, transport and pharmacokinetics. Setting aside diversity-based methodologies, we now consider the application of the VSA descriptors to high throughput screening (HTS) QSAR and focused combinatorial library design.

The automation of physical experiments through robotics to effectively perform hundreds of thousands or millions of experiments in a short time has opened the door to a large-scale approach to drug discovery. High throughput screening and combinatorial chemistry offer access to a huge set of candidate structures; however, time and economic considerations require a selection of only a subset of this vast space for physical testing. Unfortunately, most people (if not all) find it very difficult to interpret all of the HTS data when effecting a focused combinatorial library design. HTS QSAR is an alternative to human inspection of HTS data. In this alternative, a set of HTS results are considered to be "understood" if an effective QSAR model can be constructed (by effective, we mean statistically significant). The activity of new compounds (for example, in a proposed library) can be predicted with the model. The PEOE_VSA descriptors have been used quite successfully in several HTS QSAR attempts [35] using the Binary QSAR method. Accuracy levels of 40%-70% have been routinely observed on active compounds on data sets with hit rates well below 1% (inactives usually exhibit >90% accuracy). It is hoped that the SlogP_VSA and SMR_VSA descriptors will improve the accuracy levels (although the PEOE_VSA accuracy levels still resulted in significant enrichment when compared to the hit rate). Such a study will be presented in a future publication.

Suppose, now, that a statistically significant HTS QSAR model has been constructed. Further suppose that a proposed virtual combinatorial library L is made up of m substituent libraries R1,...,Rm. Thus, the number of compounds in L is |R1| x ... x |Rm|. Even moderately sized substituent libraries can result in extremely large product libraries. In such a case some method is required to select a subset of each Ri for physical synthesis. An obvious criterion is to select those members of the substituent libraries that are most likely to result in products that are active against some target. Let rij be the j-th member of the i-th substituent library (Ri). We can use a first-order ranking of the members of Ri to select the most promising members. One such ranking is the probability of observing the substituent in an active product, namely

(after an application of Bayes' theorem). We can assume that the members of Ri are equally likely so that this last formula simplifies to

The term Pr(active | Ri = rij) is precisely the output of the Binary QSAR methodology; that is, one can use the VSA descriptors and the HTS results as input to Binary QSAR and obtain an estimate for Pr(active | Ri = rij) for each member rij of Ri. After the indicated division by the sum of these estimates, one would obtain a probability distribution over Ri. A simple design methodology would be to retain the top scoring members of each substituent library. One advantage of this sort of probabilistic ranking scheme is that in the event that L is so large that enumeration of all structures is prohibitive, random sampling techniques can be used to estimate the required probabilities without loss of theoretical soundness.
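
The ranking scheme just described can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in MOE: `product_score` is a hypothetical stand-in for a trained Binary QSAR model returning Pr(active) for an enumerated product, substituents are represented by plain strings, and the sampling routine is one plausible reading of "random sampling techniques" (the other substituent positions are drawn uniformly at random).

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def substituent_distribution(p_active: Dict[str, float]) -> Dict[str, float]:
    """Normalize estimates of Pr(active | Ri = rij) into Pr(Ri = rij | active).

    Assumes the members of Ri are equally likely a priori, so the prior
    cancels and only the division by the sum of the estimates remains.
    """
    total = sum(p_active.values())
    if total <= 0.0:
        raise ValueError("at least one estimate must be positive")
    return {member: p / total for member, p in p_active.items()}


def top_members(distribution: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-probability members of a substituent library."""
    return sorted(distribution, key=distribution.get, reverse=True)[:k]


def sampled_p_active(
    libraries: Sequence[Sequence[str]],
    i: int,
    member: str,
    product_score: Callable[[Tuple[str, ...]], float],  # hypothetical model output
    n_samples: int,
    rng: random.Random,
) -> float:
    """Monte Carlo estimate of Pr(active | Ri = member).

    Position i is fixed to `member`; every other position is sampled
    uniformly, so the full product library never has to be enumerated.
    """
    total = 0.0
    for _ in range(n_samples):
        product = [rng.choice(lib) for lib in libraries]
        product[i] = member
        total += product_score(tuple(product))
    return total / n_samples


if __name__ == "__main__":
    # Toy libraries and a toy scoring function (both assumed for illustration).
    libs = [["r11", "r12", "r13"], ["r21", "r22"]]
    toy_score = lambda prod: 0.6 if prod[0] == "r11" else 0.1
    rng = random.Random(0)
    estimates = {r: sampled_p_active(libs, 0, r, toy_score, 200, rng) for r in libs[0]}
    dist = substituent_distribution(estimates)
    print(top_members(dist, 1))
```

Retaining the top k members of each Ri in this way reduces the number of products to be synthesized from |R1| x ... x |Rm| to at most k^m.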

In summary, our results suggest the utility of the presented VSA descriptors for QSAR/QSPR studies and cheminformatics applications based either upon chemical diversity or bias. The success of the descriptors in modeling various properties does not seem tied to any particular numerical methodology. In particular, the Binary QSAR method used with the VSA descriptors provides a statistically sound method of focused combinatorial library design.


We have defined three sets of (easily calculated) molecular descriptors based upon atomic contributions to logP, molar refractivity and atomic partial charge. The individual descriptors were found to be weakly correlated with each other (over a suitably large collection of compounds). Moreover, the chemistry space determined by the new descriptors was capable of expressing (as linear combinations) traditional QSAR/QSPR descriptors. Reasonably good QSAR/QSPR models of boiling point, vapor pressure, free energy of solvation in water, water solubility, receptor class, and activity against thrombin, trypsin and factor Xa were built using only the new descriptors.

We conclude a) that the new descriptors are likely to be a very good starting point for QSAR/QSPR work; and b) that the collection of new descriptors may be a meaningful low-dimensional chemistry space for chemical diversity, HTS data analysis and combinatorial library design.


  1. Hansch, C., Fujita, T.; ρ-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure; J. Am. Chem. Soc., 86, 1616-1626 (1964)
  2. Leo, A., Hansch, C., Church, C.; Comparison of Parameters Currently Used in the Study of Structure-Activity Relationships; J. Med. Chem., 12, 766-771 (1969)
  3. Hogg, R.V., Tanis, E.A.; Probability and Statistical Inference; MacMillan Publishing Company, New York (1993)
  4. Hall, L.H., Kier, L.B.; The Molecular Connectivity Chi Indices and Kappa Shape Indices in Structure-Property Modeling; Reviews of Computational Chemistry, Vol. 2, D.B. Boyd and K. Lipkowitz, eds. (1991)
  5. Hall, L.H., Kier, L.B.; Electrotopological State Indices for Atom Types: A Novel Combination of Electronic, Topological, and Valence State Information; J. Chem. Inf. Comput. Sci., 35 (1995)
  6. Balaban, A.T.; Five New Topological Indices for the Branching of Tree-Like Graphs; Theoretica Chimica Acta, 53, 355-375 (1979)
  7. Petitjean, M.; Applications of the Radius-Diameter Diagram to the Classification of Topological and Geometrical Shapes of Chemical Compounds; J. Chem. Inf. Comput. Sci., 32, 331-337 (1992)
  8. Wiener, H.; Structural Determination of Paraffin Boiling Points; J. Am. Chem. Soc., 69, 17-20 (1947)
  9. Rogers, D., Hopfinger, A.J.; Application of Genetic Function Approximation to Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships, J. Chem. Inf. Comput. Sci., 34 (1994)
  10. Breiman, L., Friedman, J., Olshen, R.A., Stone, C.J.; Classification and Regression Trees; Wadsworth Inc., USA (1984)
  11. Pearlman, R.S., Smith, K. M.; Novel Software Tools for Chemical Diversity; Perspectives Drug Discovery Design, 9, 339-353 (1998)
  12. Todeschini, R., Lasagni, R., Marengo, E.; New Molecular Descriptors for 2D and 3D Structures. Theory; J. Chemometrics, 8, 263-272 (1994)
  13. Stanton, D.T.; Evaluation and Use of BCUT Descriptors in QSAR and QSPR Studies; J. Chem. Inf. Comput. Sci., 39, 11-20 (1999)
  14. Bayada, D.M., Hamersma, H., van Geerestein, V.J.; Molecular Diversity and Representativity in Chemical Databases; J. Chem. Inf. Comput. Sci., 39, 1-10 (1999)
  15. Jones, G., Willett, P., Glen, R.C.; A Genetic Algorithm for Flexible Molecular Overlay and Pharmacophore Elucidation; J. Comp.-Aid. Mol. Design, 9 (1995)
  16. Wodak, S.J., Janin, J.; Analytical Approximation to the Solvent Accessible Surface Area of Proteins; Proceedings of the National Academy of Sciences USA, 77, 1736-1740 (1980)
  17. Halgren, T.A.; MMFF94 The Merck Force Field; J. Comp. Chem., 17(5,6) (1996)
  18. MOE: The Molecular Operating Environment from Chemical Computing Group Inc. 1010 Sherbrooke Street W, Suite 910, Montreal, Quebec, Canada H3A 2R7.
  19. The expression [a,b) denotes the half closed interval {x : a ≤ x < b}.
  20. Wildman, S.A., Crippen, G.M.; Prediction of Physicochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci., 39(5), 868-873 (1999)
  21. Maybridge Chemical Company Ltd., Cornwall, PL34 OHW England. URL:
  22. Gasteiger, J., Marsili, M.; Iterative Partial Equalization of Orbital Electronegativity - A Rapid Access to Atomic Charges; Tetrahedron, 36, 3219 (1980)
  23. Compounds were selected by molecular weight from the mref.mdb database on the MOE 1999.05 CD-ROM.
  24. Viswanadhan, V.N., Ghose, A.K., Singh, U.C., Wendoloski, J.J.; Prediction of Solvation Free Energies of Small Organic Molecules: Additive-Constitutive Models Based on Molecular Fingerprints and Atomic Constants; J. Chem. Inf. Comput. Sci., 39, 405-412 (1999).
  25. The boiling point data can be made available upon request of the author.
  26. Luco, J.M.; Prediction of the Brain-Blood Distribution of a Large Set of Drugs from Structurally Derived Descriptors Using Partial Least Squares (PLS) Methodology; J. Chem. Inf. Comput. Sci., 39, 396-404 (1999)
  27. Syracuse Research Corporation, 6225 Running Ridge Road, North Syracuse, NY 13212. URL:
  28. Xue, L., Godden, J., Gao, H., Bajorath, J.; Identification of a Preferred Set of Molecular Descriptors for Compound Classification Based on Principal Component Analysis; J. Chem. Inf. Comput. Sci., 39, 699 (1999)
  29. Labute, P.; Binary QSAR: A New Method for Quantitative Structure Activity Relationships; Proceedings of the 1999 Pacific Symposium; World Scientific Publishing, Singapore (1999)
  30. Bohm, M., Sturzebecher, J., Klebe, G.; Three-Dimensional Quantitative Structure-Activity Relationship Analyses Using Comparative Molecular Field Analysis and Comparative Molecular Similarity Indices Analysis to Elucidate Selectivity Differences of Inhibitors Binding to Trypsin, Thrombin and Factor Xa; J. Med. Chem., 42, 458-477 (1999)
  31. Pearlman, R. S., Smith, K. M.; Metric Validation and the Receptor-Relevant Subspace Concept; J. Chem. Inf. Comput. Sci., 39, 28-35 (1999)
  32. Crippen, G.M.; VRI: 3D QSAR at Variable Resolution; J. Comput. Chem., 20, 1577-1585 (1999)
  33. Burden, F. R.; Molecular Identification Number for Substructure Searches; J. Chem. Inf. Comput. Sci., 29, 225-227 (1989)
  34. Stanton, D. T.; Evaluation and Use of BCUT Descriptors in QSAR and QSPR Studies; J. Chem. Inf. Comput. Sci., 39, 11-20 (1999).
  35. Labute, P.; Unpublished work (1998,1999)