Journal Articles

Binary QSAR: A New Technology for HTS and UHTS Data Analysis

P. Labute
Chemical Computing Group Inc.

The automation of physical experiments through robotics to effectively perform hundreds of thousands or millions of experiments in a short time has opened the door to a large-scale brute-force approach to drug discovery. This approach is generally called High Throughput Screening (HTS). The motivation behind this approach is to reduce, and possibly eliminate, time-consuming and costly manual interventions by physically synthesizing and testing a very large number of compounds. This HTS brute-force ideal can, perhaps, be realized when a few million compounds need to be tested; however, two factors will likely interfere with the HTS ideal:

  • The number of possible ligands. The number of stable drug candidates is not known. Estimates vary widely; however, the lowest estimate is that there are at least ten trillion reasonable candidates. Even if one million candidates can be tested per day, an exhaustive test of all candidates would require ten million days, or more than 27,300 years. Even at a throughput of one billion measurements per day, an exhaustive test would take over 27 years.

  • The economics of HTS. The cost per HTS measurement is not negligible. The average cost of raw materials and overhead results in a rate of approximately $5 per measurement. This is certainly a substantial improvement over manual measurement; however, a one-million-test-per-day rate results in a daily expenditure of $5 million (sufficiently high to warrant an attempt at cost reduction).
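The arithmetic behind these two estimates can be checked with a short sketch. The constants are the figures quoted above; the function name and the 365.25-day year are assumptions made only for this illustration.

```python
# Figures quoted in the text above.
CANDIDATES = 10 * 10**12        # lowest estimate: ten trillion candidates
COST_PER_MEASUREMENT = 5        # approximate cost in USD per HTS measurement
DAYS_PER_YEAR = 365.25          # assumption: average calendar year


def exhaustive_screen(rate_per_day):
    """Return (days, years, daily_cost_usd) for testing every candidate once."""
    days = CANDIDATES / rate_per_day
    years = days / DAYS_PER_YEAR
    daily_cost = rate_per_day * COST_PER_MEASUREMENT
    return days, years, daily_cost


if __name__ == "__main__":
    for rate in (1_000_000, 1_000_000_000):
        days, years, daily_cost = exhaustive_screen(rate)
        print(f"{rate:>13,} tests/day: {days:,.0f} days, "
              f"{years:,.1f} years, ${daily_cost:,.0f} per day")
```

Under the same assumptions, the total cost of an exhaustive screen does not depend on the throughput rate: ten trillion measurements at $5 each is $50 trillion.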

These two factors strongly suggest that "Brute Force HTS" will have to become "Smart HTS" rather quickly. In other words, to reduce the total number of experiments, an experiment/analysis cycle will have to be developed so that, for example, the results of an HTS run on 100,000 compounds are analyzed and used to determine the next 100,000 compounds to be tested.
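One possible shape for such an experiment/analysis cycle is sketched below. Everything in it is a hypothetical stand-in: the `library` of compounds with a single descriptor scaled to [0, 1), the `assay` callable that plays the role of an HTS run, and the simple binned hit-rate model used for the analysis step. It is not the method described in this article.

```python
def bin_index(value, n_bins):
    """Map a descriptor value in [0, 1) to one of n_bins equal-width bins."""
    return min(int(value * n_bins), n_bins - 1)


def run_cycle(library, assay, batch_size, rounds, n_bins=8):
    """Alternate between testing a batch and choosing the next batch.

    library: dict mapping compound id -> descriptor value in [0, 1)
    assay:   callable, compound id -> True (active) or False (inactive)
    """
    untested = sorted(library)
    # First batch: evenly spaced compounds, a crude proxy for "diverse".
    step = max(1, len(untested) // batch_size)
    batch = untested[::step][:batch_size]
    results = {}
    for _ in range(rounds):
        # Experiment: run the (hypothetical) assay on the current batch.
        for cid in batch:
            results[cid] = bool(assay(cid))
        untested = [cid for cid in untested if cid not in results]
        if not untested:
            break
        # Analysis: smoothed hit rate per descriptor bin from all results.
        hits = [0] * n_bins
        tested = [0] * n_bins
        for cid, active in results.items():
            b = bin_index(library[cid], n_bins)
            tested[b] += 1
            hits[b] += active
        rate = [(h + 1) / (t + 2) for h, t in zip(hits, tested)]
        # Selection: test the most promising untested compounds next.
        untested.sort(key=lambda cid: (-rate[bin_index(library[cid], n_bins)], cid))
        batch = untested[:batch_size]
    return results
```

In a toy library where all actives fall in one descriptor bin, the second batch is drawn almost entirely from that bin, so the hits are found after testing a fraction of the library rather than all of it.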

It is generally accepted that the structure, composition, or physical properties of a ligand directly affect its biological activity against a target. The attempt to transform this qualitative belief into a quantitative method of activity assessment is known as the determination of Quantitative Structure Activity Relationships (QSAR). Determining a QSAR generally proceeds as follows:

  1. Define a quantitative measure of activity (e.g., the amount of ligand needed to produce an interference with the functioning of the target).

  2. Express the ligand in some quantitative manner; that is, select a collection of numbers that characterize the ligand. These numbers are called molecular descriptors or, simply, descriptors.

  3. Determine a functional relationship between activity and the selected descriptors; that is, search for a mathematical function, f, that has the property that "activity = f (descriptors)" to a suitably high level of accuracy.

  4. Use the determined activity measure, molecular descriptors and determined functional relationship to predict the activity of new candidate ligands.

QSAR methods are used to generalize experimental data in order to design or optimize new biologically active compounds that are more potent, less toxic, more selective, or satisfy other relevant criteria. Currently, QSAR techniques are applied to relatively small data sets consisting of several tens, or perhaps several hundreds, of molecules for which activity measurements are available. These activity measurements are performed manually in the laboratory and produce relatively accurate measurements (e.g., IC50 numbers: the concentration of ligand required to attain 50% inhibition). The most widely used method of determining the functional relationship is the statistical technique of regression or least squares.
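The four steps can be illustrated with a minimal least-squares sketch for a single descriptor. The descriptor values and activities in the usage example are made up for illustration, and real QSAR models typically use many descriptors; the function names are assumptions, not part of any particular software package.

```python
def fit_line(descriptors, activities):
    """Ordinary least squares for activity = slope * descriptor + intercept."""
    n = len(descriptors)
    mean_x = sum(descriptors) / n
    mean_y = sum(activities) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptors)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(descriptors, activities))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, descriptor):
    """Step 4: apply the fitted relationship to a new candidate ligand."""
    slope, intercept = model
    return slope * descriptor + intercept


if __name__ == "__main__":
    # Hypothetical training data: one descriptor value and one measured
    # activity per ligand (steps 1 and 2).
    descriptors = [1.0, 2.0, 3.0, 4.0]
    activities = [3.0, 5.0, 7.0, 9.0]
    model = fit_line(descriptors, activities)   # step 3
    print(model, predict(model, 5.0))           # step 4
```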

It is natural and tempting to assume that all one needs to do is apply current QSAR methodology to the large-scale data sets of HTS, thereby providing the analysis portion of the proposed HTS experiment/analysis cycle. Unfortunately, two critical factors render the current QSAR technology practically useless for HTS:

  1. Precision Loss. HTS has given rise to the following trade-off: higher throughput reduces the precision of the activity measurement. Many HTS technologies report a binary condition: a candidate ligand is either "active" or "inactive." Some HTS technologies report a discrete measure; e.g., activity on a scale from 1 to 10. In either case, current QSAR technology requires a continuous activity measurement; e.g., accurate to 2 or 3 decimal places.

  2. Significant Error Rate. Many HTS techniques have the unfortunate property that the activity measurement is error prone. The error rate is significant enough to warrant special attention since current QSAR technology is very sensitive to outliers and errors. A significant error rate will neutralize the predictive capabilities of current QSAR technology.
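The sensitivity of least squares to measurement errors can be seen in a small made-up example. The data are hypothetical and chosen only so that the effect is easy to verify by hand; the function name is an assumption for this sketch.

```python
def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical descriptor values and error-free activities (activity = 2 * x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
clean = [2.0, 4.0, 6.0, 8.0, 10.0]

# The same data with a single erroneous measurement on the last compound.
corrupted = [2.0, 4.0, 6.0, 8.0, 0.0]

clean_slope = ols_slope(xs, clean)
corrupted_slope = ols_slope(xs, corrupted)

if __name__ == "__main__":
    print(clean_slope, corrupted_slope)
```

In this toy case one bad measurement out of five is enough to erase the fitted trend: the slope drops from 2 to 0.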

The problems do not lie with the concepts of QSAR itself but with the underlying mathematical techniques used to determine the functional relationship between structure and biological activity. Indeed, the fundamentals of QSAR are a promising avenue for HTS data analysis.

Chemical Computing Group Inc. has recently developed (and has sought patent protection for) a new technology, called QuaSAR-Binary™, designed to analyze the binary results of HTS and make predictions regarding the biological activity of untested compounds. This new "QSAR for HTS" methodology successfully uses error-prone binary activity measurements as input. The new technology has several important and immediate applications:

  • Prioritization of HTS Experiments. Rather than test, for example, 5,000,000 compounds in a single run, break up the set of 5,000,000 compounds into lots of, say, 50,000 compounds. QuaSAR-Binary could then be used to estimate the number of active compounds, or hits, in lots that have not been tested. In this way hits are found earlier and subsequent HTS experiments are more focused. The sequence of HTS experiments proceeds from maximal diversity in the tested compounds to minimal diversity, focusing on the discovered hits.

  • Combinatorial Library Design. It is often the case that combinatorial chemistry techniques are used to create candidates for HTS experiments. Current combinatorial library design methods focus on maximizing the diversity of the resulting collection of compounds. Using QuaSAR-Binary facilitates the design of more focused combinatorial libraries: the data from an HTS experiment is used to bias the combinatorial library towards diverse, but active, compounds.

  • Virtual Database Screening. Once a QuaSAR-Binary analysis is performed on HTS results, the resulting data model is used to search for other active compounds in corporate or supplier databases. It might also be used to look for active compounds among the proprietary compounds of competitors in order to estimate the speed of their response.

  • Virtual Synthesis. A QuaSAR-Binary analysis could be used to follow reaction pathways and predict the activity of products. Current reaction databases are sufficiently rich to allow simultaneous virtual synthesis and virtual screening. As experimental data is produced by HTS experiments, new QuaSAR-Binary models are produced and used to evaluate virtually synthesized compounds.
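The prioritization idea in the first application can be sketched as follows, assuming that some binary model has already assigned each untested compound a predicted probability of being active. The lot names and probabilities below are hypothetical; the expected number of hits in a lot is simply the sum of those probabilities.

```python
def rank_lots(lots):
    """Rank untested lots by expected number of hits, highest first.

    lots: dict mapping lot name -> list of predicted P(active) values,
          one per compound in the lot (assumed to come from a binary model).
    Returns a list of (lot name, expected hits) pairs.
    """
    expected = {name: sum(probs) for name, probs in lots.items()}
    return sorted(expected.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical predictions for three very small lots.
    lots = {
        "lot_A": [0.1, 0.2],
        "lot_B": [0.9, 0.5],
        "lot_C": [0.3, 0.3],
    }
    for name, hits in rank_lots(lots):
        print(f"{name}: {hits:.1f} expected hits")
```

The lot with the highest expected hit count would be screened first, and the model would be rebuilt once its results are available.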

These applications need not run sequentially; in fact, a parallel implementation would exploit the HTS data in many diverse ways.

QuaSAR-Binary is fast enough to keep pace with the HTS experiments themselves. This timely production of HTS analyses means that QuaSAR-Binary will not be the bottleneck in the HTS experiment/analysis cycle.

QuaSAR-Binary is a fundamental departure from the empirically fitted functional relationship methods of traditional QSAR methodology. Rather than fitting the parameters of a model to experimental data, QuaSAR-Binary builds predictive binary models through the use of large-scale probabilistic and statistical inference. Because data fitting is not used, the predictive capacity of QuaSAR-Binary is not interpolative, but based on generalizations substantiated by the experimental data. Arguably, QuaSAR-Binary analyzes data and makes predictions similar to the way a scientist would: by examining past experience, weighing the alternatives and making a recommendation regarding what to do next.
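This article does not spell out the algorithm, so the following is only a generic illustration of probabilistic inference on binary activity data: a naive Bayes estimate over binned descriptors. It should not be read as the QuaSAR-Binary method itself. The binning, the smoothing constants, and all names are assumptions made for the sketch.

```python
import math
from collections import defaultdict


def train(samples, n_bins):
    """Count how often each descriptor bin occurs among actives and inactives.

    samples: list of (bins, active) pairs, where bins is a tuple holding one
             bin index per descriptor and active is True or False.
    """
    class_count = {True: 0, False: 0}
    bin_count = {True: defaultdict(int), False: defaultdict(int)}
    for bins, active in samples:
        class_count[active] += 1
        for i, b in enumerate(bins):
            bin_count[active][(i, b)] += 1
    return class_count, bin_count, n_bins


def p_active(model, bins):
    """Estimate P(active | bins) with Bayes' theorem and add-one smoothing."""
    class_count, bin_count, n_bins = model
    total = class_count[True] + class_count[False]
    log_score = {}
    for label in (True, False):
        # Smoothed prior probability of the class.
        score = math.log((class_count[label] + 1) / (total + 2))
        # Descriptors are treated as independent given the class.
        for i, b in enumerate(bins):
            score += math.log((bin_count[label][(i, b)] + 1)
                              / (class_count[label] + n_bins))
        log_score[label] = score
    top = max(log_score.values())
    num = math.exp(log_score[True] - top)
    return num / (num + math.exp(log_score[False] - top))
```

Because the estimate is built from counts, a modest number of mislabeled compounds shifts the predicted probability rather than invalidating the model. For example, with ten "active" and ten "inactive" training compounds in which two labels of each class are wrong, the predicted probability for the bin favored by the actives is 0.75 rather than a value near 1.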

Chemical Computing Group plans to apply this new technology to other binary criteria relevant to the pharmaceutical industry. Any True/False or Pass/Fail criterion is subject to QuaSAR-Binary analysis provided that there is sufficient experimental data available. Possible applications include drug-likeness ("is drug-like"), toxicity, and bioavailability. The generality of this new technology and its technical foundations will allow it to be applied to a wide variety of data analysis problems. QuaSAR-Binary will have a profound impact not only on accelerated drug discovery but also on the analysis of complex biological systems.