Molecular Databases and MOE

by J. Demers and S. Murray

This article describes how molecular databases are generally used and how MOE and its graphical Database Viewer provide a quick and easy way to access, manage and display large amounts of textual, numerical or structural data. It first looks at how MOE manages data, then describes the Database Viewer and the types of applications it includes.

A Few Words on Databases

By definition, a database is a large collection of interrelated data. In more visual terms, a database is essentially a table where each row, or "entry," contains different kinds of information on one specific item and each column, or "field," contains the same kind of information across the many items in the table. Databases have two primary uses in MOE:

  1. They serve as custom data sets for computations.
  2. They store the output of any process that generates large amounts of data.

Data sets can be customized using MOE's import, merge or database calculator facilities, and results can be managed with its sort, selection and statistical analysis tools. To ensure consistency, the database services provide a structured and unified file format, referred to as MDB -- Molecular Database File.

MOE includes an array of applications that exploit the full potential of molecular databases:

Conformational Analysis

The MDB file is used to contain the various molecular conformations generated by way of systematic conformation search, RIPS or Hybrid Monte Carlo Trajectory Generation.

Molecular Dynamics

The MDB file is used to store the sampled trajectory as well as instantaneous thermodynamic measurements. If desired, molecules can subsequently be loaded and displayed in the 3D rendering window using the Database Browser or Molecular Dynamics Animator.

Homology Modeling

The MDB file is used to house the loop dictionary, rotamer libraries and output models of homology modeling programs.

QuaSAR Suite of Applications

The MDB file is used to store the conformations, activity data and calculated molecular descriptors.

Enter the MOE Molecular Database

In a sentence, a MOE molecular database is a binary file that stores a list of data records. What distinguishes it from other products on the market is that it combines the strength of a database with the working environment of a spreadsheet. Scientists can use MOE not only to import, store and manage large quantities of data but also to perform calculations on that data, for instance calculating 2D or 3D molecular descriptors using QuaSAR.

Examples of MOE high throughput applications using molecular databases:

  • Import and clean up data
  • Generate 3D structures
  • Merge structures and biological data
  • Calculate descriptors
  • Analyze statistics
  • Clustering and subset selection
  • Fingerprinting
  • Library enumeration
  • Binary, CART and PLS QSAR
  • Focused library design

One of the tools used to analyze statistics is the correlation plot (shown below), which is part of MOE's graphical Database Viewer. Outliers were selected using the mouse.

MOE Correlation Plot

More MOE Database Features

MOE includes some of the most sought-after database features in the industry:

Platform Independence

Because MOE is platform independent, its molecular databases are also cross-platform: they can be read, saved, copied or edited on any type of machine, from Sun, DEC Alpha and SGI workstations to PCs.

Easy File Import

Users can import various database formats such as Delimited ASCII, SD files, RG files and Tripos files, as well as export data from MOE into different formats.

Friendly GUI

A graphical user interface, the "Database Viewer," simplifies data input and extraction and provides visual, i.e., structural, information on molecules. For intensive computations that require a lot of time but need not be visualized, MOE can be run in a batch mode which allows users to launch the database operations from the command line, doing away with the graphical interface.

Large Data Sets

MOE can access, store and manage large quantities of data. Each cell in the database can store up to 2 gigabytes of information. Sophisticated data compression techniques are used to store 3D conformations of molecules. Topological and conformational molecular data is stored using an average of 7 bytes per atom for small molecules and 8 bytes per atom for biopolymers. For example, a database of 65,000 small molecules can be displayed with ease even on a PC. And, finally, there are no preset limits on the number of molecules in a single database.
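
As a rough illustration of the per-atom figures above, the following Python sketch estimates the space needed for structure data alone. The function name and the average of 40 atoms per small molecule are assumptions made for this example; they do not come from MOE.

```python
# Average bytes per atom quoted in the text for MDB structure data.
BYTES_PER_ATOM_SMALL = 7
BYTES_PER_ATOM_BIOPOLYMER = 8


def estimate_structure_bytes(n_molecules, atoms_per_molecule, bytes_per_atom):
    """Approximate bytes needed for structure data only (other fields excluded)."""
    return n_molecules * atoms_per_molecule * bytes_per_atom


# Assumption: 40 atoms per small molecule on average (hypothetical value).
size = estimate_structure_bytes(65_000, 40, BYTES_PER_ATOM_SMALL)
print(f"{size / 1e6:.1f} MB")  # prints "18.2 MB"
```

Under that assumption, the 65,000-molecule example would need on the order of 18 megabytes for its structures, which is consistent with the claim that such a database is manageable on a PC.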

Customization

Users can design their own database applications using the SVL programming language (Scientific Vector Language) built into MOE. SVL contains functions to manage, read and write molecular databases that can be included in SVL programs. A step-by-step demonstration is given in Using Molecular Databases in SVL Programs at the end of this article.

Graphical Database Viewer

The MOE Database Viewer is first and foremost a container for molecular conformations and related data. One of its distinguishing characteristics is that it is a direct window onto the database file on disk, i.e., it continually reflects the actual contents of the disk. For example, the MOE Molecular Dynamics simulation uses a database as its output file. A Database Viewer opened onto that database is automatically updated each time a conformation is written to the database. Sophisticated caching techniques are used to deliver real-time response even though the bulk of database data lies on disk. This enables the display of very large databases without consuming an inordinate amount of memory. For users, this means quick and easy access to data.
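
The article does not describe how MOE's caching works. As a generic illustration of the idea only, the Python sketch below keeps recently used entries in memory and fetches the rest on demand, using a least-recently-used policy. The class name, the `load_entry` callable and the capacity value are all hypothetical.

```python
from collections import OrderedDict


class EntryCache:
    """Keep the most recently used entries in memory; load the rest on demand."""

    def __init__(self, load_entry, capacity=1000):
        self._load_entry = load_entry  # hypothetical callable: key -> record read from disk
        self._capacity = capacity
        self._cache = OrderedDict()

    def get(self, key):
        if key in self._cache:
            self._cache.move_to_end(key)     # mark as most recently used
            return self._cache[key]
        record = self._load_entry(key)       # cache miss: read from disk
        self._cache[key] = record
        if len(self._cache) > self._capacity:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return record
```

With a scheme of this kind, memory use is bounded by the cache capacity rather than by the size of the database file.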

The Database Viewer accesses, manages and displays the three most common types of data:

  • Molecular data (3D structures or SMILES strings)
  • Strings of characters
  • Numbers (bytes, integers, floats, doubles)

Handling 3D Molecules

The Database Viewer renders molecules as 3D structures which can be rotated and zoomed in on using the mouse. Display options such as showing hydrogens, element symbols, bond orders, etc., are based on user preferences. To take a closer look at the 3D structure of a molecule or examine a protein's sequence of residues, one can copy database molecules to and from the 3D rendering window.

Examples of molecular operations performed in the Database Viewer:

  • Comparing conformations of a single database molecule using the Unified View menu command which rotates and zooms all molecules in unison.
  • Substructure searching of molecular data using one or more criteria. MOE extracts each molecule from the database and counts the number of matches to the search string that are found in the molecule. Entries that match the specified selection criterion are selected in the Database Viewer.
  • Building or energy minimizing 3D structures for a collection of molecules:
    • In a database containing a molecular dynamics trajectory, it is often necessary to minimize the energy of each conformer in order to cluster the conformations.
    • If an ASCII file containing SMILES strings is imported into a MOE database, it is necessary to calculate the 3D structures from the SMILES strings in order to calculate 3D properties of the molecules.
    • If an SD file is imported, each molecule must be converted into a 3D structure, because SD files typically contain 2D drawings of molecules.

    Furthermore, the various conformations of the database molecules can be animated in the 3D rendering window using the MD Animator called from the Database Viewer.

Organizing and Analyzing Data

Various operations can be performed on numeric data and character strings in the Database Viewer:

  • MOE provides quick and easy sorting of text and numerical entries using one or more criteria and selection of unique entries.
  • Just like a standard hand-held calculator, the recently released Database Calculator evaluates equations using values in a molecular database. This is especially useful for applying mathematical transforms to molecular descriptors. Furthermore, users can map a custom-designed function and implement their own calculations.

    In the following example, the Database Calculator is used to calculate the negative log of all IC50 values in a database as a basis for a QuaSAR model. Using the Calculator is simply a matter of selecting the appropriate operator buttons in the panel and the field or fields to include in the equation. Results are then written to the database. In the present case, the destination field is named -log IC50.

    MOE Database Calculator
  • One can analyze data by computing correlation plots, correlation matrices, 3D plots and histograms for selected fields. Also, the Database Viewer provides the option of plotting numerical fields in a plot area directly below the data area.
  • The Merge Databases Wizard is used to combine two databases based on common data. One of its typical uses is merging compound data originating from different sources. In addition to merging databases based on textual or numeric matching criteria, such as compound registry numbers, MOE's Merge Wizard can also work upon molecular data (using standardized SMILES strings) to merge databases.

    MOE's Merge Databases Panel
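
Two of the operations listed above, a calculator-style field transform and a merge on a common field, can be sketched in Python as follows. This is an illustration of the ideas, not MOE's implementation: entries are modeled as plain dictionaries, the function names are hypothetical, and the logarithm is assumed to be base 10 (the text says only "negative log").

```python
import math


def add_neg_log_field(entries, source, dest):
    """Write -log10(source) into dest for every entry with a positive value."""
    for entry in entries:
        value = entry.get(source)
        if value is not None and value > 0:
            entry[dest] = -math.log10(value)
    return entries


def merge_on_key(left, right, key):
    """Combine entries from two tables whose key field matches."""
    right_by_key = {entry[key]: entry for entry in right}
    merged = []
    for entry in left:
        match = right_by_key.get(entry[key])
        if match is not None:
            merged.append({**entry, **match})  # fields from both sources
    return merged


# Hypothetical data: activities from one source, names from another.
activities = [{"reg_no": "A1", "IC50": 1e-6}, {"reg_no": "A2", "IC50": 1e-8}]
names = [{"reg_no": "A2", "name": "compound two"}]

add_neg_log_field(activities, "IC50", "-log IC50")
combined = merge_on_key(activities, names, "reg_no")
```

Here the merge keeps only entries present in both tables; merging on standardized SMILES strings would follow the same pattern with the SMILES string as the key.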

As an example of the Database Viewer and MOE's molecular data format (.MDB), let us now look at how MOE builds the PDB:

MOE Database Viewer

Shown here is a MOE molecular database containing the complete July 1999 edition (Release # 89) of the Protein Data Bank. Although the four CD-ROMs of the PDB contain over 10 000 compressed files requiring approximately 2 000 megabytes, the MOE MDB format requires only 440 megabytes on account of its efficient usage of disk space. As depicted in the snapshot, the database displays 3D renderings of molecules, which can be rotated and zoomed in on. Related data such as codes, titles and chains is also provided. The size of data cells can be adjusted using the mouse.
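
The size comparison quoted above works out as follows. The variable names are chosen for this example, and the two sizes are the approximate figures given in the text.

```python
pdb_compressed_mb = 2000  # approximate size of the compressed PDB files (from the text)
mdb_mb = 440              # size of the equivalent MOE database (from the text)

fraction = mdb_mb / pdb_compressed_mb
print(f"MDB is {fraction:.0%} of the original size, about {1 / fraction:.1f}x smaller")
```

In other words, the MDB file occupies roughly a fifth of the space taken by the already compressed source files.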

If so desired, the molecule selected in the Database Viewer (in the picture, phosphotransferase) can be loaded into MOE's 3D rendering window using the Copy to MOE command in the Cell popup menu.

Chemical Computing Group uses the MDB format to build the extensive protein database included with each MOE release. Using SVL, MOE examines each entry for breaks and missing atoms and extracts chain data. (For more information on this topic, please see Exhaustive and Iterative Clustering of the Protein Databank.) Chains are then written into a database and further refined using MOE sort and selection utilities.

The picture below shows chain and sequence data in the Database Viewer and MOE's sort data panel.

Database Viewer and Sort Data Panel

Using Molecular Databases in SVL Programs

Should you need to write an SVL application, it is more than likely that you will be using a database. This has the advantage that the format for methodology output is unified and can be manipulated with a common set of tools. In this case, when writing an SVL program, the first step is to open the database. Suppose, for example, that you want to open a database named confdb.mdb for reading and writing purposes. To do so, you would type the following at the command line:

	local mdb_key = db_Open ['confdb.mdb', 'read-write']; 

Here, db_Open returns the "key" of the database. A key is a number that serves to identify the database in subsequent database operations. The database key is temporary and destroyed when the database is closed. This key is necessary, for instance, to obtain the list of field names and field types in confdb.mdb as shown:

	local [field_names, field_types] = db_Fields mdb_key;

The following example demonstrates a typical use of a database: this piece of code defines a function which minimizes all small molecules in the field named mol of a given database, based on MMFF94 forcefield parameters. (Note: The parenthesized numbers at the beginning of some lines are given for explanatory reasons and are not to be included in the code. Please refer to the text below for explanations.)

(1)  function MM;

     function Min_MMFF mdb_name
(2)      local mdb_key = db_Open [mdb_name, 'read-write'];
(3)      local entry_key = 0;
(4)      local mol_data;

(5)      pot_Load '$MOE/lib/mmff94.ff';

(6)      while entry_key = db_NextEntry [mdb_key, entry_key] loop
(7)          mol_data = first db_ReadFields [mdb_key, entry_key, 'mol'];
(8)          local chains = first db_CreateMolecule mol_data;
(9)          MM [ gtest: 0.01 ];
(10)         mol_data = db_ExtractMolecule chains;
(11)         db_Write [mdb_key, entry_key, ['mol' : mol_data]];
(12)         oDestroy chains;
         endloop

(13)     db_Close mdb_key;
     endfunction

Explanations:

  • The first line introduces the SVL energy minimization function MM (Molecular Mechanics) (1).
  • Three variables are declared (2,3,4):
    • mdb_key: the database identifying key
    • entry_key: the variable which successively contains the entry key of each entry in the database
    • mol_data: the variable which successively contains the molecular data of each molecule in the database

    Like the database file key returned by db_Open, entry keys are used to reference each of the entries in the database. Think of the entry key as the entry's "social insurance number."

  • The first operation of the program is to load the MMFF94 forcefield in MOE (5). Each molecule in the database will then be minimized according to this forcefield.
  • A loop then successively assigns every entry key in the database to the variable entry_key (6).
  • For every entry_key, the value of the mol field is read and the information is stored in mol_data (7).
  • The molecule is then loaded into MOE (8) and minimization is performed (9).
  • Once the minimization has ended, the updated molecular information is extracted, put into mol_data (10) and written back to the database (11).
  • The current molecular system is then cleared to make room for the next minimization (12).
  • Finally, when every compound in the database has been minimized, the database is closed (13).
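
The iteration pattern used in the listing (start from key 0, ask for the next entry key, stop when a false value comes back) can be mimicked outside MOE. The Python sketch below is a hypothetical in-memory mock written for this explanation. It is not MOE's API, and its class and method names are invented. It requires Python 3.8 or later.

```python
class MockDatabase:
    """In-memory stand-in for a database addressed by integer entry keys."""

    def __init__(self, records):
        # Keys start at 1 so that 0 can mean "before the first entry"
        # on input and "no more entries" on output.
        self._records = {i + 1: dict(r) for i, r in enumerate(records)}

    def next_entry(self, entry_key):
        """Return the key following entry_key, or 0 when none is left."""
        later = [k for k in self._records if k > entry_key]
        return min(later) if later else 0

    def read_field(self, entry_key, field):
        return self._records[entry_key][field]

    def write_field(self, entry_key, field, value):
        self._records[entry_key][field] = value


def update_all(db, field, transform):
    """Visit every entry, transform one field and write the result back."""
    entry_key = 0
    while (entry_key := db.next_entry(entry_key)):
        value = db.read_field(entry_key, field)
        db.write_field(entry_key, field, transform(value))


db = MockDatabase([{"mol": "ccO"}, {"mol": "ccN"}])
update_all(db, "mol", str.upper)  # stand-in for the minimization step
```

The read, modify and write-back steps inside the loop correspond to steps (7) through (11) of the SVL listing.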

In Closing

When managing molecular databases, MOE combines the strong features of a database with the functionality of a spreadsheet. Because the Database Viewer reads data from disk on demand and caches it, MOE can work with very large databases without consuming an inordinate amount of memory, and its compact file format keeps disk usage low. This combination makes for quick and easy access to large quantities of data. One can import, store and manage substantial molecular, numeric and character data and perform intensive calculations such as calculating 2D or 3D molecular descriptors. All operations can be performed using the graphical Database Viewer, which contains molecular conformations and related data, or from a command line using MOE in batch mode.

For more information on MOE's molecular database format and the Database Viewer, please contact Chemical Computing Group.