
SD File Processing with MOE Pipeline Tools

Alex M. Clark and Paul Labute
April 2008
Chemical Computing Group Inc.
 
 

Introduction

In the past decade or so, the use of graphical user interfaces for computational life science software has become the norm. This is mainly due to the popularity of computers with the Macintosh OS and Microsoft Windows® operating systems. Even the Linux and older Unix workstations have developed their graphical interfaces to a reasonable degree. Windows, dialogs and menus have largely replaced "switches" and other more arcane command-line syntax because they are easier to use and remember. However, there is a renewed interest in command-line programs due to the popularity of pipeline and integration environments such as Scitegic's PipelinePilot®, Inforsense's KDE™ and the Konstanz Information Miner KNIME™. These systems provide a graphical "workflow" or flow chart style model in which computing elements are connected together with lines (or pipes) through which data flows:

In such systems the graphical interface helps connect different programs together and hides the details of the individual computing elements or programs. In a workflow model, command-line tools are easier to use and integrate than the more traditional graphical user interface programs. In some sense, workflow systems are merely interfaces to the Unix-style pipes that remember all of the details of the syntax needed to run the individual programs. As long as the individual programs being connected have a common exchange format they can be connected in complex workflows quite easily.

In the Molecular Operating Environment (MOE), there are a number of programs intended to be used in such environments (as well as on their own). These programs are the "SD Tools" which read and write MDL's Structure-Data (SD) files [MDL 2007]. Because of the standard input and output SD format, the MOE SD Tools can be piped together or with other programs that read and write SD files. The MOE SD Tools

  • are executable from MOE/batch, the non-graphical MOE;
  • exploit many cheminformatics algorithms found in MOE;
  • read one or more SD files and produce SD output files;
  • can be linked together using pipes;
  • can work with very large (greater than 4 GB) files;
  • can be invoked using the command line, within scripts, in web applications, or from third party software.

The SD format is, perhaps, the most reliable means of communicating (small molecule) chemical structures. There are other formats; however, most life science software will pay particular attention to getting SD files correct and most chemical database management systems use MDL software.
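To make the record layout concrete, the following is a minimal Python sketch of reading SD records. It relies only on two conventions of the SD format: each record ends with a `$$$$` line, and data fields appear after the `M  END` line as a `> <name>` header followed by value lines and a blank line. The function names are hypothetical, and the sketch assumes well-formed input; it does not interpret the connection table at all.

```python
def split_sd_records(text):
    """Split SD file text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    return records


def parse_record(lines):
    """Return (molecule name, data fields) for the lines of one record."""
    name = lines[0].strip() if lines else ""
    # Data fields follow the connection table, which ends at 'M  END'.
    start = lines.index("M  END") + 1 if "M  END" in lines else 0
    fields = {}
    i = start
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line and line.rindex(">") > line.index("<"):
            key = line[line.index("<") + 1:line.rindex(">")]
            i += 1
            value = []
            while i < len(lines) and lines[i].strip() != "":
                value.append(lines[i])
                i += 1
            fields[key] = "\n".join(value)
        i += 1
    return name, fields


# A toy record with an empty connection table and one data field.
sample = "\n".join([
    "compound-1", "  sketch", "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    "> <salt>", "[Cl-].[Cl-].O", "",
    "$$$$",
])
records = split_sd_records(sample)
name, fields = parse_record(records[0])
```

Because every record carries its own name and data fields, any program that preserves this layout can sit in the middle of a pipe without knowing what the programs on either side do.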

In this document, we will describe four of the MOE SD Tools:

  1. sdwash prepares SD files by carrying out a number of operations on the molecular data field, which include 2D depiction layout, hydrogen correction, salt and solvent removal, chirality and bond type normalization, tautomer generation, adjustment and enumeration of protonation states, and expansion of fragment abbreviations.

  2. sdfilter performs selective filtering of SD files, removing molecules which do not meet certain criteria, such as druglike/leadlike characteristics, or have calculated properties which fall outside of a specified range; e.g., acceptor/donor count, rotatable bonds, molecular weight, log P, etc.

  3. sdsort sorts SD files according to the molecular structure or by a data field, and can remove duplicates (taking tautomers into account) or compute differences between SD files.

  4. sddesc allows the calculation of some or all of the MOE molecular descriptors for each molecular entry, with the results stored in corresponding SD file data fields.

Collectively the MOE SD Tools provide functionality for maintaining catalogs of molecules, preparing structures for subsequent computational drug discovery applications, reducing database size and generating calculated properties for further analysis.

sdwash - Molecule Cleanup and Protonation State Enumeration

Typically, a collection of SD files will contain sketches of molecules drawn by different people using different stylistic conventions (especially if these SD files are from different compound vendors). In addition, the molecules will be drawn in some neutral form which is often not representative of the substance in solution or bound to a receptor. The sdwash program is designed to automatically identify and correct a number of common stylistic problems and to produce relevant protonation states for each molecule. The program can (among other things)

  • Identify and/or translate group abbreviations into explicit structures; e.g., tBu, Boc, etc.
  • Enumerate ionization and tautomer forms.
  • Normalize stereochemistry notation (wedges and parities).
  • Normalize ylide and hypervalent bond notation.
  • Depict chemical structures using a high quality structure layout method.
  • Disconnect covalent salts and/or remove salt molecules.
  • Add or remove explicit hydrogen atoms.
  • Assign the contents of the "molecule name" associated with each SD file entry and/or remove SD data fields.

Like the other SD tools in MOE, sdwash is a command-line (pipeline) program that reads one or more SD files and writes out an output SD file (and possibly an SD file containing problem structures). The program can be used on its own or by chaining (piping) the input/output from one SD tool program to another; for example, "wash then filter" prior to writing the output SD file to disk.

Abbreviation Translation. The tendency to include functional groups as free-form abbreviations is a serious impediment to cheminformatic interpretation of SD files. By default, the MOE SD tools will ignore group abbreviations, and interpret the structures as well as possible, using a temporary placeholder atom for unknown symbols (usually a carbon atom). However, sdwash can accept a list of abbreviation translations in the form of name=fragment, where the fragment is a SMILES string [Weininger 1988], the root atom of which will be used to replace the atom in question. For example,

-abbrev D=[2H]
-abbrev T=[3H]
will convert deuterium and tritium symbols to the explicit isotopic forms and
-abbrev NO2=[N+](=O)[O-]
-abbrev OMe=OC
will convert nitro and methoxy abbreviations to explicit atom notation. Abbreviations are only substituted on terminal groups. When an abbreviation match is found, the corresponding fragment is embedded using the MOE depiction layout algorithm [Clark 2006] and is oriented onto the molecule:

It is often the case that a particular SD file will contain abbreviations that are not known in advance. Such structures can be identified using the "strict" mode of sdwash. Chemical structures for which unmatched abbreviations still remain can be filtered out for manual inspection and updating of the abbreviations list. The following workflow can be applied to an incoming SD file source:

Molecules containing all-element atoms, or which can be converted into such, will proceed directly to the results collection, while those with unknown abbreviations will be stored in the filtered collection. After updating the abbreviations list, the filtered collection can run through sdwash (treating it as a new source). For convenience, MOE includes a file containing a list of common abbreviations which may be used as-is, or modified as necessary.
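The triage described above can be sketched in a few lines of Python. This is only an illustration of the routing logic, not of how sdwash is implemented: the element list is deliberately truncated, the function names are hypothetical, and a molecule is reduced to a plain list of atom symbols (fragment embedding and layout are not modeled).

```python
# Assumption: a short element list, just enough for the example.
ELEMENTS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I", "Na", "K"}


def parse_abbreviations(options):
    """Turn 'name=fragment' strings into a {name: SMILES fragment} table."""
    table = {}
    for option in options:
        # Split on the first '=' only; the fragment itself may contain '='.
        name, _, fragment = option.partition("=")
        table[name] = fragment
    return table


def triage(atom_symbols, abbreviations):
    """Return ('results', expanded symbols) or ('filtered', unknown symbols)."""
    unknown = [s for s in atom_symbols
               if s not in ELEMENTS and s not in abbreviations]
    if unknown:
        return "filtered", unknown
    expanded = [abbreviations.get(s, s) for s in atom_symbols]
    return "results", expanded


table = parse_abbreviations(["D=[2H]", "NO2=[N+](=O)[O-]", "OMe=OC"])
ok = triage(["C", "C", "OMe"], table)
bad = triage(["C", "tBu"], table)
```

Here the molecule containing OMe goes to the results collection with the abbreviation expanded, while the one containing tBu is held back until a tBu entry is added to the table.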

Protonation State and Tautomer Enumeration. Typically, molecules in SD files are substance sketches or are drawn in forms that correspond to analytical compound verification. These structures often are not representative of the compounds in solution or as they might appear when bound to a receptor. The sdwash program provides three optional methods to address this issue:

  1. Tautomer enumeration. Generate all reasonable tautomeric forms.
  2. Protonation state enumeration. Generate all reasonable titration states.
  3. Strong acid/base filtering. Retain only ionized forms of strong acids and bases.

These procedures (to be described shortly) are rule-based and the rule set is intended to be robust and general. Rule-based enumeration methods were chosen (rather than a fixed scoring method) because a hypothetical receptor may create local pKa shifts substantially different from aqueous solution.

The sdwash program is capable of enumerating tautomers of input molecules in a variety of states, using a collection of tautomer transformation rules derived from Selzer [Selzer 2006]. The full collection of structures related by facile hydrogen migration is enumerated (possibly up to a maximum set size). The rules include common hydrogen shifts, such as keto-enol:

and hydrogen shifts across heterocycles:

and various functional group rearrangements:

While many tautomers are quite similar with regard to their anticipated properties, there can be profound differences in geometry, for example:

The tautomer on the left has the normal configuration for substituted pyridine rings, while that on the right has an unusual hydrogen placement, which has the benefit of being able to form a tight six membered ring involving a hydrogen bond, which could lead to a profoundly different 3D geometry.

An important effect of the presence of facile tautomeric interconversions is the lack of preserved stereochemistry at the sites involved. Consider the following two species:

The species on the left is the (R) enantiomer. However, since it is readily interconvertible with the species on the right, in which the prochiral center is planar, the chiral center is effectively racemic in aqueous solution. Conversely, the species on the right is the (Z) stereoisomer, but since the double bond is not preserved in all of the tautomers, the possibility for stereoinversion exists in this case also. Enumerating the tautomers of either of these starting points will produce results in which specific stereochemistry annotation at the affected points will have been removed (marked as "unknown").

While tautomer forms always retain the same net charge, an additional mode also produces equivalent resonance forms, and titrated species which interchange H+ with the solution (hence the term protomer). The rules which define when protons may be gained or lost are loosely based on known pKa values for organic functional motifs. Because the result is a set of molecular species, rather than a single choice, borderline cases are included in the output. This is of particular importance for ligands binding in protein active sites, as regional effective pKa may be significantly different from that of the solvent, and this cannot be predicted until the binding mode is known.

The histidine sidechain, for instance, has two common neutral tautomers, and one protonated variant (which has two major resonance forms, one of which is shown):

The anionic form of histidine is not generated, as the titration rules consider the deprotonated form to be unfavorable, and err on the side of concise output. The similarly substituted tetrazole, however, favors the deprotonated anion:

Small molecules which exhibit zwitterionic protonation behavior, as well as tautomers, can quickly multiply out the number of results, even when topological and resonance-equivalent forms are removed. (By default the tautomer and protomer enumeration steps exclude molecules which are topologically equivalent, but this may be disabled, which is appropriate when the incoming molecules already have meaningful 3D geometry.)

For this reason, the rules for allowed tautomer and protomer enumeration have been carefully crafted in order to produce only those forms which might be reasonably expected to exist in substantial population in aqueous solution.

Optionally, sdwash will retain only ionized forms for strong acids and bases such as carboxylic acids, aliphatic amines, acyl sulfonamides, tetrazole, etc. This procedure can be used to eliminate protonation states containing carboxylic acid in favor of carboxylate. In addition, this procedure can be used even when tautomer and protonation state enumeration is disabled. While this method does not deal with borderline acidic/basic groups, nor correlated protonation, the results are generally more representative of organic molecules in aqueous solution than conventional sketch forms, which are usually drawn in their neutral state. The sdwash program uses a predefined set of strong acid/base protonation rules; however, it is possible to replace or supplement the defaults with custom rules by specifying SMARTS strings to identify atoms which should be protonated, deprotonated or left alone. For example, the command line option

-rule none:[OH][NX3][CX3](=O)[#6]

would disable the deprotonation of hydroxamic acid.

Stereochemistry normalization. SD files have two redundant methods for specifying chirality at tetrahedral centers: chiral parity and wedge bonds. Many computer programs rely on one or the other specification and do not examine both methods; consequently, stereochemistry annotations may be lost or misinterpreted. Worse, in some cases conflicting annotations are written to SD files. The sdwash program can optionally recreate either parities or wedges if one is present and the other is not:

Chiral parity may be specified as even, odd, unknown or not applicable. Even and odd are conceptually similar to the "R" and "S" notation, except atom indices are used instead of CIP priority values. The chiral parity values can be derived from wedge bonds, except in rare cases where the hints provided by the flat drawing and up/down bonds do not determine an appropriate geometry. If instructed, sdwash will calculate missing parity values for chiral centers when wedge bonds are available. Likewise, if instructed, in cases where chiral parity values are given but no wedge bonds are drawn, the wedges are calculated and added to the drawing in an aesthetically reasonable manner.

Normalization of ylide and hypervalent bond notation. If requested, each structure will be interpreted in the context of charge-separated bonds; e.g., a single [X+]-[Y-] being equivalent to a double bond X=Y, in some cases. Such bonding arrangements will then be converted back to conventional norms used in sketch representations. The diagram below shows examples of valid ways to draw a nitro functional group, fully protonated phosphate, and a Wittig reagent:

In each of the three cases, the left-most form is chosen as being the most easily recognizable aesthetic representation, and alternative forms are converted to the left-most version regardless of the incoming notation.

Depiction Layout. If requested, the wash tool will replace the coordinates of each chemical structure with those computed by the MOE 2D depiction layout algorithm [Clark 2006]. The resulting molecular structures will then be suitable for presentation by appropriate sketch rendering software. The depiction option is appropriate for use regardless of the condition of the original coordinates; i.e., molecules with 2D or 3D coordinates produce the same results. Details of the MOE 2D depiction layout algorithm can be found at http://www.chemcomp.com/journal/depictor.htm.

Covalent salt disconnection. Group I metals which are drawn in covalently bonded form may be severed in order to produce the more common ionic representation:

The disconnection of salts is limited to simple cases such as that shown above. Non Group I metals cannot safely be modified, nor can metals with more than one ligand. If desired, non-major molecular components may be removed. This is usually equivalent to removing salts and adducts, though care must be taken if there is any possibility that the component of interest is not that with the largest number of heavy atoms. Optionally the removed components may be stored in an SD data field by their SMILES strings, for example:

input output
> <salt>
[Cl-].[Cl-].O

Explicit hydrogen atoms. Explicit hydrogen atoms can be added or removed by sdwash. New hydrogen atoms may be added to heavy atoms which are determined to have fewer than their allotted quota of hydrogens (according to valence rules). When removing hydrogens, only those which can be unambiguously re-determined will be removed, which means that inorganic elements, or elements in unusual bonding environments, will not have their hydrogen atoms deleted.

In the above example, the first structure is completely hydrogen suppressed. A total of 6 hydrogen atoms are added by sdwash. Given the second structure as input, only 5 of these hydrogen atoms would be removed by the deletion feature, on account of being relatively unambiguous. The hydrogen atom attached to S(VI) is left intact, for if it were removed, it might be recreated in different ways by different software; e.g., some programs may add no implicit hydrogens because of the higher-than-octet Lewis valence, while others may choose to add a hydrogen atom to an oxygen atom instead, producing the tautomeric S(=O)O[H] form.

Molecule names and data fields. Each entry in an SD file contains a "molecule name" field and possibly associated data fields such as registration numbers or property values. The sdwash program can replace the molecule name field with contents of a specified data field (e.g., the registration ID); also, the original molecule can be stored in a data field, in the form of a SMILES [Weininger 1988] string.

Normally, sdwash retains all SD file data fields for each molecule; however, sdwash has logic for removing specific fields, or patterns of fields, which may be supplied on the command line. Additional field data may be generated by various options, for example SMILES strings for molecular content, index identifiers and calculated descriptors. SD files with minor formatting errors in the data block are quite common, and can cause problems with some less robust reading software. The MOE SD tools are designed to be tolerant of format abuses on input, but strictly correct on output.

Example of sdwash. SD files, particularly those from independent catalog vendors, can be expected to exhibit some stylistic variety. For example, the representation of ylides, salts, adducts, explicit hydrogen atoms, wedge bonds vs. chiral parity codes, tautomers and use of mnemonic abbreviations can cause two equivalent molecules to be interpreted differently. Also, some nonstandard uses of the SD format can be corrected during an initial wash step.

The following Unix command uses sdwash to produce a washed catalog from a compound vendor:

$ sdwash catalog.sdf -o catalog_washed1.sdf -f catalog_filtered.sdf   \
        -removeH -ylide -chiral -wedge                                \
        -salts -compfield removed_comp                                \
        -strict @$MOE/lib/sdabbrev.txt -verbose

One input file is read, catalog.sdf, and two files are produced, catalog_washed1.sdf and catalog_filtered.sdf.

The -removeH flag causes explicit hydrogens to be removed if they are completely unambiguous. The -ylide flag normalizes certain types of charge separated vs. double bond notation; for example, three ways to draw nitro groups include [N+]([O-])=O, [N+2]([O-])[O-] and N(=O)=O, all of which will be converted into the first. Chiral centers may be denoted in two ways in the MDL MOL file format: by wedge bonds, and by atom-centered parity values. The -chiral flag calculates missing parities for chiral centers if wedge bonds are available, and -wedge calculates missing wedge bonds for chiral centers if parity values are available. The -salts flag causes covalently bound group I metals to be disconnected to make the preferred ionic form, and -compfield activates the removal of minor connected components, and causes those removed components to be stored as SMILES strings in the field removed_comp.

The -strict option causes molecules which have atom symbols that cannot be resolved into elements to be written to the catalog_filtered.sdf file instead of the main output catalog_washed1.sdf. The item @$MOE/lib/sdabbrev.txt includes a standard set of abbreviation translation rules so that common abbreviations such as Me, Et, Ph, etc. will be converted into their corresponding full chemical representations. (In general, the @ character specifies a file from which to read further command-line options to a MOE SD Tool; this facility provides a convenient means to save commonly used options.) The -verbose flag causes unknown abbreviations to be written to the terminal to facilitate the updating of the abbreviations list. Once the abbreviations list is updated, the catalog_filtered.sdf file can be reprocessed with

$ sdwash catalog_filtered.sdf -o catalog_washed2.sdf                  \
        -removeH -ylide -chiral -wedge                                \
        -salts -compfield removed_comp                                \
        -strict @$MOE/lib/sdabbrev.txt @new_abbrev.txt -verbose

which will produce catalog_washed2.sdf with the previously failing structures processed with the abbreviations in new_abbrev.txt. The two files catalog_washed1.sdf and catalog_washed2.sdf can be merged with the Unix cat program

$ cat catalog_washed1.sdf catalog_washed2.sdf > catalog_washed.sdf

if desired, or both files can be retained and used as input to other SD tools.

sdfilter - Elimination of Undesirable Compounds

Typically, large SD files (especially from compound vendors) contain undesirable compounds such as inorganic compounds, reactive species and overly flexible compounds. The sdfilter program removes records from an input collection of SD files based upon programmable criteria and writes the results to an output SD file. The fundamental operation of sdfilter is

  1. Read and concatenate all specified input files.
  2. Write records that satisfy the specified filters to the output file.
  3. Write records that don't satisfy the filters to the filter file.
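The three steps above amount to a single pass that routes each record to one of two destinations. A minimal Python sketch follows, under stated assumptions: records are plain dictionaries with precomputed properties, the filter names and thresholds are arbitrary illustrations rather than sdfilter defaults, and the function name is hypothetical.

```python
from itertools import chain


def route_records(inputs, filters):
    """Concatenate the inputs, then split records into (passed, rejected)."""
    passed, rejected = [], []
    for record in chain.from_iterable(inputs):
        # Collect the name of every filter the record fails.
        failed = [name for name, test in filters if not test(record)]
        if failed:
            rejected.append((record["id"], failed))
        else:
            passed.append(record["id"])
    return passed, rejected


catalog_a = [{"id": "A1", "mw": 180.2, "rotatable": 3},
             {"id": "A2", "mw": 712.9, "rotatable": 14}]
catalog_b = [{"id": "B1", "mw": 250.3, "rotatable": 9}]

# Illustrative thresholds only.
filters = [("mw", lambda r: r["mw"] <= 500),
           ("rotatable", lambda r: r["rotatable"] <= 8)]

passed, rejected = route_records([catalog_a, catalog_b], filters)
```

Keeping the list of failed filter names alongside each rejected record mirrors the idea of recording why a structure was rejected, which makes the filter file easier to audit.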

The following molecular properties can be tested by sdfilter. (In the table, the range type means a range of values encoded by X,Y for "X through Y", X+ for "X or more", or X- for "X or fewer".)

Property            Option           Type     Description
Drug-like           -druglike        boolean  The molecule is drug-like according to [Lipinski 1997].
Lead-like           -leadlike        boolean  The molecule is lead-like according to [Oprea 2000].
Nonreactive         -nonreactive     boolean  The molecule is not reactive according to [Oprea 2000].
Small Rings         -smallring       boolean  The molecule contains no ring of size 9 or more.
Chiral Atoms        -chiral          range    The number of chiral centers in the molecule.
Elements            -elements        list     The complete list of allowed elements for each molecule; e.g., H,C,N,O,F,S,P,Cl,Br,I.
H-bond Atoms        -donacc          range    The sum of hydrogen bond donors and acceptors in the molecule.
H-bond Acceptor     -acc             range    The number of hydrogen bond acceptors in the molecule.
H-bond Donor        -don             range    The number of hydrogen bond donors in the molecule.
log P(o/w)          -logp            range    The log of the octanol/water partition coefficient [Wildman 1999].
log S               -logs            range    The log of the water solubility coefficient [Hou 2004].
Molar Refractivity  -mr              range    The molar refractivity [Wildman 1999].
Molecular Weight    -mw              range    The average molecular weight of the molecule.
Racemic Atoms       -racemic         range    The number of chiral centers without stereochemistry specification.
Rotatable Bonds     -rotatable       range    The number of rotatable non-ring bonds.
Polar Surface Area  -tpsa            range    The topological polar surface area [Ertl 2000].
Substructure Count  -smarts pattern  range    The number of occurrences of a given SMARTS substructure.

Additionally, specific ranges of entries in the input SD stream may be selected; for example, records 10,000 through 20,000 may be extracted. Optionally, the reason for rejecting a particular structure will be recorded in an SD file data field.

Examples of sdfilter. A common use of sdfilter is to remove compounds from an SD file that are not leadlike, are unsuitable for high-throughput screening, or are biologically undesirable. The following sdfilter command applies a reasonable leadlike filter to the SD file catalog.sdf and produces the SD file catalog_leadlike.sdf:

$ sdfilter @leadfilter catalog.sdf -o catalog_leadlike.sdf 

where the file leadfilter contains rules (the sdfilter options) to exclude compounds that don't pass Oprea's leadlike test, contain reactive groups, too many chiral centers, a ring of size 9 or more, too many racemic centers, or contain 3 or more >CF2 groups in a chain, 5 or more >CH2 groups in a chain, sulfates, nitrates, 7 or more fluorines, 4 or more halogens (other than fluorine), or 2 or more nitro groups:

-leadlike
-nonreactive
-smallring
-chiral 4-
-racemic 3-
-smarts [CX4Q4](F)(F)[CX4Q4](F)(F)[CX4Q4](F)(F) 0
-smarts [CX4Q2]!@[CX4Q2]!@[CX4Q2]!@[CX4Q2]!@[CX4Q2] 0
-smarts [OX2Q2][SX4]([OX1])([OX1])O 0
-smarts [OX2Q2][N+Q3](=[OX1])[OX1] 0
-smarts F 6-
-smarts [#G7!F] 3-

Note that the range specifier X- means "X or fewer".
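The range notation is compact enough that a short Python sketch may help in reading filter files such as the one above. Two points are assumptions rather than documented behavior: the endpoints are treated as inclusive, and a bare number (as in the "0" used with -smarts above) is read as "exactly that value". The function names are hypothetical.

```python
def parse_range(spec):
    """Return (low, high) bounds for 'X,Y', 'X+', 'X-' or a bare 'X'."""
    spec = spec.strip()
    if "," in spec:
        low, high = spec.split(",")
        return float(low), float(high)
    if spec.endswith("+"):
        return float(spec[:-1]), float("inf")    # X or more
    if spec.endswith("-"):
        return float("-inf"), float(spec[:-1])   # X or fewer
    return float(spec), float(spec)              # assumption: exactly X


def in_range(value, spec):
    """True if value lies within the (inclusive) range described by spec."""
    low, high = parse_range(spec)
    return low <= value <= high
```

Read this way, "-chiral 4-" accepts molecules with at most four chiral centers, and "-smarts F 6-" accepts at most six fluorine atoms, so a molecule with seven is rejected.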

Of course, sdfilter can be combined with sdwash using Unix-style pipes to remove salts and enumerate protonation states:

$ sdwash catalog.sdf -strict -ylide -salts @$MOE/lib/sdabbrev.txt     \
        | sdfilter @leadfilter                                        \
        | sdwash -component -protomers -acidbase -o titrated_leads.sdf

which produces a titrated_leads.sdf file suitable for computational studies such as docking or conformational analysis. Protomer enumeration is particularly important for calculations based on molecules which have ambiguous protonation states which cannot be assigned with confidence given that the binding mode is unknown; for example piperazine functional groups:

typically have two viable protonation sites:

For this molecule, the neutral form, and both the singly protonated forms, are produced, while the 1,4 diprotonated form is not considered. Enumeration of multiple tautomers is also important for many calculations, since different pharmacophoric features can be matched depending on the exact location of mobile hydrogens. For example, uracil produces six topologically distinct neutral tautomers, all of which have quite different hydrogen donor/acceptor patterns:

The -acidbase option eliminates neutral forms of strong acids and bases so that, for example, neutral primary amines will be suppressed from the output.

sdsort - Duplicate Compound Detection and Elimination

Very often, one or more SD files will contain duplicate compounds that, for one reason or another, need to be removed. There is little need to waste disk space or computer cycles on duplicate compounds. However, the notion of "duplicate" is non-trivial. For example, to a biological receptor, sodium benzoate and benzoic acid may look the same (binding phenylcarboxylate, say); however, the two substances are different and may produce different assay results. In such a case, the two substances should be treated as "different". On the other hand, in a computational docking study, their solvated major component (phenylcarboxylate) will produce the same computationally derived docked poses. In this latter case, the two substances should be treated as "identical". Similar reasoning may apply to tautomeric forms of two compounds; that is, an SD file catalog may contain the same compound but in two tautomeric forms - there is little point in purchasing both substances from the supplier.

The sdsort program is designed to detect and possibly remove duplicate compounds under a variety of situations and definitions of "duplicate". Such detections and removals are conducted using large-scale sorting techniques - hence the name sdsort. Essentially, the sdsort program sorts the contents of a concatenated sequence of SD files (possibly removing duplicate records and merging the related per-molecule data fields) and writes the results to an output SD file. SD file records not written to the output may optionally be written to a "filter" SD file for further inspection. The fundamental operation structure of sdsort is:

  1. Read and concatenate all specified input files (including very large files).
  2. Order the contents of the concatenated files (usually by the σ skeleton — the chemical graph).
  3. Write output records (possibly removing duplicates and merging data fields).

By default, sdsort sorts the concatenated input SD files by the σ skeleton. This order is not sensitive to the methods of molecule comparison so that, for example, the same ordered file can be used to remove exact SMILES duplicates or tautomer duplicates. For example,

$ sdsort catalog.sdf -o catalog_sorted.sdf 

will sort the SD file catalog.sdf into a σ skeleton ordered output file catalog_sorted.sdf. A given file may be checked for sorted status with a command line option:

$ sdsort -check catalog.sdf catalog_sorted.sdf 
unordered catalog.sdf
ordered catalog_sorted.sdf

Two ordered files can be merged efficiently with the -merge option

$ sdsort -merge catalog1_sorted.sdf catalog2_sorted.sdf -o catalog.sdf

which is equivalent to

$ sdsort catalog1_sorted.sdf catalog2_sorted.sdf -o catalog.sdf

but much faster since the input files have already been sorted. Typically, each catalog from a vendor would be maintained in a separately sorted file (since catalog updates can be handled easily). A collection of σ-skeleton ordered files can be used with sdsort -merge to efficiently produce non-duplicate collections under various definitions of duplicate.
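The reason merging is cheaper than re-sorting can be illustrated with Python's standard heapq.merge, which combines inputs that are already ordered in a single linear pass. In this sketch each record is a (sort key, identifier) pair; the short strings standing in for the σ-skeleton sort key are invented for the example and do not reflect how sdsort actually encodes the chemical graph.

```python
import heapq


def merge_sorted(*sorted_inputs, key):
    """Merge inputs that are each already ordered by `key`."""
    return list(heapq.merge(*sorted_inputs, key=key))


# Hypothetical records: (stand-in sort key, vendor identifier).
catalog1 = [("C6", "v1-001"), ("C6N", "v1-007"), ("C7O2", "v1-003")]
catalog2 = [("C5N", "v2-010"), ("C6N", "v2-002")]


def by_key(record):
    return record[0]


merged = merge_sorted(catalog1, catalog2, key=by_key)
resorted = sorted(catalog1 + catalog2, key=by_key)
```

Both routes give the same ordering, but the merge never compares records within the same input against each other again, which is why keeping each vendor catalog in its own sorted file pays off when collections are rebuilt often.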

The -unique option of sdsort is used to eliminate duplicates. Molecules can be compared in one of several ways, specified with the -molcmp option:

Option Description
-molcmp smiles The smiles method compares molecules by comparing the canonical (unique) SMILES [Weininger 1988] string calculated for each molecule. This method is a kind of "exact" comparison and is sensitive to bond order, formal charge, resonance, ionization and tautomer form of each molecule - all resonance forms of methyl imidazolium and each tautomer of methyl imidazole will be considered different.
-molcmp zqh The zqh method compares molecules by comparing their ZQH labeled chemical graphs: each molecule is converted to a graph with atom labels consisting of the atomic number, number of heavy neighbors and number of (possibly implicit) attached hydrogen atoms. The molecules are considered identical if the graphs are identical (excepting atom order). This method is a kind of "exact" comparison that is not sensitive to bond order, formal charge or resonance, but is sensitive to tautomer and protonation state. The resonance forms of methyl imidazolium will be considered the same, but the different tautomer forms of methyl imidazole will be considered different.
-molcmp tautomer The tautomer method compares molecules by checking to see if they have a tautomer in common (using the sdwash methodology). If two molecules have a tautomer in common then the two molecules are considered to be the same. This method is not sensitive to tautomer or resonance form but is sensitive to total charge. Thus, the tautomer method is good for neutral substance comparison. The different tautomers of methyl imidazole are considered the same but different from methyl imidazolium.
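The zqh comparison can be illustrated with a small Python sketch. The input format is hypothetical: a molecule is a list of (atomic number, attached hydrogens) pairs for the heavy atoms plus a list of heavy-atom bonds, with no bond orders or charges at all. The graph matching is done by brute force over atom permutations, which is only practical for tiny molecules and is certainly not how a production tool would do it.

```python
from itertools import permutations


def zqh_labels(atoms, bonds):
    """Label each heavy atom as (Z, heavy-neighbor count, hydrogen count)."""
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return [(z, degree[k], h) for k, (z, h) in enumerate(atoms)]


def same_zqh_graph(mol_a, mol_b):
    """True if the two ZQH-labeled graphs match up to atom order."""
    (atoms_a, bonds_a), (atoms_b, bonds_b) = mol_a, mol_b
    labels_a = zqh_labels(atoms_a, bonds_a)
    labels_b = zqh_labels(atoms_b, bonds_b)
    if sorted(labels_a) != sorted(labels_b) or len(bonds_a) != len(bonds_b):
        return False
    edges_b = {frozenset(bond) for bond in bonds_b}
    for perm in permutations(range(len(atoms_b))):
        # The mapping must preserve labels and carry every bond onto a bond.
        if any(labels_a[i] != labels_b[perm[i]] for i in range(len(atoms_a))):
            continue
        if all(frozenset((perm[i], perm[j])) in edges_b for i, j in bonds_a):
            return True
    return False


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (3, 5)]
# 4-methyl-1H-imidazole: atom 0 is N-H, atom 3 carries the methyl (atom 5).
mol_4me = ([(7, 1), (6, 1), (7, 0), (6, 0), (6, 1), (6, 3)], ring)
# 5-methyl-1H-imidazole: the hydrogen sits on the other nitrogen (atom 2).
mol_5me = ([(7, 0), (6, 1), (7, 1), (6, 0), (6, 1), (6, 3)], ring)
# The first molecule again, with its atoms listed in reverse order.
mol_4me_reordered = (mol_4me[0][::-1], [(5 - i, 5 - j) for i, j in ring])
```

Since bond orders and formal charges never enter the labels, two resonance drawings of the same cation reduce to the same input here, whereas the two methyl imidazole tautomers differ in which nitrogen carries the hydrogen and are therefore reported as different.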

In each of the methods, stereochemistry as annotated in the SD file is taken into account. Thus, the command

$ sdsort -unique -molcmp smiles catalog1.sdf catalog2.sdf -o catalog_usmi.sdf

would produce the file catalog_usmi.sdf consisting of molecules no two of which have the same SMILES string. The duplicate compounds not written to the output catalog_usmi.sdf can be stored in a "filter" file specified with the -f option:

$ sdsort -unique -molcmp smiles catalog1.sdf catalog2.sdf             \
        -o catalog_usmi.sdf -f catalog_dsmi.sdf

which will save the removed compounds in the file catalog_dsmi.sdf. In general the first copy of a molecule is the one retained and sent to the output file; thus, preferred vendors would be listed first in the list of input SD files. (When duplicate removal is enabled, sdsort can concatenate SD data fields or retain only the first-encountered set of data fields.)
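The "first copy wins" policy can be shown with a minimal Python sketch. It assumes each record already carries a precomputed comparison key (the SMILES strings below are simply taken to be canonical), it ignores the sorting step, and the function name is hypothetical.

```python
def unique_first(records, key):
    """Keep the first record seen for each key; later copies are set aside."""
    seen = set()
    kept, duplicates = [], []
    for record in records:
        k = key(record)
        if k in seen:
            duplicates.append(record)
        else:
            seen.add(k)
            kept.append(record)
    return kept, duplicates


# The preferred vendor is listed first, so its copy is the one retained.
preferred = [{"smiles": "OC(=O)c1ccccc1", "vendor": "preferred"}]
other = [{"smiles": "OC(=O)c1ccccc1", "vendor": "other"},
         {"smiles": "CCO", "vendor": "other"}]

kept, duplicates = unique_first(preferred + other, key=lambda r: r["smiles"])
```

Swapping the order of the two inputs would retain the other vendor's copy instead, which is why the order of the input files matters.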

Comparison of unique SMILES strings does not, however, recognize as duplicates chemical species which are tautomers of each other; such species are equivalent in solution and not worth purchasing twice. Also, some stereocenters interconvert in solution at room temperature, for example chiral centers adjacent to a site of tautomerisation, and many C=N bonds. These duplicates can be unmasked by using the tautomer comparison mode:

$ sdsort -unique -molcmp tautomer catalog1.sdf catalog2.sdf           \
        -o catalog_unique.sdf -f catalog_dups.sdf

In this example, molecules which have no tautomeric or dubious stereochemical equivalents are written to catalog_unique.sdf, while molecules which are found to belong to a set of size greater than one are written to catalog_dups.sdf, the "filter file" specified with -f, for subsequent manual inspection.
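
The partitioning described in this example, where every member of a duplicate set is diverted, can be sketched in Python as below. The function name split_by_group_size and the placeholder keys "T1" and "T2" are hypothetical; a real tautomer-invariant key would have to come from a tautomer enumeration such as the one in sdwash, which is not reproduced here.

```python
from collections import defaultdict


def split_by_group_size(records, key):
    """Return (unique, dups): records whose key occurs once vs. more than once."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    unique, dups = [], []
    for members in groups.values():
        # a set of size one is unique; larger sets are diverted as a whole
        (unique if len(members) == 1 else dups).extend(members)
    return unique, dups


# (name, precomputed tautomer key) pairs; the keys are made-up placeholders
records = [("4-methylimidazole", "T1"),
           ("methylimidazolium", "T2"),
           ("5-methylimidazole", "T1")]

unique, dups = split_by_group_size(records, key=lambda r: r[1])
print([name for name, _ in unique])   # ['methylimidazolium']
print([name for name, _ in dups])     # ['4-methylimidazole', '5-methylimidazole']
```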

Sorting of large SD files is efficient, and not constrained by memory. The source data is divided up into file-based chunks prior to the final ordering and emitting to the output stream. Databases containing tens of thousands of records can be sorted in a matter of minutes, using contemporary hardware. The overhead required to remove duplicate sets is generally very low for databases which do not contain a large number of near-duplicates.
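
The chunked, file-based strategy described above is the classic external merge sort. The following Python sketch shows the general technique under simplifying assumptions: each record is a single-line string, the chunk size is a hypothetical parameter, and nothing here is taken from the actual sdsort code.

```python
import heapq
import os
import tempfile


def external_sort(records, chunk_size=1000):
    """Yield single-line string records in sorted order using bounded memory."""
    chunk_paths = []
    chunk = []

    def flush():
        # sort one chunk in memory and park it in a temporary file
        chunk.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as fh:
            fh.writelines(rec + "\n" for rec in chunk)
        chunk_paths.append(path)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(path) for path in chunk_paths]
    try:
        # strip newlines before comparing so the merge order matches the chunk order
        streams = [(line.rstrip("\n") for line in fh) for fh in files]
        # k-way merge: only one pending record per chunk is held in memory
        yield from heapq.merge(*streams)
    finally:
        for fh in files:
            fh.close()
        for path in chunk_paths:
            os.remove(path)


names = ["pear", "apple", "fig", "kiwi", "date"]
print(list(external_sort(names, chunk_size=2)))
# ['apple', 'date', 'fig', 'kiwi', 'pear']
```

Memory use is governed by the chunk size and the number of chunks rather than by the size of the input, which is what allows the sort to scale to large files.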

Example. In the following example we use sdwash and sdsort with Unix-style pipes to detect SMILES and tautomer duplicates in a compound catalog.

$ sdwash catalog.sdf -strict -ylide -salts @$MOE/lib/sdabbrev.txt     \
    | sdsort -unique -molcmp smiles -f catalog_smidups.sdf            \
    | sdsort -merge -unique -molcmp tautomer -f catalog_tautdups.sdf  \
    > catalog_unique.sdf 

This command will read the catalog SD file catalog.sdf, normalize ylide notation, disconnect salts, replace group abbreviations using the $MOE/lib/sdabbrev.txt file and discard structures with unrecognized abbreviations and other anomalies. The output of sdwash is piped directly into the first sdsort step which will sort by σ skeletons, remove SMILES duplicates and write these duplicates to catalog_smidups.sdf. The output of the first sdsort is then piped directly into the second sdsort, which is given the -merge option (since the file is already sorted by the σ skeletons); here tautomer duplicates are removed and written to the file catalog_tautdups.sdf. The output of the second sdsort is then redirected to the catalog_unique.sdf file, which is the formal output of unique compounds.

Executing the above sequence of catenated pipes on the Maybridge High Throughput Screening database (October 2007) took slightly over 7 minutes on a 2 GHz AMD x86 computer. The sdwash read 56,842 records, of which 21 were filtered due to unrecognized abbreviations. The first sdsort step read 56,821 records, of which 78 were combined with other records on account of having identical unique SMILES strings. The second sdsort step read 56,743 records, of which 114 were diverted to the filter file on account of having duplicates, as detected by the tautomer comparison. Each of these duplicates occurred in pairs. 52 of these duplicate pairs were found to be tautomers of one another, and 5 were due to questionably different stereochemistry. Several examples are:

[Figure: example duplicate pairs, in two panels labeled "Tautomers" and "Stereochemistry".]

In each of the above cases, a SMILES comparison failed to detect the duplication since SMILES is sensitive to tautomeric state and stereochemical annotations near tautomeric sites.
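
The record counts reported for the Maybridge run can be checked with a few lines of Python arithmetic. The input counts are taken from the text; the final count of unique records is not stated in the text and is derived here on the assumption that all 114 filtered records were removed from the output.

```python
read_by_sdwash = 56842        # records read by sdwash
rejected_by_sdwash = 21       # unrecognized abbreviations
smiles_duplicates = 78        # combined in the first sdsort step
tautomer_filtered = 114       # diverted in the second sdsort step

into_first_sdsort = read_by_sdwash - rejected_by_sdwash
into_second_sdsort = into_first_sdsort - smiles_duplicates
duplicate_pairs = tautomer_filtered // 2

# Derived value, assuming every filtered record is absent from the output
unique_output = into_second_sdsort - tautomer_filtered

print(into_first_sdsort)    # 56821
print(into_second_sdsort)   # 56743
print(duplicate_pairs)      # 57, i.e. 52 tautomer pairs + 5 stereochemistry pairs
print(unique_output)        # 56629
```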

sddesc - Molecular Property Calculations

Molecular property values can be important for decision making regarding chemical compounds [Lipinski 1997], in QSAR/QSPR studies and in compound database systems which store "standard" properties directly in the database. In SD files, each molecule may contain additional data fields which can be used to hold molecular properties. The purpose of the sddesc program is to calculate molecular properties (typically derived from the molecular graph) and store them in SD file data fields. The operational structure of sddesc is

  1. Read and concatenate all specified input files.
  2. Calculate the desired molecular descriptors.
  3. Write records augmented with calculated descriptors to the output.

The sddesc program can calculate any of the descriptors in MOE; a complete list is beyond the scope of this document. However, the following table summarizes a number of the properties that can be calculated with sddesc:

Symbol                       Description
lip_violation, lip_druglike  The Lipinski rule of five violation count and drug-like assessment [Lipinski 1997].
opr_violation, opr_leadlike  The Oprea rules violation count and lead-like assessment [Oprea 2000].
reactive                     A flag indicating if the molecule is reactive according to predefined rules.
BCUT_*, GCUT_*               The Burden number type descriptors derived from properties and graph eigenvalues.
apol                         The sum of the atomic polarizabilities.
bpol                         The sum of the absolute difference of atomic polarizabilities on each bond.
Weight                       The molecular weight.
density                      The molecular mass density (molecular weight / van der Waals volume).
Fcharge                      The total charge of the molecule (sum of formal charges).
SMR, mr                      The molar refractivity [Wildman 1999].
SlogP, logP(o/w)             The log of the octanol/water partition coefficient [Wildman 1999].
logS                         The log of aqueous solubility (mol/L) [Hou 2004].
TPSA                         The topological polar surface area [Ertl 2000].
vdw_vol                      The van der Waals volume.
vdw_area                     The van der Waals surface area.
SlogP_VSA*, SMR_VSA*         The VSA descriptors [Labute 2000] derived from subdivisions of property mapped surface areas.
a_n*, a_count                Various element and atom counts.
a_aro, b_ar                  Number of aromatic atoms or bonds.
b_1rotN, b_count             Rotatable bond and bond counts.
nmol                         The number of connected (molecular) components.
rings                        The number of rings in the molecule.
chi*, Kier*                  The Kier and Hall topological descriptors [Kier 1991].
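
As an illustration of how a rule-based property such as a violation count is derived from simpler descriptors, the following Python sketch counts violations of the commonly quoted form of the Lipinski rule of five [Lipinski 1997]. The function name, the argument names and the example donor and acceptor counts are hypothetical, and the exact definitions used by MOE for lip_violation (for instance, how donors and acceptors are counted) may differ.

```python
def lipinski_violations(weight, logp, h_donors, h_acceptors):
    """Count violations of the commonly quoted rule of five thresholds."""
    return sum([
        weight > 500,       # molecular weight above 500
        logp > 5,           # octanol/water logP above 5
        h_donors > 5,       # more than 5 hydrogen bond donors
        h_acceptors > 10,   # more than 10 hydrogen bond acceptors
    ])


# Made-up property values for two hypothetical compounds
print(lipinski_violations(weight=490.58, logp=0.0438, h_donors=2, h_acceptors=7))   # 0
print(lipinski_violations(weight=612.3, logp=6.1, h_donors=1, h_acceptors=4))       # 2
```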

Properties that depend on 3D coordinates require SD files containing 3D data. For such files, sddesc can calculate molecular 3D volumes and surface areas, as well as quantum mechanical properties such as HOMO, LUMO and Ionization Potential using semi-empirical Hamiltonians such as AM1 and PM3.

The properties to calculate are specified to sddesc by their symbol names (in the above table). For example,

$ sddesc catalog.sdf -calc Weight,SlogP,logS -o catalog_prop.sdf
will add SD data fields containing molecular weight, the Wildman logP and Hou logS values to each structure:
...
M  END
...

>  <mw>
490.58

>  <SlogP>
0.0438

>  <logS>
-4.13

$$$$
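
The read, calculate and write cycle that produces records like the one above can be sketched in Python. This is an assumption-laden illustration, not the sddesc implementation: the function names, the field name n_atoms_in_block and the toy "descriptor" (the atom count read from the V2000 counts line) are hypothetical stand-ins for real descriptor calculations.

```python
def atoms_in_block(molblock):
    """Toy descriptor: atom count from the first 3 columns of the counts line."""
    return int(molblock.split("\n")[3][:3])


def add_data_fields(molblock, calculators):
    """Return one SD record: the molblock, one data item per property, then $$$$."""
    lines = molblock.rstrip("\n").split("\n")
    for name, calc in calculators.items():
        # each data item is a header line, a value line and a blank line
        lines += [f">  <{name}>", str(calc(molblock)), ""]
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


# A minimal one-atom molblock (methane with implicit hydrogens)
methane = "\n".join([
    "methane",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])

record = add_data_fields(methane, {"n_atoms_in_block": atoms_in_block})
print(record)
```

Because the molblock is passed through unchanged and only data items are appended, records annotated this way remain readable by any other tool that consumes SD files.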

The sddesc program is typically used after sdwash has prepared the structures, although some properties (such as logP on acids and bases) should be evaluated on neutral structures.

Summary

We have described four programs in the Molecular Operating Environment that are designed specifically to operate on MDL's Structure-Data files (SD files): sdwash, sdfilter, sdsort and sddesc.

Each of the programs is written in the SVL programming language and functions as a Unix-style pipeline tool. Consequently, these programs can be integrated in pipelining environments such as SciTegic's PipelinePilot®, Inforsense's KDE™, KNIME, and Microsoft Windows® Explorer. The SD tools are capable of handling very large collections efficiently and use sophisticated chemical interpretation algorithms to effect their calculations. This makes them suitable for both chemical information systems and preparation for molecular modeling calculations.

References

[Clark 2006] Clark, A.M., Labute, P., Santavy, M.; 2D Structure Depiction; J. Chem. Inf. Model. 46 (2006) 1107-1123; see also http://www.chemcomp.com/journal/depictor.htm.
[Wildman 1999] Wildman, S.A., Crippen, G.M.; Prediction of Physicochemical Parameters by Atomic Contributions; J. Chem. Inf. Comput. Sci. 39 (1999) 868-873.
[Ertl 2000] Ertl, P., Rohde, B., Selzer, P.; Fast Calculation of Molecular Polar Surface Area as a Sum of Fragment-Based Contributions and Its Application to the Prediction of Drug Transport Properties; J. Med. Chem. 43 (2000) 3714-3717.
[Hou 2004] Hou, T.J., Xia, K., Zhang, W., Xu, X.J.; ADME Evaluation in Drug Discovery. 4. Prediction of Aqueous Solubility Based on Atom Contribution Approach; J. Chem. Inf. Comput. Sci. 44 (2004) 266-275.
[Kier 1991] Kier, L.B., Hall, L.H.; The Molecular Connectivity Chi Indices and Kappa Shape Indices in Structure-Property Modeling; In Reviews of Computational Chemistry; Boyd, D., Lipkowitz, K., Eds.; VCH Publishers, Inc., 1991; pp 367-422.
[Lipinski 1997] Lipinski, C.A., et al.; Experimental and Computational Approaches to Estimate Solubility and Permeability in Drug Discovery and Development Settings; Adv. Drug Deliv. Rev. 23 (1997) 3-25.
[Labute 2000] Labute, P.; A Widely Applicable Set of Descriptors; J. Mol. Graph. Mod. 18 (2000) 464-477.
[MDL 2007] MDL CTFile Format, downloadable from http://www.mdl.com/downloads/public/ctfile/ctfile.jsp.
[Oprea 2000] Oprea, T.I.; Property Distribution of Drug-Related Chemical Databases; J. Comp. Aid. Mol. Des. 14 (2000) 251-264.
[Selzer 2006] Oellien, F., Cramer, J., Beyer, C., Ihlenfeldt, W.H., Selzer, P.M.; The Impact of Tautomer Forms on Pharmacophore-Based Virtual Screening; J. Chem. Inf. Model. 46 (2006) 2342-2354.
[Weininger 1988] Weininger, D.; SMILES 1. Introduction and Encoding Rules; J. Chem. Inf. Comput. Sci. 28 (1988) 31-36.