Protein Structure and Family Data in MOE 2001.01

Ken Kelly
Chemical Computing Group Inc.

Introduction

MOE 2001.01 contains an updated version of the MOE Protein Family Database, as well as a new supplemental MOE molecular database file containing models from all of the entries in the Protein Data Bank [Berman 2000] as of January, 2001. Apart from the inclusion of new data, the latest version of MOE's protein data resources has benefited from a redeveloped PDB file format reader. The new distribution also includes a sample database file containing 170 protein-ligand complexes with their associated binding affinities.

In this article, the contents of the new family database are summarized and compared to the protein families and superfamilies in version 1.53 (July, 2000) of the SCOP [Murzin 1995], an expert-curated protein classification database. The new capabilities of MOE's PDB reader are also summarized.

The PDB Reader

The protein structure data made available by the Protein Data Bank is an essential resource. Unfortunately, taking full advantage of this data presents some difficulties. Among the more straightforward of these are the occasional simple errors in, for example, atom or residue naming or in the connectivity information. Apart from problems created by errors, there are at least two kinds of information which are either poorly represented in the PDB format, or represented but generally not extracted by most PDB readers: 1) proper chemical type information for chemical groups other than amino acids and nucleic acids, and 2) information relating the underlying biological sequence of modeled proteins to the atomic coordinate data. MOE's PDB reader has been enhanced to address both of these issues.

  • Chemical Type Information. Along with implicit type information for standard amino acid and nucleic acid residues, the PDB relies on external dictionaries to represent the chemical information required to properly assign atom hybridization and bond orders to the various non-standard or "hetero" groups. For increased flexibility and consistency in dealing with novel chemical groups, MOE's PDB reader now applies a modified version of the methodology described in [Meng 1991] to automatically assign connectivity, hybridization, ionization and LP hint attributes to all non-standard molecular data. In general, the connectivity assignments are superior to the connectivity records in PDB files, while the chemical type information produces reasonable results (protonation state and tautomer state are problematic, however).

  • Biological Sequence Information. For the purposes of sequence and structure classification, it is important to establish the correct mapping between the amino acid residues for which atom coordinates are given and the underlying biological sequence of the protein presented in the SEQRES records. This is particularly important when there are interior gaps in a model, where atom positions could not be reliably deduced from the experimental data. Unfortunately, calculating the correct mapping is not always easy. Apart from errors in the original files, the most common complicating factors in properly aligning the biological sequence to the residue sequence corresponding to atomic data are the presence of non-standard amino acids in a model and the various ways in which information on these residues is presented in PDB files. The new PDB reader in MOE 2001.01 has an option to automatically map sequence information from the SEQRES records in PDB files to the corresponding atomic data, using built-in alignment functionality. Modified and non-standard residue names are translated to their corresponding standard names, and empty residues are created to properly represent gaps in the model or an unmodeled signal peptide region.
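
The SEQRES mapping described in the second item can be illustrated with a minimal Python sketch. It is not MOE's built-in alignment functionality: the standard library's difflib matcher is used as a stand-in, and the translation table and the function names (standardize, map_seqres_to_model) are hypothetical. SEQRES positions with no modeled counterpart map to None, which plays the role of the empty residues created for gaps.

```python
from difflib import SequenceMatcher

# Hypothetical translation table: a few modified residues and their parents.
MODIFIED_TO_STANDARD = {"MSE": "MET", "SEP": "SER", "TPO": "THR", "PTR": "TYR"}


def standardize(names):
    """Translate modified residue names to their standard parent names."""
    return [MODIFIED_TO_STANDARD.get(name, name) for name in names]


def map_seqres_to_model(seqres, modeled):
    """Map each SEQRES position to an index in the modeled residues.

    Returns a list with one entry per SEQRES position: the index of the
    matching modeled residue, or None where no coordinates exist.
    """
    a = standardize(seqres)
    b = standardize(modeled)
    # Stand-in for a real sequence alignment (assumption, not MOE's method).
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    mapping = [None] * len(a)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping


seqres = ["MET", "ALA", "SER", "GLY", "LYS", "THR", "LEU", "VAL"]
modeled = ["ALA", "SEP", "GLY", "LEU", "VAL"]  # N-terminal MET and LYS-THR unmodeled
print(map_seqres_to_model(seqres, modeled))
# [None, 0, 1, 2, None, None, 3, 4]
```

In this toy case the phosphoserine (SEP) is matched to SER in the biological sequence, and the unmodeled N-terminal residue and interior gap are reported as None.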

These two key capabilities allow MOE to successfully read all 14,000+ structures in the Protein Data Bank. As a result, more protein structures were available for the clustering procedure to deduce structural families.
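
For the chemical type assignment, the connectivity step can be sketched with a simple distance criterion in the spirit of [Meng 1991]: two heavy atoms are taken as bonded when their separation does not exceed the sum of their covalent radii plus a tolerance. The radii and the 0.4 angstrom tolerance below are approximate, assumed values, and the function name perceive_bonds is hypothetical. The modifications made in MOE, and the subsequent hybridization and ionization assignment, are not reproduced here.

```python
import math
from itertools import combinations

# Approximate covalent radii in angstroms (illustrative values only).
COVALENT_RADIUS = {"C": 0.77, "N": 0.75, "O": 0.73, "S": 1.02}
TOLERANCE = 0.4  # assumed bonding tolerance in angstroms


def perceive_bonds(atoms):
    """Return index pairs (i, j) of atoms judged to be bonded.

    atoms is a list of (element, (x, y, z)) tuples for heavy atoms.
    """
    bonds = []
    for (i, (elem_i, pos_i)), (j, (elem_j, pos_j)) in combinations(
        enumerate(atoms), 2
    ):
        cutoff = COVALENT_RADIUS[elem_i] + COVALENT_RADIUS[elem_j] + TOLERANCE
        if math.dist(pos_i, pos_j) <= cutoff:
            bonds.append((i, j))
    return bonds


# Heavy atoms of an acetate-like group with idealized coordinates.
acetate = [
    ("C", (0.0, 0.0, 0.0)),
    ("C", (1.52, 0.0, 0.0)),
    ("O", (2.1, 1.1, 0.0)),
    ("O", (2.1, -1.1, 0.0)),
]
print(perceive_bonds(acetate))
# [(0, 1), (1, 2), (1, 3)]
```

Because the bonds are derived from geometry alone, the result does not depend on the presence or correctness of connectivity records in the input file.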

The Family Database

MOE's Protein Family Database contains a set of alignments of a non-redundant set of protein structures from the Protein Data Bank and protein sequences from the PIR (v67.0) database [Barker 2001], the NCBI reference sequence project (RefSeq) [Pruitt 2000] and public domain sequences from the SWISS-PROT database [Bairoch 1996]. The protein family database is created and maintained using an entirely automated clustering protocol implemented in SVL, MOE's built-in high-level programming language. The clustering algorithm, described in a previous JCCG article, is summarized here, with changes to the original procedure noted.

The core of the protein clustering algorithm (depicted above) is an iterative cycle, in which all clusters at a given stage of the procedure are submitted to all-against-all group-to-group sequence comparisons, subject to restrictions on the degree to which the lengths of two alignments may differ. Potential relationships between clusters - determined by Z-score criteria - are then used to hypothesize a new set of clusters. At the cluster validation stage, any newly hypothesized families are submitted to multiple-sequence, structure-based alignment. New clusters are accepted if their resulting alignments meet a global RMSD threshold. The iterative stage terminates when no new clusters are proposed.
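
The control flow of this cycle can be sketched in Python. This is an illustration only, not the SVL implementation: the scoring and validation steps are passed in as hypothetical callables (length_of, z_score, validate), clusters are represented as frozensets of member identifiers, and the default thresholds are assumed values. The sketch stops when a pass accepts no new cluster, which also covers the case where none is proposed.

```python
def iterate_clusters(clusters, length_of, z_score, validate,
                     z_min=6.0, max_length_ratio=1.5):
    """Merge clusters until a full pass accepts no new cluster.

    clusters         -- iterable of frozensets of member identifiers
    length_of        -- callable giving the alignment length of a cluster
    z_score          -- callable scoring a group-to-group comparison
    validate         -- callable, True if the merged cluster's structural
                        alignment meets the RMSD threshold
    z_min, max_length_ratio -- assumed thresholds, not values from MOE
    """
    clusters = list(clusters)
    while True:
        # All-against-all comparison, restricted by alignment length.
        proposals = []
        for i, a in enumerate(clusters):
            for b in clusters[i + 1:]:
                longer = max(length_of(a), length_of(b))
                shorter = min(length_of(a), length_of(b))
                if longer > max_length_ratio * shorter:
                    continue
                if z_score(a, b) >= z_min:
                    proposals.append((a, b))

        # Cluster validation: accept a merge only if it passes validation
        # and both partners are still unmerged in this pass.
        accepted = False
        for a, b in proposals:
            if a in clusters and b in clusters:
                candidate = a | b
                if validate(candidate):
                    clusters.remove(a)
                    clusters.remove(b)
                    clusters.append(candidate)
                    accepted = True

        if not accepted:
            return clusters
```

Passing the comparison and validation steps as callables keeps the loop independent of how Z-scores and structural alignments are actually computed.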

The most significant changes made to the overall procedure in producing the new family database for MOE 2001.01 were in the pre-iterative stage of the procedure. In earlier versions of the database, NMR data were excluded, as were any crystal structures with unmodeled gaps in the backbone. In the new version, NMR models were not excluded, and small gaps (less than six residues) in the backbone model were permitted. As before, diffraction-based models solved at a resolution poorer than three angstroms were excluded. Finally, the culling criteria were relaxed to allow near-identical sequences with substantially different structures in the non-redundant PDB subset.

The input data to the clustering procedure consisted of 4,544 chains in the non-redundant PDB subset, 29,570 sequences from RefSeq, 20,151 from the fully-classified portion of the PIR, and 50,398 sequences from SWISS-PROT. Note that the sequence-only data was admitted into the family database using quite conservative criteria in order to guarantee reliable alignments in the absence of structural confirmation, and to allow a reliable assignment of putative secondary structure.

In the 2001.01 family database, there are 664 multiple structure alignments with a total of 2,884 protein structures. There are 1,061 structures in single-structure families: 781 of these are aligned to 2,029 sequences from the sequence-only databases. (Note that near-identical sequences from these alignments were pruned from the final database.) Only 280 protein structures are without any aligned data at all. The associated MOE molecular database file includes some 500 structures which are clear homologs of other structures in the family database, but which were judged by RMSD criteria to have different structures. MOE's total of 1,725 families (based on 14,288 PDB entries) is comparable to the SCOP v1.53 (based on 11,410 PDB entries), which defines 1,368 families distributed into 859 superfamilies. As the SCOP database is (primarily) a domain-oriented database, a strict comparison of these numbers is not completely meaningful, but their rough equivalence is encouraging.
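
The counts quoted in this paragraph can be cross-checked with a few lines of Python. The families-per-entry ratio at the end is an illustrative normalization introduced here, not a figure published for either database.

```python
# Counts quoted in the text for the 2001.01 family database.
multi_structure_families = 664
single_with_sequences = 781      # single-structure families aligned to sequences
single_without_alignment = 280   # single structures with no aligned data

single_structure_families = single_with_sequences + single_without_alignment
total_families = multi_structure_families + single_structure_families
print(single_structure_families, total_families)
# 1061 1725

# Illustrative normalization: families per PDB entry considered.
moe_per_entry = total_families / 14288
scop_per_entry = 1368 / 11410
print(f"{moe_per_entry:.3f} {scop_per_entry:.3f}")
# 0.121 0.120
```

On this crude measure the two classifications yield about twelve families per hundred PDB entries, consistent with the rough equivalence noted above.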

Family Examples

As the MOE family database is not a domain database, the clustering results from MOE's automated procedure cannot be systematically compared to the entire classification from the domain-oriented SCOP. Also, structural alignment as a clustering criterion can occasionally result in an artificial separation of proteins, such as some calcium binding proteins, which undergo large motions upon binding. Nevertheless, the output from MOE's clustering algorithm did agree with the SCOP definitions, where comparable, at the SCOP family level. More importantly, there are many instances where the clusters in MOE's database span multiple families from a SCOP superfamily. Two such examples, the protein kinase superfamily and the thioredoxins, are presented here. The two other examples illustrate the advantages of not restricting the clusters to predefined domains. The first of these is the transcarbamoylase family, where the MOE clustering reveals a structural detail which is obscured by fine-grained expert domain assignment in the SCOP. The final example is the multi-domain family of NADP(H):ferredoxin reductases. This family of proteins is not represented as such in the SCOP database; instead, each of the two domains has a separate family entry. Multi-domain structure-based alignments such as this one are, however, valuable targets in homology searching and modeling applications.

Proteins of the serine/threonine and tyrosine kinase superfamily catalyze protein phosphorylation, and participate in the regulation of many metabolic and gene expression pathways. The catalytic subunits of all of these enzymes are clustered into a single structural alignment in the MOE family database. The overall mainchain RMSD of the globally optimal superposition is less than three angstroms. The left-hand image below shows the conserved core of the entire superfamily. The right-hand image shows part of the core region of the human lymphocyte kinase, a tyrosine kinase (PDB entry 3LCK), and displays the hydrogen bond contacts between ASP_422 and the backbone amides of HIS_362 and ARG_363. These contacts act to stabilize the catalytic loop and properly orient ASP_166 towards the phosphorylation site. These contacts are entirely conserved throughout the superfamily.

The thioredoxins are small proteins that demonstrate a wide range of redox activity in plants and animals, all involving their two redox reversible sulfhydryl groups. Among the biological functions in which thioredoxins participate are i) serving as cofactors in ribonucleotide reduction, ii) promoting disulfide bond formation, and iii) assisting in the regulation of photosynthesis. Given the variety of functions performed, and some low sequence identities between certain members (less than ten percent, in some cases), the SCOP database segregates thioredoxins into eight families. MOE's clustering procedure places representatives of all single-domain thioredoxins in one structural alignment, with a global RMSD over the conserved core of 2.37 angstroms. The left-hand image below shows the conserved core of the entire aligned family. The right-hand image below is a ribbon representation of the conserved core of thioredoxin-2, from Anabaena (PDB entry code 1THX). The alpha-beta-alpha topology of the superfamily is clearly captured.
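
The RMSD values quoted for these families are measured over a conserved core. How MOE defines the core and the global RMSD is not detailed here, so the following Python sketch makes two assumptions: the structures are already superposed, and the core is taken to be the alignment columns in which every structure has coordinates. The function names are hypothetical.

```python
import math


def core_columns(alignment):
    """Return the alignment columns where every structure has coordinates.

    alignment is a list of rows, one per structure; each row holds an
    (x, y, z) tuple per column, or None where the structure has a gap.
    """
    n_columns = len(alignment[0])
    return [c for c in range(n_columns)
            if all(row[c] is not None for row in alignment)]


def pairwise_rmsd(row_a, row_b, columns):
    """RMSD between two superposed structures over the given columns."""
    squared = sum(math.dist(row_a[c], row_b[c]) ** 2 for c in columns)
    return math.sqrt(squared / len(columns))


# Two toy structures, already superposed; the first has a gap in column 2.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), None, (3.0, 0.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 0.0, 1.0), (3.0, 0.0, 1.0)]

core = core_columns([structure_a, structure_b])
print(core, pairwise_rmsd(structure_a, structure_b, core))
# [0, 1, 3] 1.0
```

A global figure for a whole family could then be obtained, for example, by averaging the pairwise values, although other definitions are possible.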

The transcarbamoylases catalyze the transfer of the carbamoyl moiety of carbamoyl phosphate to either ornithine or aspartate. Ornithine transcarbamoylase (OTC) is active in the urea cycle, and aspartate transcarbamoylase (ATCase) is crucial in the pyrimidine synthesis cycle. An X-linked recessive inherited defect has been identified which results in OTC deficiency and consequent hyperammonemia. The catalytic units of ATCase and OTC are clearly homologous (sequence identity is typically about twenty percent), and the global RMSD over the conserved core is less than 2.0 angstroms.

These proteins contain two domains which possess recognizably similar folds (albeit at a very low level of sequence identity, on the order of ten percent), and in the SCOP database these two domains are treated as related instances of one domain family. However, neither domain appears in any other context, and examination of the structural alignment reveals that a long helix at the N-terminus (colored blue in the image above) is in fact a part of neither domain. In addition, application of MOE's Contact Analyzer reveals a completely conserved hydrogen bond contact involving a lysine from this helix and one residue from each of the two domains. This feature, illustrated in the image below, cannot be recognized in a purely domain-based classification system.

Our final example is the NADP(H):ferredoxin reductases. The structures of these ferredoxin redox partners comprise a six-stranded anti-parallel beta sheet at the N-terminus, which binds FAD, and a five-stranded beta sheet at the C-terminus, which binds NADPH/NAD. The structural alignment in the MOE family database includes eight PDB entries, with pairwise sequence identities ranging as low as fourteen percent. One of the targets in the CASP III blind-modeling exercise was from this family and, despite a sequence identity to its best template of less than twenty percent, was modeled by MOE to an accuracy of less than three angstroms RMS within the conserved core. At the time of the competition, none of the standard sequence searching tools could unequivocally identify proper templates for both domains, while a sequence search within MOE identified this family as a homolog at an unquestionable confidence level (Z-score in excess of thirteen). Shown below is an image of the consensus conserved cores of all of the members of this family. We see that even in the multi-domain superposition, all the key structural features of both domains are preserved.
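
The Z-score quoted above expresses how far an alignment score lies from the scores expected by chance. The exact statistic used by MOE's search is not defined in this article. The Python sketch below assumes one common construction: the target sequence is shuffled repeatedly, and the real score is compared with the mean and standard deviation of the shuffled scores. The scoring scheme (match +1, mismatch -1, gap -1) and the function names are likewise assumptions made for illustration.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def shuffle_z_score(query, target, n_shuffles=200, seed=0):
    """Z-score of the real score against scores for shuffled targets."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    real = global_score(query, target)
    letters = list(target)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        background.append(global_score(query, "".join(letters)))
    spread = statistics.pstdev(background)
    if spread == 0:
        raise ValueError("shuffled scores do not vary; Z-score undefined")
    return (real - statistics.mean(background)) / spread


sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRV"
print(shuffle_z_score(sequence, sequence))  # large: a sequence against itself
```

Because shuffling preserves composition and length, a high Z-score indicates that the similarity depends on residue order rather than on amino acid content alone.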


Summary

In the MOE Protein Family Database, protein chains from the Protein Data Bank have been clustered into 1,725 families, including over 660 structure-based alignments and an additional 720 non-redundant alignments to public domain sequence data. Within the limits of legitimate comparison, these families, produced by an automatic method, generally match the SCOP family (often superfamily) classifications. Some advantages of multiple-domain structure alignments have been illustrated.

The family database is delivered with MOE 2001.01, and serves as a resource for MOE's built-in homology searching and homology modeling functions. MOE 2001.01 also includes a supplementary CD-ROM containing a molecular database file with the molecular data from all the Protein Data Bank entries as of January, 2001.


References

[Bairoch 1996] Bairoch, A., Apweiler, R. The SWISS-PROT Protein Sequence Data Bank and its New Supplement TrEMBL. Nucleic Acids Research 24, 21-25 (1996).
[Barker 2001] Barker, W.C., Garavelli, J.S., Hou, Z., Huang, H., Ledley, R.S., McGarvey, P.B., Mewes, H.W., Orcutt, B.C., Pfeiffer, F., Tsugita, A., Vinayaka, C.R., Xiao, C., Yeh, L.S., Wu, C. Protein Information Resource: A Community Resource for Expert Annotation of Protein Data. Nucleic Acids Research 29, 29-32 (2001).
[Berman 2000] Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E. The Protein Data Bank. Nucleic Acids Research 28, 235-242 (2000).
[Meng 1991] Meng, E.C., Lewis, R.A. Determination of Molecular Topology and Atomic Hybridization States from Heavy Atom Coordinates. J. Comp. Chem. 12, 891-898 (1991).
[Murzin 1995] Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C. SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. 247, 536-540 (1995).
[Pruitt 2000] Pruitt, K.D., Katz, K.S., Sicotte, H., Maglott, D.R. Introducing RefSeq and LocusLink: Curated Human Genome Resources at the NCBI. Trends Genet. 16, 44-47 (2000).