Journal Articles

Exhaustive and Iterative Clustering of the Protein Databank

by K. Kelly

Abstract. The unique high-resolution protein chains in the September 1998 edition of the Brookhaven Protein Databank have been subjected to an exhaustive and iterative sequence- and structure-based clustering procedure to produce a database of alignments for homology identification and modeling. A novel feature of the procedure was the use of multiple-sequence, structure-based alignment to validate hypothesized clusters. The resulting database contains fewer than 800 entries. Homology searches against this database, using rigorous sequence-to-group alignment, show high sensitivity and specificity when compared to existing methodologies. Four models were built based on families in the database and were submitted to the recently completed CASP3 competition.


It is not yet possible to predict the 3D structure of an expressed protein from its amino acid sequence alone. Consequently, inferences about the structure and function of new protein sequences are generally drawn based upon comparisons to sequences for which there already exist experimental models. To date, the most useful principle which is applied to guide such searches is the rule that "similar sequence" implies "similar structure," where sequence similarity is understood to be determined by application of the well-established dynamic programming paradigm (Needleman and Wunsch). Generally speaking, if a protein sequence shares more than 25% pairwise similarity with a known structure, it usually also shares at least the broad outlines of the fold topology of the known structure. However, the Protein Databank contains many pairs of similar structures that are remote homologues with less -- sometimes much less -- than 25% pairwise sequence identity. Standard homology searching tools often have difficulty identifying such remote relationships with any confidence, and even when a relationship is identified, the correct alignment of the new sequence to the homologous structure can be difficult to judge.

Researchers are actively working on strategies for improving the sensitivity and selectivity of homology searching as well as the accuracy of the alignments that are necessary to build homology models based on existing structures. Broadly speaking, one can divide such efforts into two classes.

The first class comprises methods that seek to to enhance the effectiveness of pure sequence-based queries by taking advantage of multiple-sequence alignment information in the form of profiles from particular protein families. The use of "signatures" -- short, amino acid sequence motifs --, judged to be characteristic of a family of related proteins, in the PROSITE and BLOCKS databases, represent such an effort (Henikoff, et al), as does the use of profile analysis in the recently developed PSI-BLAST (Altschul et al). Recently, Hidden Markov Models (HMMs) have been used to represent the information contained in multiple-sequence alignments. For example, the Pfam database (Sonnhammer et al) contains HMMs that represent various protein domains. It has been observed that the success of HMMs as homology detection devices is highly dependent on the quality of the initial alignments used to generate them (Henikoff et al).

The second class comprises efforts to develop "fold recognition" algorithms that make direct use of information about structural features in the targets. The term "threading" is often used to refer collectively to such methods, though it was first used to refer to the use of mean-force potentials of pairwise residue distances, as derived from databases of protein structures (Sippl). Alternatively, several related methods have been developed in the last few years which use statistical preferences of amino acids for certain structural environments to align an amino acid sequence to an environment string using dynamic programming. Such methods may also modify gap penalties to take advantage of knowledge of where insertions or deletions could realistically be made in a target structure. While some successes have been reported in identifying very remote homologues, a recent study (Jaroszewski et al) observed that structure-only based searching methods did not perform as well over a set of several fold recognition benchmarks as sequence-only based methods. The best results were observed using hybrid schemes with mixed sequence/structure scoring matrices.

The goal of the MOE project described in this paper is to improve homology identification and modeling by taking advantage of both the information implicit in multiple-sequence alignments, and the structural information available from experimental models. To do this, the structures in the Protein Databank were clustered into sets of proteins with related sequences and similar structures, using sequence and structure-based alignment methods to validate the hypothesized clusters and to guarantee that the alignments - which will be used for homology searching and modeling - accurately represent the structurally conserved cores of protein families.

This paper introduces the protein family database distributed in version 1998.10 of Chemical Computing Group's Molecular Operating Environment, describes the automated protocol used to build the database, and presents the test results which demonstrate improvements in the sensitivity of homology searches.

Methods and Materials

The raw material of this study included the contents of the September 1998 release of the Brookhaven Protein Databank (Bernstein et al), as well as the contents of two public domain sequence databases: Swiss-Prot version 37 (Bairoch and Apweiler) and the Protein Identification Resource (PIR) version 57.0 (Barker et al). Only high resolution X-ray models from the PDB were used (3.0 Angstroms or higher); chains that were shorter than 25 residues or contained internal chain breaks or non-standard amino acids were excluded. The following flowchart summarizes the overall clustering protocol that was applied to this data:

The first step was to perform all-against-all sequence alignments between all the unique protein sequences in the PDB that met the criteria described above. Based on the results of the sequence comparisons, links were created between any two chains whenever 90% of the residues of the longer chain were aligned against identical amino acids in the shorter chain. Using these links, the chains were clustered on a single-linkage basis - that is, all linked chains were put into the same cluster. Each cluster was then represented by one chain in the dataset that was submitted to the full iterative clustering procedure.

These chains were used as queries for sequence similarity searches of the PIR and Swiss-Prot databases. The criteria used to return matches from these searches were highly conservative, as there would be no structural information to confirm any hypothesized homologies. The purpose of collecting this extra sequence data prior to embarking on the iterative clustering was primarily to improve alignment quality though it was also possible that additional links between PDB chains might be identified that would otherwise have been missed.

Once the structure data had been reduced to a (relatively) non-redundant subset, and the PIR and Swiss-Prot data had been searched, the iterative phase of the procedure was entered. Each iteration began with a set of clusters. The first iteration began with a set of clusters containing one PDB entry each, and possibly augmented by sequences recruited from PIR or Swiss-Prot.

Using MOE-Align, with the sequence-only option enabled, each cluster was aligned against each other cluster, and a Z-score recorded. Z-scores were calculated by comparing raw alignment scores to a random distribution generated by random permutations of sequences in one or the other of the clusters being aligned. Then, single-linkage clustering was performed, where a link was considered to exist between two clusters if the Z-score was high enough.

This procedure created a set of hypothesized clusters, which were submitted to a validation protocol, which, roughly speaking, sought to establish whether or not all of the chains in the proposed cluster were "sufficiently superposable". The validation procedure, which involves the use of the MOE tools MOE-Align, MOE-Consensus and MOE-ProSuperpose, is summarized by the following chart:

MOE-Align was used to applied sequence and structure based alignment to the hypothesized cluster. This procedure is discussed in some detail in an earlier JCCG feature (Multiple Sequence and Structure Alignment in MOE), but a brief summary is as follows:

  1. Multiple sequence alignment, using tree-based build-up and randomized iterative refinment.
  2. Global-multi-body superposition of the alpha-Carbon traces
  3. Re-alignment using dynamic programming, where the scoring matrix was derived from the 3D positions of the alpha-Carbons after superposition.
  4. Re-superpose using the new alignment. If the RMSD has improved, go to step 3. Otherwise terminate.

After the alignment stage, the hypothesized cluster would be tested for acceptance. In view of the express purpose of this database - which is to create clusters from which accurate alignments and models can be created - the acceptance criteria was as follows :

The maximal set of alignment positions such that the worst pairwise RMSD, over these positions was less than 3.0 Angstroms was determined. If this set of alignment positions spanned a sufficient percentage of length of each chain (75%), then the cluster was accepted. otherwise it was rejected. If the cluster was rejected, then all subsets with more than two members would be tested in the same fashion.

The iterative clustering procedure was terminated when an iteration failed to produce any new clusters.

Results and Discussion

The 2300 chains were distributed into 755 families in the final database. 500 clusters contained only one PDB entry; there were 83 entries with more than 3 chains. By way of comparison, the SCOP database (Murzin et al) release 1.37, based on the October 1997 version of the Brookhaven PDB, distributed the proteins into slightly more than 800 "families" grouped into over 600 "superfamilies", where families were considered possess clear evolutionary relationships, and superfamilies "probable" evolutionary relationship. The clusters in MOE's family database generally corresponded to SCOP's families, though some spanned part or all of a superfamily. Some SCOP families were distributed in the MOE database into more than one cluster if the members were not adequately superposable. For example, the kinases were distributed into three families, with twitchin (PDB entry 1KOB) isolated by virtue of divergence form the othe kinases in the C-term region, and the insulin dependent kinases, which superpose to an RMSD of about 4.7 Angstroms to the other kinases, also put in their own cluster. There were other instances where SCOP families were merged - for example, the trypsins and serine proteases can by globally superposed to within 1.4 Angstroms RMSD (see Protein Analysis in MOE: The Serine Proteases).

As one of the express purposes of building this database was to improve the efficiency of remote homology detection, the following test was performed. The clusters were examined for instances of chains with less then 25% pairwise sequence identity to at least two other members in its cluster. Each such chain was then extracted from its cluster, and any chains of higher than 25% percent identity to it were discarded, and the remaining chains were re-aligned. Then, the extracted chain was aligned - using sequence information only - against each remaining member of the cluster, and against the cluster as a whole, and the maximum Z-scores achieved in the paiwise alignment was compared to the Z-score calculated against the cluster. For comparison's sake, same measurements were then made using 25% and 50% percentage identity thresholds. The results are summarized in the following table.

    Maximum %id Pairwise Z-score Cluster Z-score Difference
    < 25 7 10.5 3
    25 - 50 13 17 4
    >50 32 26 -6

It is worth noting that the increase in the average Z-score among the more remote homologues resulted in lifting the strength of the sequence homology signal well out of the "noise" range. The table below contains some example of the boost in Z-scores that were observed.

    Family PDB entry % identity Pairwise Z-score Cluster Z-score
    Globin 1ITH.A 22 8.3 13.5
    Lectin 1LCL 25 9.7 10.9
    Isocitrate dehydrogenase 1IDC 23.6 12.3 14.7

While comprehensive comparison to other standard searching methods has yet been made, there are various examples in the database of clusters which include remote homologues which appear not be detected by PSI-BLAST or Pfam. For example, consider the ferredoxin oxioreductase represented in the PDB by accession number 1A8P.

In MOE's family database, 1A8P was clustered with a set of five other protein chains. The percentage identities and pairwise RMSD values are shown in the tables below.


              1A8P   1CNF   1NDH   1FNB   1QUE   1FDR

 1 1A8P   :  100.0   15.4   11.9   13.2   12.2   30.7
 2 1CNF   :   15.6  100.0   35.6   10.8   12.2   13.9
 3 1NDH   :   12.5   36.9  100.0   12.5   12.9   16.8
 4 1FNB   :   15.2   12.3   13.7  100.0   47.9   16.8
 5 1QUE   :   14.4   14.2   14.4   49.0  100.0   15.2
 6 1FDR   :   29.2   13.1   15.2   13.9   12.2  100.0

Pairwise RMSD
    - lower triangle is pairwise superposition RMSD
    - upper triangle difference between pairwise and global RMSD

              1A8P   1CNF   1NDH   1FNB   1QUE   1FDR

  1 1A8P   :  0.000  0.000  0.000  0.000  0.000  0.000
  2 1CNF   :  2.896  0.000  0.000  0.000  0.000  0.000
  3 1NDH   :  2.934  1.570  0.000  0.000  0.000  0.000
  4 1FNB   :  2.733  2.975  3.139  0.000  0.000  0.000
  5 1QUE   :  2.857  3.039  3.193  0.818  0.000  0.000
  6 1FDR   :  1.724  2.575  2.637  2.714  2.883  0.000

pro_Superpose: global RMSD =    2.660

These proteins all contain two globular domains: an oxioreductase FAD/NAD binding domain at the C-term, and a cytochrome reductase domain at the N-term. When 1A8P was extracted and submitted as query to PSI-BLAST, only 1FDR was picked up above the default significance threshold. Pfam version 3.3 (December, 1998) reported homology only to the FAD/NAD binding domain, and not to the N-term domain, despite the fact that models of both domains were in the database.

When the sequence of target 62 from the recently concluded structure prediction competition CASP3 was used as a query, this family was reported as globally homologous with very large Z-score (over 12). Again, PSI-BLAST and Pfam reported homologies only to the C-term. A model of target 62 was submitted to CASP3 based in this family.

The other models submitted to CASP3 based on alignments to this database were targets 55, 69 and 82. The results will be discussed when they become publicly available. With the exception of target 69, the percentage identities of these sequences to the templates found in the database were on the order of 20%.


Altschul, S.F. et al Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402 (1997)

Barker, W. et al The PIR-International Sequence Database Nucleic Acids Research 26:27-32 (1998)

Bairoch, A. and Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1998. Nucleic Acids Research 26:38-42 (1998)

Bernstein F.C, Koetzle, T.F, Williams, G.J.B., Meyer, E.F., Brice M.D., Rodgers, J.R., Kennard, O., Shimanouchi, T. and Tasumi, M. The Protein Data Bank: a computer based archival file for macromolecular structures. Journal of Molecular Biology, 112:535-542 (1977)

Henikoff, S., Pietrokovski, S. and Henikoff, J. Superior Performance in protein homology detection with the Blocks Database servers Nucleic Acids Research 26:309-312 (1998)

Jaroszewski, L., Rychlewski L., Zhang, B., and Godzik, A. Fold Prediction by a hierarchy of sequence, threading, and modeling methods. Protein Science 7:1431-1440 (1998)

Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia C. Scop: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247:436-540 (1995)

Needleman and Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443-453 (1970)

Sippl, M.J. Knowledge-based potentials for proteins. Current Opinion in Structural Biology 5:229-235 (1995)

Sonnhammer, E. et al, Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Research 26:320-322 (1998)