Intrinsic Protein Disorder in Complete Genomes

A. Keith Dunker 1, Zoran Obradovic 2, Pedro Romero 2, Ethan C. Garner 1, Celeste J. Brown 1

1 School of Molecular Biosciences, Washington State University, Pullman, WA, USA
2 School of Electrical Engineering and Computer Sciences, Washington State University, Pullman, WA

Abstract

Intrinsic protein disorder refers to segments or to whole proteins that fail to fold completely on their own. Here we predicted disorder on protein sequences from 34 genomes, including 22 bacteria, 7 archaea, and 5 eucaryotes. Predicted disordered segments ≥ 50, ≥ 40, and ≥ 30 residues in length were determined, as well as proteins estimated to be wholly disordered. The five eucaryotes were separated from bacteria and archaea by having the highest percentages of sequences predicted to have disordered segments ≥ 50 in length: from 25% for Plasmodium to 41% for Drosophila. Estimates of wholly disordered proteins in the bacteria ranged from 1% to 8%, averaging 3±2%. Estimates in various archaea ranged from 2 to 11%, plus an apparently anomalous 18%, averaging 7±5%, which drops to 5±3% if the high value is discarded. Estimates in the 5 eucarya ranged from 3 to 17%. The putative wholly disordered proteins were often ribosomal proteins, but in addition about equal numbers were of known and unknown function. Overall, intrinsic disorder appears to be common, with eucaryotes perhaps having a higher percentage of native disorder than archaea or bacteria.

Keywords: intrinsic disorder, prediction, structural genomics

1 Introduction

A major effort in bioinformatics is the prediction of function from amino acid sequence, with 3D structure viewed as a prerequisite for function [26]. Thus, associating a sequence with a particular structural family [25] or with a particular sequence motif [17] provides an avenue to predict function. One difficulty with this {Sequence} → {3D Structure} → {Function} paradigm is the identification of distantly related sequences [9].
A second difficulty is that motifs such as the TIM barrel have evolved different functions [25]. Thus, knowing the structure gives not one but a set of likely functions. A third difficulty is that one protein can have two or more completely unrelated functions, a phenomenon known as moonlighting [20].

A more fundamental problem with the {Sequence} → {3D Structure} → {Function} paradigm is that it is simply not true for many proteins. Intrinsic disorder, not fixed 3D structure, is sometimes required for function [28, 36]. By intrinsic disorder we mean ensembles of structures, such as a random coil or molten globule, with the various members of the ensemble in equilibrium with each other. From observations that variously shaped molecules bind competitively to serum albumin [21], Karush suggested 50 years ago that binding depends on an ensemble of structures in equilibrium. By now many additional examples of functional disorder are known; these include DNA recognition [33], enzymatic activation through proteolytic digestion [6], control of protein lifetimes [22], transport of an unfolded chain through a small orifice [8], and structural uncoupling of two or more domains by flexible linkers [19]. Given these many examples, identification of intrinsic disorder should be useful for inferring function.

We are studying predictions of disorder from amino acid sequence [10, 15, 23, 29-31, 35]. Here we report the results of applying a predictor of disorder to the proteins from 34 genomes. The results show that disorder is a very common element of protein structure and indicate that eucaryotes may have a higher proportion of intrinsic protein disorder than bacteria or archaea.

2 Materials and Methods

2.1 Databases of intrinsically ordered and disordered protein segments

The disordered data underlying the predictor came from proteins having at least 40 consecutive disordered residues, with 1149 disordered residues in all.
The X-ray-characterized proteins had the following PDB IDs: 2tbv, 2ts1, 1aui, 1bgw, 1elo, 1bcl, 1ati, and 1lbh. The NMR-characterized sequences can be found in SWISS-PROT [2] (prio_mouse, h5_chick, flgm_salty, regn_lambd, hsf_klula, hmgi_human) or PIR [5] (S50866). The ordered database was constructed from randomly selected segments from NRL_3D [27], which contains ordered residues. Enough ordered residues were selected to balance the disordered set.

To measure the false positive prediction rate, a database of ordered protein segments was constructed from PDB_SELECT_25 [16], which was based on grouping PDB proteins into families having 25% sequence identity. ORDERED_PDB_SELECT_25 (O_PDB_S25) was derived from PDB_SELECT_25 by removing residues lacking backbone coordinates.

2.2 Genomic Databases

The amino acid sequences for known and putative proteins were obtained for 34 complete or mostly complete genomes from the NCBI.

2.3 Predictor of intrinsic order and disorder

The predictor of natural protein disorder (PONDR) used for these studies represents a merger of 3 predictors: one for variously characterized long (VL1) regions of internal disorder and two for X-ray-characterized disorder located at the chain termini (XT), giving PONDR VL-XT. Each of these predictors is a simple neural network, with one architecture for VL1 and another for the two XTs. The inputs were attributes such as hydropathy or compositions of certain amino acids, calculated as simple averages over windows of 21 residues for VL1 [31] and over windows of various lengths for the two XTs [23]. The two XT predictors are described in much more detail elsewhere [23], as is the VL predictor and its merger with the two XT predictors [31].

2.4 Application to genomic sequences and structural databases

PONDR VL-XT was applied to the protein sequences longer than 60 residues in the 34 genomes.
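The window-averaged inputs of Section 2.3 and the length filter of Section 2.4 can be illustrated with a minimal Python sketch. It assumes the Kyte-Doolittle hydropathy scale and windows truncated at the chain ends; the text names neither choice, and all function names here are hypothetical. The sketch shows only how one input attribute might be computed, not the PONDR VL-XT networks themselves.

```python
# Kyte-Doolittle hydropathy values (an assumed scale; the paper does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_average(values, window=21):
    """Average each position over a centered window.

    Windows are truncated at the chain ends (an assumption of this sketch).
    """
    half = window // 2
    averaged = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        averaged.append(sum(values[lo:hi]) / (hi - lo))
    return averaged


def hydropathy_profile(sequence, window=21):
    """Window-averaged hydropathy; raises KeyError on non-standard residues."""
    return window_average([KD[aa] for aa in sequence], window)


def eligible(sequences, min_length=60):
    """Keep only sequences longer than min_length residues."""
    return [seq for seq in sequences if len(seq) > min_length]
```

A 21-residue window smooths residue-level noise while still resolving segments of a few tens of residues, which fits the segment lengths of 30 to 50 analyzed later.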
For Methanococcus jannaschii, Escherichia coli, and Saccharomyces cerevisiae, the prediction results were parsed according to sequence similarity to a known structure through the use of PEDANT [12, 13]. The PEDANT matches were based on IMPALA searches [32] against a library of position-specific scoring matrices derived from each PDB sequence using BLAST [1].

3 Results

3.1 Predictor Error Estimation

To estimate false positive prediction rates, PONDR VL-XT was applied to O_PDB_S25. These are putative errors because some of the proteins in these two databases exist in the crystals as complexes and are disordered in the absence of their partners. The false positive error rates were determined by two methods of analysis: (1) the per-chain error and (2) the per-prediction error (Table 1). As we have shown before [30], as the length (L) of the disorder prediction increases, the false positive error rate drops rapidly; in this case from 1% of windows and 17% of chains with L ≥ 30 to 0.1% of windows and 2% of chains with L ≥ 50.

Table 1: False positive error rate for disorder predictions

Analysis                Fraction with L ≥ 50   Fraction with L ≥ 40   Fraction with L ≥ 30
Per-chain error         17/1111  2%            69/1111  6%            189/     %
Per-prediction error    173/     %             702/     %             2252/    %

3.2 Prediction of wholly disordered proteins by cumulative distribution functions (CDFs)

A few proteins, such as FlgM [8], 4E-BP1 [11], HMG-I(Y) [18], and neuromodulin [39], are disordered from end to end under physiological conditions, and yet they carry out function. Estimating the number of such wholly disordered proteins in the various genomes is of interest. Figure 1 illustrates the method used to identify proteins that are likely to be wholly disordered. In this figure, 3 ways of representing prediction data are shown using 2 example proteins.

[Figure 1: panels A-D, plotting disorder score, frequency, and cumulative frequency]

Figure 1. Development and application of CDF curves for identification of wholly disordered protein. A. Graph of disorder scores at each amino acid position for a disordered (dotted) and an ordered (bold) protein. B. Distribution of disorder scores for a disordered (cross hatched) and an ordered (solid) protein. C. CDF curves of disorder scores for disordered (dotted) and ordered (bold) proteins. D. A collection of fully ordered (bold) and fully disordered (dotted) proteins with an optimized boundary.

The output of PONDR VL-XT is below 0.5 for a residue predicted to be ordered and above 0.5 for a residue predicted to be disordered, so ordered and wholly disordered proteins tend to lie on either side of this boundary (Fig. 1A). Alternatively, the predictions can be displayed as histograms (Fig. 1B). From each histogram, a cumulative distribution function (CDF) [34] can be calculated by determining the fraction of the distribution that lies below a given value on the x-axis (Fig. 1C). CDFs have the advantage that overlapping histograms can become completely separated curves.

The optimal boundary between datasets of completely ordered and completely disordered proteins was found by minimizing

Error = #incorrect(o)/total(o) + #incorrect(d)/total(d),

where #incorrect(o) is the number of ordered points incorrectly classified as disordered and #incorrect(d) is the number of disordered points incorrectly classified as ordered. This minimization was applied on a bin-by-bin basis with a bin size of 0.1. The optimization improved if boundary values for bins 0.1, 0.8, and 0.9 were simply omitted, thus yielding the boundary shown (Fig. 1D); CDF curves from the ordered and disordered protein sets are shown. Note the misclassification of two ordered proteins as disordered and four disordered proteins as ordered.

3.3 Predicted disorder in 34 genomes

PONDR VL-XT was applied to the known and putative protein sequences of length L > 60 in 34 genomes.
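The CDF construction and the boundary criterion of Section 3.2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the CDF counts scores at or below each bin edge, candidate boundaries are taken from the observed CDF values, and a protein whose CDF value falls below the boundary at a given bin is treated as disordered at that bin (ordered proteins accumulate low scores quickly, so their CDF curves run higher).

```python
# Bin edges 0.1, 0.2, ..., 1.0, matching the bin size of 0.1 in the text.
BIN_EDGES = [i / 10 for i in range(1, 11)]


def cdf_at_bins(scores, bin_edges=BIN_EDGES):
    """Fraction of per-residue disorder scores at or below each bin edge."""
    n = len(scores)
    return [sum(1 for s in scores if s <= edge) / n for edge in bin_edges]


def boundary_error(ordered_vals, disordered_vals, boundary):
    """Error = #incorrect(o)/total(o) + #incorrect(d)/total(d) at one bin.

    ordered_vals and disordered_vals hold the CDF values, at that bin, of the
    proteins in the ordered and disordered training sets respectively.
    """
    wrong_o = sum(1 for v in ordered_vals if v < boundary)      # ordered called disordered
    wrong_d = sum(1 for v in disordered_vals if v >= boundary)  # disordered called ordered
    return wrong_o / len(ordered_vals) + wrong_d / len(disordered_vals)


def best_boundary(ordered_vals, disordered_vals):
    """Boundary value minimizing the error at one bin (candidates: observed values)."""
    candidates = sorted(set(ordered_vals) | set(disordered_vals))
    return min(candidates, key=lambda b: boundary_error(ordered_vals, disordered_vals, b))
```

Calling best_boundary once per bin gives one boundary point per disorder-score bin; joining those points would give a curve of the kind described for Fig. 1D.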
The numbers of predicted disordered segments with L ≥ 50, ≥ 40, and ≥ 30 were determined. In addition, the CDF analysis for the identification of putative wholly disordered proteins was carried out. The results of these two analyses for the 34 genomes are given in Table 2.

Table 2. Prediction of disorder in 34 genomes

Kingdom    Species                                 # seqs  L ≥ 30    L ≥ 40    L ≥ 50    CDF*
Archaea    Methanococcus jannaschii                        %         155 9%    71 4%     26 2%
Archaea    Pyrococcus horikoshii                           %         %         164 8%    70 3%
Archaea    Pyrococcus abyssi                               %         %         157 9%    62 4%
Archaea    Archaeoglobus fulgidus                          %         %         %         93 4%
Archaea    Methanobacterium thermoautotrophicum            %         %         %         140 7%
Archaea    Halobacterium sp.nrc                            %         %         %         %
Archaea    Aeropyrum pernix K                              %         %         %         %
Bacteria   Ureaplasma urealyticum                          %         44 7%     14 2%     9 1%
Bacteria   Rickettsia prowazekii                           %         54 6%     23 3%     5 1%
Bacteria   Borrelia burgdorferi                            %         57 7%     26 3%     14 2%
Bacteria   Campylobacter jejuni                            %         148 6%    80 3%     21 1%
Bacteria   Mycoplasma genitalium                           %         39 8%     20 4%     10 2%
Bacteria   Helicobacter pylori                             %         140 9%    69 5%     24 2%
Bacteria   Aquifex aeolicus                                %         %         94 6%     29 2%
Bacteria   Haemophilus influenzae                          %         %         126 7%    27 2%
Bacteria   Bacillus subtilis                               %         %         323 8%    87 2%
Bacteria   Escherichia coli                                %         %         363 8%    107 2%
Bacteria   Vibrio cholerae                                 %         %         333 9%    93 2%
Bacteria   Mycoplasma pneumoniae                           %         95 14%    60 9%     14 2%
Bacteria   Xylella fastidiosa                              %         %         246 9%    103 4%
Bacteria   Thermotoga maritima                             %         %         165 9%    53 3%
Bacteria   Neisseria meningitidis MC                       %         %         190 9%    64 3%
Bacteria   Chlamydia pneumoniae                            %         %         %         40 4%
Bacteria   Synechocystis sp                                %         %         %         104 3%
Bacteria   Chlamydia trachomatis                           %         %         99 11%    42 5%
Bacteria   Treponema pallidum                              %         %         %         37 4%
Bacteria   Pseudomonas aeruginosa                          %         %         %         183 3%
Bacteria   Mycobacterium tuberculosis                      %         %         %         293 7%
Bacteria   Deinococcus radiodurans chr                     %         %         %         212 8%
Eukaryota  Plasmodium falciparum chr II, III               %         %         %         11 3%
Eukaryota  Caenorhabditis elegans                          %         %         %         %
Eukaryota  Arabidopsis thaliana                            %         %         %         653 8%
Eukaryota  Saccharomyces cerevisiae                        %         %         %         356 6%
Eukaryota  Drosophila melanogaster                         %         %         %         %

*Numbers and percentages of chains predicted to be wholly disordered by the CDF analysis of Figure 1.

The percentage estimates in Table 2 are uncorrected for false positive error rates. From the per-chain false positive error rates in Table 1, the percentage values for L ≥ 50 should be reduced by ~2%, the values for L ≥ 40 by ~6%, and the values for L ≥ 30 by ~17%. These corrections are only approximate.

Wholly disordered proteins should not form crystals, so any such protein that has high sequence similarity to a protein in PDB would be a candidate for a prediction error. PEDANT [12, 13] was used to compare the putative wholly disordered proteins of three representative genomes, one for each kingdom, with the proteins in PDB: M. jannaschii for the archaea, E. coli for the bacteria, and S. cerevisiae for the eucaryotes. Of the 26, 107, and 356 putative wholly disordered proteins in M. jannaschii, E. coli, and S. cerevisiae, respectively, 2 in M. jannaschii, 20 in E. coli, and 56 in S. cerevisiae were associated with proteins in PDB.

Further analysis (Table 3) shows that these associations might not all relate to prediction errors. For example, sometimes fragments of proteins rather than whole proteins are crystallized. Also, many intrinsically disordered proteins become ordered upon association with partners; such proteins can appear in PDB as ordered because the complex, not the individual protein, forms crystals. Finally, proteins in PDB may contain segments of disorder that are associated with the putative wholly disordered proteins. As indicated in Table 3, 1 of the 2 putative wholly disordered proteins in M. jannaschii, all but 1 of the 20 such proteins in E. coli, and all but 2 of the 56 such proteins in S. cerevisiae fall into one of the categories suggesting that these proteins might be intrinsically disordered despite their appearance in PDB.
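The counts quoted above can be turned into candidate error rates with a short Python sketch. The numbers are taken from the text; the function and variable names are hypothetical, and treating every PDB-associated protein outside the explanatory categories as a prediction error is an assumption made only for illustration.

```python
# (putative wholly disordered, associated with PDB, associated but not
#  covered by any of the explanatory categories), as quoted in the text.
COUNTS = {
    "M. jannaschii": (26, 2, 1),
    "E. coli": (107, 20, 1),
    "S. cerevisiae": (356, 56, 2),
}


def candidate_error_rates(total, associated, unexplained):
    """Return (rate from simple PDB association, rate after the category analysis)."""
    return associated / total, unexplained / total


for organism, (total, associated, unexplained) in COUNTS.items():
    simple, residual = candidate_error_rates(total, associated, unexplained)
    print(f"{organism}: {simple:.1%} by simple association, {residual:.1%} unexplained")
```

On these counts the candidate error rate falls from roughly 8%, 19%, and 16% to about 4% for M. jannaschii and to under 1% for E. coli and S. cerevisiae.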
More work is needed to better define the correspondence between the putative wholly disordered proteins and the related proteins in PDB, but these comparisons show that the error rate could be much lower than that suggested by simple associations with proteins in PDB.

Table 3. Putative wholly disordered proteins with sequence similarity to proteins of known 3D structure

Organism: M. jannaschii, E. coli, S. cerevisiae
Total number
Only fragments visible  4  11
Bound to DNA  1  30
Bound to Protein  2  6
Bound to other Ligands  3  2
Bound within Di- or Multimers
Contain visible regions of disorder  8  3
Unbound Monomers

3.4 Functions of putative wholly disordered proteins

In genomic studies of intrinsic disorder, the relationship between intrinsic disorder and protein function is of prime importance. An initial step towards understanding this relationship is to determine the functions of the putative wholly disordered proteins. Thus, we searched the various databases to determine which of the putative wholly disordered proteins have a known function, which have a likely function, and which are unidentified reading frames (URFs) [9]. The results are compiled in Table 4 for the three representative genomes.

Table 4: Categorization of putative wholly disordered proteins from 3 genomes

Organism: M. jannaschii, E. coli, S. cerevisiae
Ribosomal
Known Function
Likely structural motif
URF

Ribosomal proteins stood out in the CDF analysis. Proteins in the known function category were assigned either because they had been studied directly or because they had high sequence similarity to a protein of known function. Too many different functions are represented to list them all. Proteins in the likely structural motif category have identifiable structural characteristics, such as a probable transmembrane segment, although no relationship to proteins of known function could be established. We conjecture that a protein with a likely structural motif is more likely to be a real protein than a URF with no such feature.
4 Discussion

4.1 Prediction error rates

The relationship between per-prediction and per-chain error rates is simple (Table 1). The error rate of 0.1% came from the fact that the ~220,000 amino acids of O_PDB_S25 gave about 22 predictions of disorder of length 50 or longer (e.g. 22/220,000 = 0.1%). Since this database contained 1,111 sequences, the per-chain error rate was about 22/1,111 or about 2%, indicating that two or more disorder predictions per chain were rare.

Comparing the per-chain error rates of Table 1 with the disorder predictions in Table 2 indicates that the rates of predicted disorder in most of the genomes are much higher than in O_PDB_S25. It is tempting to simply subtract the error rates of Table 1 from the prediction rates of Table 2 to give corrected estimates. However, there are problems with such a simple approach. First, the per-chain error rate increases as the length of an ordered protein increases. Thus, the error rate for O_PDB_S25 could be subtracted from that of a given genome only if the length distributions in the two datasets matched, but in general such a match is unlikely. Second, the segments in O_PDB_S25 would have to have false positive prediction rates similar to those of the ordered segments in the given genome, but this is unlikely to be true. O_PDB_S25 contains representative chains from each family of the current set of crystallized structures, whereas genomes don't contain just one protein from each family, but rather have sets of similar proteins with variable numbers of elements in the different sets.

In addition, the false negative errors, i.e. predictions of order for a residue or segment that is disordered, are not evaluated here. Our previous work shows that there are different types, or flavors, of disorder [15, 31, 35].
These flavors arise because the amino acid compositions of regions of intrinsic disorder are so variable; for example, one disordered region could be very rich in serine and glycine while another could be rich in lysine and arginine. Because the number of flavors is uncertain, it is not a simple task to estimate false negative error rates.

4.2 Commonness of intrinsic disorder

Sequence databases such as SWISS-PROT and PIR contain proteins with significantly higher fractions of predictions of long regions of disorder than the sequences in the PDB [30, 31]. These data suggested that nature is quite rich in disorder and furthermore that PDB is strongly biased against intrinsic disorder, probably because of the requirement for crystallization [29-31]. Sequence databases have their own biases. For example, leucine zippers appear more often in these databases than they occur in nature. Thus, compared to the various sequence databases, characterization of disorder on a genome-by-genome basis provides a better way of estimating the commonness of intrinsic disorder.

The bacteria and archaea exhibit a surprisingly wide range of disorder, with L ≥ 50 ranging from 2 to 24%, or with the fractions of putative wholly disordered proteins ranging from 1 to 18% (Table 2). There is no clear separation of the archaea and bacteria based on the amount of predicted disorder. Further experimentation and study are needed to determine whether these large disorder variations are real.

The five eucaryotes exhibit more disorder, as measured by L ≥ 50, than any of the archaea or bacteria. That is, the five eucaryotes yield 25-41% of chains with predicted disorder of L ≥ 50 as compared to 24% for the highest bacterium. Data on more genomes are needed to determine whether eucaryotes indeed have more disorder as suggested here, especially given the two archaea that have nearly as much predicted disorder as the lower range for the eucaryotes.
Disorder enables complexes with low affinity coupled with high specificity [10, 33] and also facilitates the binding of one molecule to many partners [10, 21, 22, 36]. These two characteristics were discovered by trying to understand proteins that are involved in signalling pathways. Thus, the higher amount of disorder in the eucaryotes might relate to a greater need for control and regulation as compared to the archaea and bacteria. If so, the wide range of predicted disorder in the archaea and bacteria might relate to differences in usages of signalling and control in these organisms. If this speculation were true, the number of regu