Tools for genome annotation





















PGAP, the NCBI prokaryotic genome annotation pipeline, is now available as a stand-alone software package. You can run it yourself on your own machine, a local cluster, or the cloud to produce annotated genomes ready for submission to GenBank.

The differences for GenBank purposes are as follows. For non-WGS genomes: each chromosome is in a single sequence and there are no extra sequences; each sequence in the genome must be assigned to a chromosome, plasmid, or organelle; plasmids and organelles can still be in multiple pieces.
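The non-WGS requirements above can be expressed as a simple pre-submission check. The following is a minimal sketch under stated assumptions: `SequenceRecord`, its fields, and `check_non_wgs` are hypothetical names introduced for illustration and are not part of PGAP or of any GenBank submission tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_TYPES = {"chromosome", "plasmid", "organelle"}


@dataclass
class SequenceRecord:
    """Hypothetical description of one sequence in a genome submission."""
    name: str
    replicon_type: Optional[str]  # "chromosome", "plasmid", "organelle", or None
    replicon_name: Optional[str]  # e.g. "1" or "pA"; None if unassigned


def check_non_wgs(records: List[SequenceRecord]) -> List[str]:
    """Return human-readable problems; an empty list means the rules hold."""
    problems = []
    # Rule: every sequence must be assigned to a chromosome, plasmid, or organelle.
    for rec in records:
        if rec.replicon_type not in ALLOWED_TYPES:
            problems.append(f"{rec.name}: not assigned to a chromosome, plasmid, or organelle")
    # Rule: each chromosome must be a single sequence.
    # Plasmids and organelles may be in several pieces, so they are not counted.
    pieces = Counter(r.replicon_name for r in records if r.replicon_type == "chromosome")
    for chrom, count in pieces.items():
        if count > 1:
            problems.append(f"chromosome {chrom}: split across {count} sequences")
    return problems
```

A genome with one chromosome sequence and a plasmid in two pieces passes this check, whereas a chromosome split into two sequences or an unassigned contig is reported.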

Hence the incentives for automating as much of the annotation process as possible. Several other similar systems have been created since then, but GeneQuiz remains the only such tool that is open to the general public. The system further clusters proteins from the analyzed genome by sequence similarity and constructs multiple alignments.

The results are presented in a table that contains information on the best hits (including gene names, database identifiers, and links to the corresponding databases), predictions for secondary structure, coiled-coil regions, etc.

The functional assignment is then made automatically on the basis of the functions of the homologs found in the database. At this level, functional assignments are qualified as clear or as ambiguous. The effectiveness and accuracy of such fully automated systems have been the subject of a rather heated discussion but still remain uncertain. A similar discrepancy between the functional predictions made by the GeneQuiz team [31] and those obtained by mostly manual annotation was reported for the proteins encoded in the M.

It appeared that GeneQuiz analysis suffered from the usual pitfalls of sequence similarity searches see 3. While GeneQuiz seems to be the only fully automated genome annotation tool that is open to the public for new genome analysis, there have been reports of similar systems developed by other genome annotation groups.

Although none of these systems is freely available to outside users, many of the genome annotation results they produced are accessible on the web and can be used to judge the performance. The PEDANT web site contains by far the most information open to the public and can be used as a good reference point for automated genome analyses see also 2.

In addition to completely automated systems, some tools that greatly facilitate and accelerate manual genome annotation are worth a mention. SEALS combines software for retrieving sequence information, scripting database searches with BLAST, viewing and parsing search outputs, searching for protein sequence motifs using regular expressions, and predicting protein structural features and motifs.
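Searching for a protein sequence motif with a regular expression can be illustrated with a short sketch. This is not SEALS code. It is an independent Python example, and the function name `find_motif` is an assumption made for illustration. The pattern is the widely used P-loop (Walker A) consensus, written in PROSITE notation as [AG]-x(4)-G-K-[ST].

```python
import re
from typing import List, Tuple

# P-loop (Walker A) consensus [AG]-x(4)-G-K-[ST] as a regular expression.
P_LOOP = re.compile(r"[AG].{4}GK[ST]")


def find_motif(seq: str, pattern=P_LOOP) -> List[Tuple[int, str]]:
    """Return (1-based start, matched text) for each non-overlapping match."""
    return [(m.start() + 1, m.group()) for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(find_motif("MKTAGPSGSGKSTLLNA"))
```

A match to such a short pattern is only a hint. Short motifs occur by chance in unrelated proteins, so hits of this kind need the same critical assessment as database search results.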

Once these regions are identified and masked, database searches are run in a batch mode using the chosen method, e. SEALS has been extensively used in the comparative studies of bacterial, archaeal, and eukaryotic genomes e. Benchmarking the accuracy of genome annotation is extremely hard.

It has been shown on numerous occasions that more advanced methods for sequence comparison, such as gapped BLAST and subsequently PSI-BLAST, sometimes used in combination with threading, as well as various forms of motif analysis and careful manual integration of the results produced by all these approaches, substantially improve detection of homologs e. In the end, however, genome annotation is not about detection of homologs but rather about functional prediction, and here, the problem of a standard of truth is formidable.

By definition, functional annotation (more precisely, functional prediction) deals with proteins whose functions are unknown, and the rate of experimental testing of predictions is extremely slow.

We believe that it is possible to design an objective test of the accuracy of genome annotation in the following manner. The protein set encoded in a newly sequenced genome is analyzed, and specific active centers and other functionally important sites are predicted for as many proteins as possible. When a new, preferably phylogenetically distant genome becomes available, orthologs of the proteins from the first genome are identified, and the conservation of the predicted functional sites is assessed.

Lack of conservation would count as an error. This is, of course, a harsh test that would give a lower bound on accuracy because, first, functional site prediction may be partly wrong while the function of the protein is still predicted correctly and, second, some active sites might be disrupted in the new genome.
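The proposed test can be sketched in a few lines. The example below assumes that a pairwise alignment between a protein and its ortholog is already available and that predicted functional sites are given as 1-based positions in the ungapped reference sequence. The function name `site_conservation` and the strict identity criterion are assumptions made for this illustration; a real assessment might also accept conservative substitutions.

```python
from typing import List


def site_conservation(ref_aln: str, orth_aln: str, sites: List[int]) -> float:
    """Fraction of predicted sites whose residue is identical in the ortholog.

    ref_aln and orth_aln are two rows of one pairwise alignment ('-' marks a gap).
    sites holds 1-based positions in the ungapped reference sequence.
    """
    if len(ref_aln) != len(orth_aln):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(sites)
    conserved = 0
    pos = 0  # position in the ungapped reference
    for ref_res, orth_res in zip(ref_aln, orth_aln):
        if ref_res == "-":
            continue  # insertion in the ortholog; no reference position consumed
        pos += 1
        if pos in wanted and ref_res == orth_res:
            conserved += 1
    return conserved / len(wanted) if wanted else 0.0
```

A site aligned to a gap or to a different residue counts as not conserved, which matches the harsh, lower-bound character of the test described above.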

Steven Brenner published an interesting comparison of three independent annotations of the smallest of the sequenced bacterial genomes, Mycoplasma genitalium. In a similar exercise that we have done on the basis of the COG database, we found that of COGs that did not include paralogs the number for the end of , members of had conflicting annotations in GenBank.

Clearly, even the lower of these estimates represents a serious problem for genome annotation, bringing up the specter of error catastrophe [89]. We first briefly discuss the most common sources of errors and then some ideas regarding the ways out.

Manual and automated genome annotation encounter the same typical problems, which we already mentioned in the discussion of the reliability of sequence database records see 3.

Inevitably, even partial automation of the annotation process tends to increase the likelihood of all these types of errors. In order to examine various kinds of errors that are common in genome annotation, it is convenient to re-examine four cases of discrepancies in the annotation of M.

Although one of the authors was involved in one of the compared annotations, we think we can be completely impartial in the spirit of Brenner's article, especially since six years have passed, an eternity for genomics. The protein MG was not annotated in the original genome publication by Fraser and colleagues and was assigned conflicting annotations by the other two groups. A database search performed in leaves no doubt whatsoever that the protein is a permease; this is, of course, readily supported by transmembrane segment prediction.

However, the glycerolphosphate specificity is not supported at all. Instead, these searches, particularly the CDD search, unequivocally pointed to a relationship between MG and a family of cobalt transporters. This single case nicely covers several common problems of genome annotation. The most benign but also apparently most widespread of these is overprediction or, more precisely, overly specific prediction.

Such semantic snafus are pretty common in genome annotations, especially those that are either produced fully automatically or manually but non-critically e.

However, these are probably the least serious annotation errors. What is worse: the search result that presumably gave rise to this annotation is impossible to reproduce at this time, at least not without detailed research, which we are not willing to undertake. It is most likely that this blatantly wrong annotation was due to a spurious database hit to a ribosomal protein that was not critically assessed. It is not clear, in this particular case, how this spurious hit could pass the significance threshold, but in general, this happens most often because of the lack of proper filtering for low complexity or of alternative approaches, such as composition-based statistics, which are available in but had not been developed in ; see Chapter 4.
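The idea behind low-complexity filtering can be shown with a deliberately simplified sketch. It is not the algorithm used by BLAST or any other search program, and it is unrelated to composition-based statistics. It masks windows whose Shannon entropy falls below a threshold. The function names, the window size, and the threshold are arbitrary assumptions chosen for illustration.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of a window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 2.2) -> str:
    """Replace residues in every low-entropy window with 'X'."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < min_entropy:
            for j in range(i, i + window):
                masked[j] = "X"
    return "".join(masked)
```

Masked regions no longer contribute to alignment scores, so a compositionally biased segment cannot by itself produce an apparently significant hit.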

Alternatively or additionally, the problem might lie in non-critical transfer of annotation from an unreliable database record, i. Notably, our re-analysis shows that the annotations assigned by each of the three groups were not completely correct: one was an outright error; another one involved overprediction; and the third one, an underprediction.

Although less notorious than false predictions (false-positives, in statistical terms), lack of prediction, where a confident one is feasible with available methods, is still an error (a false-negative).

The case of the MG protein is quite similar except that there was no clear false prediction involved. Today's searches support the latter decision because no convincing, specific relationship between this protein and transporters for any particular amino acid could be detected (in fact, given the small repertoire of transporters in mycoplasmas, this one might have a broad specificity).

The MG protein was annotated as an oxidoreductase of different families in the original genome report and by Ouzounis and coworkers, whereas Koonin and coworkers predicted that it was an ATP GTP? In , database searches immediately identify this protein as HPr kinase (this annotation is now correctly assigned to MG in GenBank), a regulator of the sugar phosphotransferase system, which indeed is a P-loop-containing, ATP-utilizing enzyme.

Back in , this was the only informative annotation that could be derived for this protein; HPr kinase genes had not been identified at the time.

Once again, the specific source of the oxidoreductase assignments is hard to determine; spurious hits, non-critical use of incorrect database annotations, or a combination thereof must have caused this. The case of MG is of particular interest. A database search detects highly significant similarity with numerous proteins that are annotated primarily as PMSR and, in some cases, as PilB-related repressors.

In reality, this protein is indeed a recently characterized, distinct form of PMSR, MsrB, which is evolutionarily unrelated to, but is often associated with, the classic PMSR, MsrA, either as part of a multidomain protein or as a separate gene in the same operon.

Furthermore, in several bacteria, these two domains are fused to a third, thioredoxin domain. The three-domain protein of Neisseria gonorrhoeae has been characterized as a regulator of pili operon expression, and this is what caused the annotation of MG as PilB, which was reproduced by two groups.

Unrecognized multidomain architecture of either the analyzed protein or its homologs or both is a common cause of erroneous annotation. In retrospect, this looks like overprediction combined with insufficient information included in the annotation.

A straightforward annotation of MG as a PMSR-associated domain, perhaps with an extra prediction of redox activity on the basis of conservation of cysteines in this domain, the way it has been done in a subsequent publication, would have been appropriate. We revisit this interesting set of proteins when discussing context analysis in Section 5. While considering only four proteins with contradictory annotations, we encountered all the main sources of systematic error in genome annotation.

We believe that this brief discussion highlights more general problems beyond these specific causes of errors. Even the apparently correct database annotations are insufficiently informative. Typically, the records do not include the evidence behind the prediction or include only minimal data that may be hard to interpret, such as E-values of the hits to particular domains. In this situation, any complicated case will not be represented adequately e.

In addition, there is no controlled vocabulary for genome annotation, which creates numerous semantic problems, although an attempt to correct this situation is being undertaken in the form of the Gene Ontology project [60].

The above discussion shows that the general state of genome annotation is far from being satisfactory. What can be done to improve it? Curation, however, implies that databases other than GenBank will have to be employed because GenBank, by definition, is an archival database (Chapter 3).

It appears that the future and, to some degree, already the present of genome annotation lies in specialized databases that actually function as annotation tools. Conceptually, the advantage of this approach may be viewed as reduction and structuring of the search space for genome annotation. Genome annotation is already starting to change through the use of the new generation of databases and tools.

However, smooth integration of these and development of new, richer formats for annotation are things of the future. In the next subsection, we turn to a specific example to illustrate how the use of COGs helps genome annotation. Aeropyrum pernix was the first representative of the Crenarchaeota (one of the two major branches of archaea; see Chapter 6) and the first aerobic archaeon whose genome has been sequenced.

Given the intrinsic interest of the first crenarchaeal genome and also because of the unexpectedly low fraction of predicted genes that were assigned functions in the original report, A. Whenever A. To identify possible diverged COG members from A. Conversely, unexpected occurrence of A. The second round serves two purposes: first, to assign paralogs, that might have been missed in the first round, to existing COGs; and, second, to create new COGs from unassigned proteins.

The results of COG assignment for A. The number of identified false-negatives was even lower, but in this case, of course, it is not possible to determine how many proteins remain unassigned. It is further notable that the great majority of assigned proteins belonged to pre-existing COGs, which facilitates a nearly automatic annotation.

Assignment of predicted Aeropyrum pernix proteins to COGs. Altogether, 1, A. Some of these proteins were members of functionally uncharacterized COGs. These newly annotated A. Similarly, important functions in DNA replication and repair were confidently assigned to a considerable number of A.

The case of the large subunit of the archaeal-eukaryotic primase is particularly illustrative of the contribution of different types of inference to genome annotation. However, given the ubiquity of this subunit in euryarchaea and eukaryotes and the presence of a readily detectable small primase subunit in A. When the A. An interesting case of re-annotation of a protein with a critical function, which also led to more general conclusions, is the archaeal uracil DNA glycosylase UDG; COG The reason for the erroneous annotation of these proteins as DNA polymerases is already well familiar to us: independent fusion of the uracil DNA glycosylase with DNA polymerases was detected in bacteriophage SPO1 and in Yersinia pestis [ 44 ].

Although these fusions hampered the correct annotation in the original analysis of the archaeal genomes, they seem to be functionally informative, suggesting that this type of UDG functions in conjunction with the replicative DNA polymerase. The 1, COG members from A. It seems most likely that this was due to an overestimate of the total number of ORFs in the genome. Many of the A. This range is also consistent with the size of the A.

More conservatively, ORFs, originally annotated as probable protein-coding genes, significantly overlapped with COG members and could be confidently eliminated, which brings the total number of protein-coding genes in A. This regrettable pollution emphasizes the value of specialized, curated databases that are free of apparitions. Despite this overrepresentation of ORFs in A. This pilot analysis, while falling far short of the goal of comprehensive genome annotation, highlights some advantages of specialized comparative-genomic databases as annotation tools.
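Flagging predicted ORFs that substantially overlap better-supported genes, as discussed above, can be sketched as follows. The source does not state what degree of overlap was considered significant, so the 50% threshold, the function names, and the decision to ignore strand are all assumptions made for this illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based, inclusive


def overlap_len(a: Interval, b: Interval) -> int:
    """Number of positions shared by two intervals (0 if they are disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def questionable_orfs(orfs: List[Interval], supported: List[Interval],
                      min_frac: float = 0.5) -> List[Interval]:
    """Return ORFs for which at least min_frac of the length overlaps a supported gene."""
    flagged = []
    for orf in orfs:
        length = orf[1] - orf[0] + 1
        if any(overlap_len(orf, gene) / length >= min_frac for gene in supported):
            flagged.append(orf)
    return flagged
```

Here `supported` would hold the coordinates of genes backed by independent evidence, such as membership in a conserved family, and the flagged ORFs are candidates for removal from the gene count rather than automatic deletions.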

In this particular case, the original annotation probably had been overly conservative, which partly accounts for the large increase in the functional prediction rate. However, the employed protocol is general and, with modifications and addition of some extra procedures, has been used in primary genome analysis. In other genome projects, the WIT system has been employed in a conceptually similar manner. As shown above, this type of analysis yields reasonable accuracy of annotation, even when applied in a fully automated mode Table 5.

However, additional expert contribution, particularly in the form of context analysis discussed in the next section, adds substantial value to genome annotation. All the preceding discussion in this chapter centered on prediction of the functions of proteins encoded in sequenced genomes by extrapolating from the functions of their experimentally characterized homologs. The success of this approach depends on the sensitivity and selectivity of the methods that are used for detecting sequence similarity see Chapter 4 and on the employed rules of inference see 5.

There is no doubt that homology analysis remains the central methodology of genomics, i. However, a group of recently developed approaches in comparative genomics goes beyond sequence or structure comparison. These methods have become collectively and, we think, aptly known as genome context analysis.

More specifically, context in comparative genomics pertains to phyletic profiles of protein families, domain fusions in multidomain proteins, gene adjacency in genomes, and expression patterns. Indeed, genes whose products are involved in closely related functions e. This simple logic gives us a potentially powerful way to assign genes that have no experimentally characterized homologs to particular pathways or cellular systems.
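Of the context types listed above, phyletic profiles are the simplest to illustrate. A profile records, for each genome in a fixed order, whether a protein family is present (1) or absent (0), and families with matching or nearly matching profiles are candidates for functional linkage. The sketch below is a minimal illustration under stated assumptions: the function names, the family names in the usage note, and the Hamming-distance cutoff are hypothetical.

```python
from typing import Dict, List, Tuple


def profile_distance(a: str, b: str) -> int:
    """Hamming distance between two presence/absence profiles ('1'/'0' per genome)."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same set of genomes")
    return sum(x != y for x, y in zip(a, b))


def similar_profiles(query: str, profiles: Dict[str, str],
                     max_dist: int = 1) -> List[Tuple[str, int]]:
    """Families whose profile differs from the query in at most max_dist genomes."""
    hits = [(name, profile_distance(query, prof)) for name, prof in profiles.items()]
    return sorted((h for h in hits if h[1] <= max_dist), key=lambda h: (h[1], h[0]))
```

For example, with profiles `{"famA": "110101", "famB": "110111", "famC": "001010"}` and the query `"110101"`, the call returns famA (distance 0) and famB (distance 1) but not famC. Such a match is a hypothesis about shared function, not a proof of it.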



