
Hereditas (Beijing) ›› 2020, Vol. 42 ›› Issue (7): 691-702. doi: 10.16288/j.yczz.20-022

• Research Article •

Comprehensive re-annotation of protein-coding genes for prokaryotic genomes by Z-curve and similarity-based methods

Shuo Liu1, Zhi Zeng1, Fancai Zeng2, Mengze Du2

  1. School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
  2. Department of Biochemistry and Molecular Biology, School of Basic Medicine, Southwest Medical University, Luzhou 646000, China
  • Received: 2020-02-20 Revised: 2020-05-11 Online: 2020-07-20 Published: 2020-06-01
  • Contact: Du Mengze E-mail: du_mengze@foxmail.com
  • Supported by:
    Science Strength Improvement Plan of University of Electronic Science and Technology of China (No. Y0301902610100202)

Abstract:

Advances in sequencing technology have generated vast amounts of genomic sequence data and greatly enriched public genetic resources. To analyze such data, algorithms and tools for genome comparison and annotation are continually updated, enabling more accurate genome annotation. Many prokaryotic genomes in public databases were sequenced and assembled more than a decade ago and contain many genes of unknown function. To improve the current annotation of these genomes in NCBI, we re-annotated 1587 bacterial and archaeal genomes using multiple prokaryotic gene-finding algorithms and software tools together with gene expression data. First, the 33 Z-curve variables were applied to identify sequences over-annotated as protein-coding genes among the 1587 bacterial and archaeal genomes deposited in public databases; a total of 3092 such sequences, belonging to 177 genomes, were identified. Next, 4447 protein-coding genes of unknown function from 939 genomes were assigned definite functions by similarity search. Finally, we identified 2003 missed protein-coding genes belonging to known COGs (clusters of orthologous groups of proteins) in nine genomes using three methods (ZCURVE 3.0, Glimmer 3.02 and Prodigal), which are accurate, widely used for gene finding, and based on different, complementary algorithms. This comprehensive re-annotation of bacterial and archaeal genomes, which combines new tools with multi-omics data, should provide a reference for annotating newly sequenced strains and benefit further fundamental research that uses the re-annotated bacterial gene sequences.

Key words: bacteria, re-annotation, Z-curve, hypothetical ORFs, non-coding ORFs
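
The abstract refers to "the 33 Z-curve variables" without spelling them out. As commonly described in the Z-curve literature, they consist of 9 variables from the base frequencies at the three codon positions and 24 variables from the dinucleotide frequencies at codon positions 1-2 and 2-3. The following is a minimal sketch of that decomposition, not the authors' implementation: the function names `zcurve_33` and `_xyz` are hypothetical, and normalizing dinucleotide counts by the number of codons is an assumption made for illustration.

```python
from collections import Counter

BASES = "ACGT"


def _xyz(freq):
    """Z-curve transform of four base frequencies keyed by A, C, G, T."""
    a, c, g, t = (freq.get(b, 0.0) for b in BASES)
    return [
        (a + g) - (c + t),  # x: purine vs. pyrimidine
        (a + c) - (g + t),  # y: amino vs. keto
        (a + t) - (g + c),  # z: weak vs. strong hydrogen bonding
    ]


def zcurve_33(orf):
    """Return 33 Z-curve variables for an in-frame ORF (hypothetical helper)."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    # Skip codons containing ambiguous symbols such as N.
    codons = [cod for cod in codons if set(cod) <= set(BASES)]
    if not codons:
        raise ValueError("no unambiguous codons in sequence")
    n = len(codons)

    features = []
    # 9 variables: base frequencies at codon positions 1, 2 and 3.
    for pos in range(3):
        counts = Counter(cod[pos] for cod in codons)
        features += _xyz({b: counts[b] / n for b in BASES})
    # 24 variables: dinucleotides at codon positions 1-2 and 2-3,
    # grouped by first base (assumed normalization: per codon).
    for start in (0, 1):
        counts = Counter(cod[start:start + 2] for cod in codons)
        for first in BASES:
            features += _xyz({b: counts[first + b] / n for b in BASES})
    return features
```

In a classifier such as the one the abstract describes, each candidate ORF would be mapped to such a 33-dimensional vector and then separated into coding and non-coding classes; the discriminant step itself is not shown here.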