Hereditas (Beijing) ›› 2020, Vol. 42 ›› Issue (7): 691-702. doi: 10.16288/j.yczz.20-022

• Research Article •

Comprehensive re-annotation of protein-coding genes for prokaryotic genomes by Z-curve and similarity-based methods

Shuo Liu1, Zhi Zeng1, Fancai Zeng2, Mengze Du2

  1. School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
  2. Department of Biochemistry and Molecular Biology, School of Basic Medicine, Southwest Medical University, Luzhou 646000, China
  • Received: 2020-02-20 Revised: 2020-05-11 Published: 2020-07-20 Online: 2020-06-01
  • Contact: Mengze Du E-mail: du_mengze@foxmail.com
  • About the author: Shuo Liu, PhD candidate; research area: microbial genomics. E-mail: liushuo20022020@gmail.com
  • Supported by:
    Science Strength Improvement Plan of the University of Electronic Science and Technology of China (No. Y0301902610100202)

Abstract:

Advances in sequencing technology have generated vast amounts of genomic sequence data and greatly enriched public genetic resources. To handle these data, algorithms and tools for genome comparison and annotation are continually updated, which makes it possible to combine several annotation tools and obtain more accurate annotations of protein-coding genes. Some prokaryotic genomes in public databases were sequenced and assembled more than a decade ago, and they contain many predicted protein-coding genes of unknown function. To improve the annotation quality of genomes in the National Center for Biotechnology Information (NCBI) database, we re-annotated 1587 bacterial and archaeal genomes using multiple prokaryotic gene recognition algorithms and software tools together with gene expression data. First, using the 33 Z-curve variables, we identified 3092 sequences in the original annotations of 177 genomes that had been over-annotated as protein-coding genes. Second, 4447 protein-coding genes of unknown function from 939 genomes were assigned specific functions by homology (similarity) search. Finally, to find missed genes, we combined three gene-finding programs, ZCURVE 3.0, Glimmer 3.02 and Prodigal, which are accurate, widely used, and complementary because they are based on different algorithms. In total, 2003 missed protein-coding genes were found in nine genomes; these genes belong to multiple clusters of orthologous groups of proteins (COG). This study re-annotates early-sequenced bacterial and archaeal genomes with newer tools combined with multi-omics data. It provides a reference for annotating newly sequenced strains, and the gene sequences obtained after re-annotation should benefit subsequent fundamental research.

Key words: bacteria, re-annotation, Z-curve, hypothetical ORFs, non-protein-coding ORFs
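
The abstract refers to "33 Z-curve variables" without spelling them out. As a rough illustration, the sketch below follows the commonly described construction: 9 variables from base frequencies at the three codon positions, plus 24 from dinucleotide frequencies at codon positions 1-2 and 2-3. This split, the function name `zcurve33`, and the handling of ambiguous bases are assumptions made for illustration; the abstract does not confirm that the authors' pipeline computes its features this way, and the classifier that would consume these features is not shown.

```python
BASES = "ACGT"


def _xyz(freq):
    """Turn frequencies of A, C, G, T into the three Z-curve components."""
    a, c, g, t = (freq[b] for b in BASES)
    return [
        (a + g) - (c + t),  # x: purine vs pyrimidine
        (a + c) - (g + t),  # y: amino vs keto
        (a + t) - (g + c),  # z: weak vs strong hydrogen bonding
    ]


def zcurve33(orf):
    """Return a 33-element Z-curve feature vector for an in-frame ORF.

    Hypothetical helper: codons containing characters other than
    A/C/G/T are skipped, and a trailing partial codon is ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    codons = [orf[i:i + 3] for i in range(0, usable, 3)]
    codons = [c for c in codons if set(c) <= set(BASES)]
    if not codons:
        raise ValueError("sequence contains no complete unambiguous codon")
    n = len(codons)

    features = []
    # 9 variables: base frequencies at codon positions 1, 2 and 3.
    for pos in range(3):
        freq = {b: sum(c[pos] == b for c in codons) / n for b in BASES}
        features.extend(_xyz(freq))
    # 24 variables: dinucleotides at codon positions 1-2 and 2-3,
    # grouped by the first base of the dinucleotide.
    for start in (0, 1):
        for first in BASES:
            freq = {
                b: sum(c[start:start + 2] == first + b for c in codons) / n
                for b in BASES
            }
            features.extend(_xyz(freq))
    return features
```

Calling `zcurve33(sequence)` on an annotated ORF yields a fixed-length vector that could be passed to a discriminant (for example a linear classifier trained on known coding and non-coding sequences) to flag candidates that look over-annotated; that downstream step is left out here because the abstract does not describe it.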