Hereditas (Beijing)

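• Research Article •

Re-annotation of prokaryotic protein-coding genes based on sequence similarity and the Z-curve method

Shuo Liu

  1. University of Electronic Science and Technology of China
  • Received: 2020-02-20 Revised: 2020-05-26 Online: 2020-06-01 Published: 2020-06-01
  • Corresponding author: Shuo Liu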

Comprehensive re-annotation of protein-coding genes for prokaryotic genomes by Z-curve and similarity-based methods

Shuo Liu

  1. University of Electronic Science and Technology of China
  • Received: 2020-02-20 Revised: 2020-05-26 Online: 2020-06-01 Published: 2020-06-01
  • Contact: Shuo Liu

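Abstract: Some prokaryotic genome sequences and assemblies in public databases date back more than a decade and contain many predicted coding genes of unknown function. To improve the annotation quality of genomes in the National Center for Biotechnology Information (NCBI) database, this study re-annotated 1587 bacterial and archaeal genomes by combining multiple prokaryotic gene-finding algorithms and software with gene expression data. First, the 33 Z-curve variables were used to identify 3092 sequences that had been over-annotated as protein-coding genes in the original annotations of 177 genomes. Second, homology searches assigned specific functions to 4447 protein-coding genes of unknown function in 939 genomes. Finally, three gene-finding programs (ZCURVE 3.0, Glimmer 3.02, and Prodigal), which are accurate, widely used, and built on different and complementary algorithms, were combined to search for missed genes. In total, 2003 missed protein-coding genes were found in nine genomes; these genes belong to multiple clusters of orthologous groups of proteins (COG). By re-annotating early-sequenced bacterial and archaeal genomes with new tools and multi-omics data, this study provides a reference annotation method for newly sequenced strains, and the re-annotated bacterial gene sequences should also support subsequent basic research.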

Key words: bacteria, re-annotation, Z-curve, hypothetical ORFs, non-protein-coding ORFs

Abstract: Many prokaryotic genomes in public databases were sequenced and assembled more than a decade ago, and they contain many predicted genes with unknown functions. To improve the current annotation of those genomes in NCBI, we re-annotated 1587 bacterial and archaeal genomes using multiple prokaryotic gene recognition algorithms and software together with gene expression data. First, the 33 Z-curve variables were applied to the original annotations of these genomes, and a total of 3092 sequences belonging to 177 genomes were recognized as over-annotated protein-coding genes. Next, 4447 protein-coding genes with unknown functions from 939 genomes were assigned definite functions by similarity search. Finally, we recognized 2003 missed protein-coding genes in nine genomes, all belonging to known COG (clusters of orthologous groups of proteins), using three methods (ZCURVE 3.0, Glimmer 3.02, and Prodigal) that are accurate, frequently used for gene finding, and based on different and complementary algorithms. This comprehensive re-annotation of bacterial and archaeal genomes, which combines new tools with multi-omics data, should provide a reference for annotating newly sequenced strains and should benefit further fundamental research that uses the re-annotated bacterial gene sequences.

Key words: bacteria, re-annotation, Z-curve, hypothetical ORFs, non-coding ORFs
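
The abstract refers to 33 Z-curve variables without defining them. The sketch below shows one common construction from the ZCURVE literature, which is an assumption here because the abstract does not spell it out: 9 variables from base frequencies at the three codon positions, plus 24 variables from dinucleotide frequencies at codon positions 1-2 and 2-3. The function names `zcurve33` and `_xyz` are hypothetical, and the classifier that ZCURVE applies on top of these features is not shown.

```python
from itertools import product

BASES = "ACGT"


def _xyz(freq):
    """Map four base frequencies to the three Z-curve components."""
    a, c, g, t = (freq[b] for b in BASES)
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return [x, y, z]


def zcurve33(orf):
    """Return 33 Z-curve variables for an in-frame ORF (assumed layout)."""
    orf = orf.upper()
    n_codons = len(orf) // 3
    if n_codons == 0:
        raise ValueError("ORF must contain at least one codon")
    end = n_codons * 3  # ignore any trailing partial codon
    features = []

    # 9 variables: base frequencies at codon positions 1, 2 and 3.
    for pos in range(3):
        counts = dict.fromkeys(BASES, 0)
        for i in range(pos, end, 3):
            if orf[i] in counts:  # skip ambiguous bases such as N
                counts[orf[i]] += 1
        total = sum(counts.values()) or 1
        features += _xyz({b: counts[b] / total for b in BASES})

    # 24 variables: dinucleotides at codon positions 1-2 and 2-3,
    # with one (x, y, z) triple per first base.
    for pos in (0, 1):
        counts = {a + b: 0 for a, b in product(BASES, repeat=2)}
        for i in range(pos, end, 3):
            di = orf[i:i + 2]
            if di in counts:
                counts[di] += 1
        total = sum(counts.values()) or 1
        for first in BASES:
            features += _xyz({b: counts[first + b] / total for b in BASES})

    return features
```

Calling `zcurve33("ATGAAATAA")` returns a list of 33 floats, each between -1 and 1. In a re-annotation setting, vectors like these would be computed for annotated ORFs and passed to a separate classifier to flag candidates that look non-coding.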