Hereditas (Beijing) (遗传) ›› 2011, Vol. 33 ›› Issue (6): 654-660. doi: 10.3724/SP.J.1005.2011.00654

• Research Report •

Mining and composition analysis of transposable elements from raw cyanobacterial whole-genome sequencing data

XIAO Peng (肖鹏)1, 2, LI Ren-Hui (李仁辉)1

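  1. Key Laboratory of Aquatic Biodiversity and Conservation of the Chinese Academy of Sciences, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072
  2. Graduate University of Chinese Academy of Sciences, Beijing 100049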
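  • Received: 2010-09-03 Revised: 2010-12-20 Online: 2011-06-20 Published: 2011-06-25
  • Corresponding author: LI Ren-Hui E-mail: reli@ihb.ac.cn
  • Funding:

    Supported by a project of the State Key Laboratory of Freshwater Ecology and Biotechnology (No. 2011FB17) and the National Basic Research Program of China (973 Program) (No. 2008CB418002)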

Cyanobacterial genome transposable element mining and analysis based on 454 deep-sequencing data set

XIAO Peng1, 2, LI Ren-Hui1   

  1. Key Laboratory of Aquatic Biodiversity and Conservation Biology, Institute of Hydrobiology, Chinese Academy of Sciences, Wuhan 430072, China
  2. Graduate University of Chinese Academy of Sciences, Beijing 100049, China
  • Received:2010-09-03 Revised:2010-12-20 Online:2011-06-20 Published:2011-06-25
  • Contact: LI Ren-Hui E-mail:reli@ihb.ac.cn

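Abstract (translated from the Chinese): Second-generation sequencing and comparative analysis of whole-genome diversity are focal points of modern biology and information science, and the analysis of transposable elements in genomes has become an important component of comparative genomics. At present, the types, numbers, and composition of transposable elements are generally mined and analyzed from completely assembled whole-genome sequences. The post-processing and assembly of the massive number of short reads that precede this stage remain a blind spot in genome research, and repetitive sequences, which consist mainly of transposable elements, unavoidably suffer assembly errors or loss during assembly, which introduces uncertainty into the systematic analysis of transposable elements. This paper aims to establish an analysis pipeline and uses it to analyze the transposable element composition (mainly insertion sequences, IS) of a randomly simulated Roche 454 raw sequencing data set constructed from the complete genome of Microcystis aeruginosa NIES 843. The most reliable results were obtained by dividing the candidate sequences retained after nucleotide probe scanning into three groups and setting a separate amino acid detection threshold for each group. These results show that cyanobacterial transposable elements make up 10.38% of the M. aeruginosa NIES 843 genome and belong to 14 IS families and 66 IS subfamilies. Compared with the results of two different analysis pipelines previously applied to the completely assembled genome, there was no significant difference in abundance or in family/subfamily composition, and a high proportion of similar sequences overlapped at the level of the transposable element sequences, which confirms that the pipeline is reliable and practical for analyzing transposable elements from raw high-throughput sequencing data.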

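Keywords (translated from the Chinese): cyanobacterial genome, insertion sequence, IS family, transposable element, Roche 454 raw sequencing data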

Abstract: Next-generation sequencing (NGS) and comparative whole-genome analysis are active areas of current research, and the analysis of transposable element composition and abundance has become an important part of comparative genomics. Transposable elements are generally mined and analyzed from completely assembled genomes; however, post-processing and assembly of the huge number of short reads produced by the 454 sequencer remain problematic. Moreover, repeats, which consist largely of transposable elements, are often misassembled or lost during assembly, which makes the results of transposable element analysis uncertain. This study aimed to build a framework for automatically analyzing insertion sequence (IS) abundance and composition from a simulated Roche 454 deep-sequencing data set representing 33-fold coverage of the Microcystis aeruginosa NIES 843 genome. The most reliable results were obtained by dividing the IS element candidates into three classes and applying a separate transposase examination threshold to each class. Under this setting, IS elements accounted for 10.38% of the simulated data set and belonged to 14 IS families and 66 IS subfamilies. These values did not differ significantly from two previous sets of results based on the assembled M. aeruginosa NIES 843 genome, and the IS element sequences showed a high percentage of overlap with those results, indicating the reliability of this framework.

Key words: cyanobacterial genome, insertion sequence, IS family, transposable element, Roche 454 raw sequencing data
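
The abstract describes a simulated read set at 33-fold coverage of a known genome. As a rough illustration of what that means, the sketch below draws fixed-length reads at uniformly random positions until the total read length reaches a target coverage, using the relation coverage = (number of reads × read length) / genome length. It is not the authors' simulation procedure: the function names, the fixed read length, the error-free reads, the single strand, and the linear (non-circular) genome are all simplifying assumptions made here, and real 454 reads vary in length and carry homopolymer errors.

```python
import random


def reads_for_coverage(genome_length, read_length, coverage):
    """Smallest read count N with N * read_length / genome_length >= coverage.

    All arguments are positive integers (hypothetical helper, not from the paper).
    """
    # Ceiling division on integers: ceil(a / b) == -(-a // b)
    return -(-coverage * genome_length // read_length)


def simulate_reads(genome, read_length, coverage, seed=0):
    """Sample error-free, fixed-length reads uniformly from a linear genome string."""
    rng = random.Random(seed)  # seeded so the toy example is repeatable
    n_reads = reads_for_coverage(len(genome), read_length, coverage)
    last_start = len(genome) - read_length  # last valid 0-based start position
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint includes both endpoints
        reads.append(genome[start:start + read_length])
    return reads


if __name__ == "__main__":
    # Toy 1,000 bp "genome"; a real genome would be read from a FASTA file.
    toy_genome = "".join(random.Random(1).choices("ACGT", k=1000))
    toy_reads = simulate_reads(toy_genome, read_length=100, coverage=33)
    print(len(toy_reads))  # 330 reads of 100 bp give 33-fold coverage of 1,000 bp
```

Because the genome is known in a simulation like this, every read has a known origin, which is what makes it possible to compare results from raw reads against results from the assembled genome.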