Hereditas (Beijing), 2014, Vol. 36, Issue (6): 618-624. doi: 10.3724/SP.J.1005.2014.0618

• Techniques and Methods •

An automated analysis pipeline for next-generation genome sequencing data

Wenke Li1, Fengyu Li1, 2, Siyao Zhang1, Bin Cai1, Na Zheng1, Yu Nie1, Dao Zhou2, Qian Zhao1

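  1. State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100037, China;
    2. College of Biomedical Engineering, South-Central University for Nationalities, Wuhan 430074, China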
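  • Received: 2013-09-07 Revised: 2014-01-20 Online: 2014-06-20 Published: 2014-05-28
  • Corresponding author: Qian Zhao, PhD, Associate Researcher; research interests: genetics, bioinformatics. E-mail: zhaoqian82@gmail.com E-mail: wksofia@gmail.com
  • About the author: Wenke Li, MSc, Assistant Researcher; research interests: bioinformatics. Tel: 010-88396071; E-mail: wksofia@gmail.com
  • Supported by:

    the National Basic Research Program of China (973 Program) (No. 2010CB529505) and the Fundamental Research Funds for the Central Universities (No. 2012-XHGX02)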

Automatic analysis pipeline of next-generation sequencing data

Wenke Li1, Fengyu Li1, 2, Siyao Zhang1, Bin Cai1, Na Zheng1, Yu Nie1, Dao Zhou2, Qian Zhao1   

  1. State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Disease, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100037, China;
    2. College of Biomedical Engineering, South-Central University for Nationalities, Wuhan 430074, China
  • Received: 2013-09-07 Revised: 2014-01-20 Online: 2014-06-20 Published: 2014-05-28

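Abstract (translated from the Chinese):

Advances in next-generation sequencing place heavy demands on the processing and analysis of sequencing data. Many software tools for next-generation sequencing data exist, but the vast majority perform only a single analysis function (for example, only sequence alignment, variant calling, or functional annotation), so selecting and integrating these tools correctly and efficiently has become an urgent need. This article presents an automated pipeline, built on the Perl language and SGE resource management, for analyzing genome sequencing data from the Illumina platform. The pipeline takes raw sequencing reads as input, calls industry-standard data processing software (such as BWA, Samtools, GATK, and ANNOVAR), and ultimately produces a list of variant sites with functional annotation that researchers can analyze further. Automated parallel scripts keep the pipeline running efficiently, and results and reports are delivered in one stop, which reduces the manual work involved in data analysis and greatly improves running efficiency. Users only need to fill in a configuration file or enter parameters through the graphical interface to complete the entire procedure. This work offers researchers a convenient way to analyze next-generation sequencing data.

Keywords (translated from the Chinese): next-generation sequencing, automated data analysis, pipeline, variant detection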

Abstract:

The development of next-generation sequencing has created a high demand for data processing and analysis. Although many software tools exist for analyzing next-generation sequencing data, most are designed for a single function (e.g., alignment, variant calling, or annotation). It is therefore necessary to combine them for data analysis and to generate results that biologists can interpret. We designed a pipeline, based on the Perl programming language and the Sun Grid Engine (SGE) resource manager, to process Illumina genome sequencing data. The pipeline takes raw sequence data (FASTQ format) as input, calls standard data processing software (e.g., BWA, Samtools, GATK, and ANNOVAR), and outputs a list of annotated variants that researchers can analyze further. Through automation and parallel computation, the pipeline reduces manual operation and improves efficiency. Users run the pipeline by editing a configuration file or by entering the settings in a graphical interface. Our work will facilitate research projects that use next-generation sequencing.

Key words: next-generation sequencing, automatic data analysis, pipeline, variant detection
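
The abstract describes a configuration-driven chain of tools (BWA, Samtools, GATK, ANNOVAR) submitted through SGE. The sketch below illustrates how such a chain might be expressed. It is written in Python for illustration, not in the authors' Perl, and it only builds `qsub` command strings without running anything. The stage names, the configuration keys (`sample`, `ref`, `fq1`, `fq2`, `annodb`), the function `build_qsub_commands`, and every tool command line are assumptions, since the abstract does not give the exact commands or tool versions.

```python
import shlex

# Illustrative stage templates. The abstract names the tools but not the
# exact invocations, so each command line below is an assumption.
STAGES = [
    ("align", "bwa mem {ref} {fq1} {fq2} > {sample}.sam"),
    ("sort", "samtools sort -o {sample}.bam {sample}.sam"),
    ("index", "samtools index {sample}.bam"),
    ("call", "java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper "
             "-R {ref} -I {sample}.bam -o {sample}.vcf"),
    ("annotate", "table_annovar.pl {sample}.vcf {annodb} -buildver hg19 "
                 "-out {sample} -protocol refGene -operation g -vcfinput"),
]


def build_qsub_commands(config):
    """Return one qsub command line per stage for a single sample.

    Each stage after the first carries -hold_jid <previous job name>,
    so SGE starts it only after the preceding stage has finished.
    """
    commands = []
    previous_job = None
    for stage, template in STAGES:
        job_name = f"{config['sample']}_{stage}"
        shell_cmd = template.format(**config)
        parts = ["qsub", "-N", job_name, "-cwd", "-b", "y"]
        if previous_job is not None:
            parts += ["-hold_jid", previous_job]
        # Wrap the stage in bash -c so redirections such as ">" still work.
        parts += ["bash", "-c", shell_cmd]
        commands.append(" ".join(shlex.quote(p) for p in parts))
        previous_job = job_name
    return commands


if __name__ == "__main__":
    # Hypothetical configuration values, standing in for a config file.
    example = {
        "sample": "S1",
        "ref": "ref.fa",
        "fq1": "S1_1.fq",
        "fq2": "S1_2.fq",
        "annodb": "humandb/",
    }
    for line in build_qsub_commands(example):
        print(line)
```

Calling `build_qsub_commands` once per sample yields independent job chains, which the scheduler can run side by side. This is one plausible reading of the "parallel computation" mentioned in the abstract, not a description of the published implementation.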