• Research Article •

### Effects of SNP chip genotyping call rate and genotyping error rate on genotype imputation accuracy in cattle

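1. College of Animal Science and Technology, Hunan Agricultural University, Changsha 410128
2. Department of Animal Science, University of Wyoming, Laramie, Wyoming 82071, USA
3. Department of Bioinformatics and Biostatistics, Neogen Corporation, Lincoln, Nebraska 68504, USA
4. Department of Animal Sciences, University of Wisconsin, Madison, Wisconsin 53706, USA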
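• Received: 2018-11-30 Revised: 2019-04-16 Published: 2019-07-20 Released online: 2019-05-28
• Corresponding authors: Guo Wei, Wu Xiao-Lin E-mail: wguo3@uwyo.edu; nwu@neogen.com
• About the author: Li Zhi, PhD candidate; research area: animal genetics and breeding. E-mail: zli13@uwyo.edu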
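• Funding:
Hundred Talents Program of Hunan Province; Key Research and Development Program of Hunan Province (2018NK2081); Hunan Collaborative Innovation Center for Livestock and Poultry Safety; and Key Science and Technology Program of Changsha City (kq1801014)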

### Impacts of SNP genotyping call rate and SNP genotyping error rate on imputation accuracy in Holstein cattle

Li Zhi1,2,3, He Jun1,3, Jiang Jun1,4, Tait Jr. Richard G.3, Bauck Stewart3, Guo Wei2, Wu Xiao-Lin1,3,4

1. College of Animal Science and Technology, Hunan Agricultural University, Changsha 410128, China
2. Department of Animal Science, University of Wyoming, Laramie WY 82071, USA
3. Biostatistics and Bioinformatics, Neogen GeneSeek, Lincoln NE 68504, USA
4. Department of Animal Sciences, University of Wisconsin, Madison WI 53706, USA
• Received: 2018-11-30 Revised: 2019-04-16 Published: 2019-07-20 Online: 2019-05-28
• Contact: Guo Wei, Wu Xiao-Lin E-mail: wguo3@uwyo.edu; nwu@neogen.com
• Supported by:
Hundred-Talent Project of Hunan Province; Key Research and Development Program of Hunan Province (2018NK2081); Hunan Innovation Center of Animal Safety Production; and Key Research and Development Program of Changsha City (kq1801014)

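SNP chips are widely used in genetic research and production practice for animals and plants, so the accuracy of their genotype calls is critical. In practice, however, a certain number of genotypes are missing and must be estimated (imputed). In addition, for various reasons it is often necessary to impute SNP genotypes that are absent from one chip using genotypes from another, or to impute from low-density to high-density SNP genotypes. Genotype imputation accuracy therefore directly affects the accuracy and reliability of downstream data analyses. To better understand the factors that influence imputation accuracy, this study used 50K SNP chip genotype data from 20 116 U.S. Holstein cattle to evaluate how SNP genotyping call rate and genotyping error rate affect downstream imputation accuracy under two scenarios: one in which the two factors are uncorrelated and one in which they are correlated. When they were uncorrelated, the simulated call rate was lowered from 100% to 50% and the simulated genotyping error rate was raised from 0% to 50%. When they were correlated, the relationship between call rate and error rate was set by a linear regression equation fitted to these two variables in a real dataset, so that as the simulated call rate fell from 100% to 50%, the genotyping error rate rose from 0% to 13.35%. Finally, imputation accuracy was assessed by 5-fold cross-validation. The results showed that when call rate and error rate in the original data occurred independently of each other, the imputation error rate was little affected by the original call rate (P>0.05) but increased significantly as the original genotyping error rate rose (P<0.01). When call rate and error rate in the original data were negatively correlated, the imputation error rate increased significantly as the original call rate decreased (P<0.01). In both scenarios, an SNP genotyping call rate above 90% is recommended for imputation accuracy to be no lower than 98%. These results can serve as a reference for improving quality control in practical SNP genotyping and downstream data analysis.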

Abstract:

Single nucleotide polymorphism (SNP) chips are widely used in genetic studies and breeding applications in animal and plant species, so the quality of SNP genotypes is of paramount importance. In practice, a number of genotype calls often fail and must be imputed. Ungenotyped loci may also need to be imputed between different chips, or high-density genotypes imputed from low-density genotypes. In these circumstances, the validity and reliability of subsequent data analyses depend on the accuracy of the imputed genotypes. To better understand the factors affecting imputation accuracy, the present study investigated the impacts of SNP genotyping call rate and SNP genotyping error rate on the accuracy of genotype imputation under two scenarios in 20 116 U.S. Holstein cattle, each genotyped with a GGP 50K SNP chip. In scenario 1, the two factors were independent of each other: the simulated genotyping call rate was varied from 100% to 50% and the simulated genotyping error rate from 0% to 50%. In scenario 2, the genotyping error rate was correlated with the genotyping call rate, and the relationship was established by fitting a linear regression model between the two variables on a real dataset; that is, as the simulated SNP call rate decreased from 100% to 50%, the SNP genotyping error rate increased from 0% to 13.55%. Finally, 5-fold cross-validation was used to assess the subsequent imputation accuracy. The results showed that when the original SNP genotyping call rate was independent of the SNP genotyping error rate, imputation accuracy did not change significantly with the original genotyping call rate (P>0.05), but it decreased significantly as the genotyping error rate increased (P<0.01). However, when the original genotyping call rate was negatively correlated with the genotyping error rate, the imputation error increased significantly as the original call rate decreased and the genotyping error rate correspondingly rose (P<0.01). In both scenarios, a genotyping call rate of at least 90% is needed to obtain a genotype imputation accuracy of 98% or higher. These results can provide guidance for establishing quality assurance criteria for SNP genotyping in practice.
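
The simulation design summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the 0/1/2 genotype coding with -1 for a no-call, and the straight-line link between call rate and error rate (zero errors at a 100% call rate, a caller-supplied maximum at a 50% call rate) are assumptions made for illustration. The abstract does not name the imputation software, so `mode_impute` is a deliberately crude placeholder that fills each no-call with the most common genotype at that SNP.

```python
import numpy as np

MISSING = -1  # assumed code for a no-call


def linked_error_rate(call_rate, error_at_min_call, min_call=0.5):
    """Scenario 2 (assumed form): error rate falls linearly from
    error_at_min_call at call rate min_call to 0 at call rate 1.0."""
    return error_at_min_call * (1.0 - call_rate) / (1.0 - min_call)


def degrade(genotypes, call_rate, error_rate, rng):
    """Return a copy of a 0/1/2 genotype matrix with miscalls and no-calls."""
    g = genotypes.copy()
    wrong = rng.random(g.shape) < error_rate
    # Adding 1 or 2 modulo 3 always yields a different genotype code.
    shift = rng.integers(1, 3, size=g.shape)
    g[wrong] = (g[wrong] + shift[wrong]) % 3
    # Mask a random fraction (1 - call_rate) of entries as no-calls.
    g[rng.random(g.shape) >= call_rate] = MISSING
    return g


def mode_impute(train, target):
    """Placeholder imputer: fill no-calls in target with the most common
    called genotype at each SNP among the training animals."""
    counts = np.stack([(train == code).sum(axis=0) for code in (0, 1, 2)])
    fill = counts.argmax(axis=0)
    out = target.copy()
    rows, cols = np.nonzero(out == MISSING)
    out[rows, cols] = fill[cols]
    return out


def cv_imputation_accuracy(truth, observed, rng, k=5):
    """k-fold cross-validation over animals: impute each fold from the
    other folds and score the imputed no-calls against the true genotypes."""
    folds = np.array_split(rng.permutation(truth.shape[0]), k)
    accuracies = []
    for fold in folds:
        train = np.delete(observed, fold, axis=0)
        test = observed[fold]
        imputed = mode_impute(train, test)
        was_missing = test == MISSING
        if was_missing.any():
            hits = imputed[was_missing] == truth[fold][was_missing]
            accuracies.append(hits.mean())
    return float(np.mean(accuracies)) if accuracies else float("nan")


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    truth = rng.integers(0, 3, size=(500, 100))  # synthetic animals x SNPs
    for call_rate in (1.0, 0.9, 0.7, 0.5):
        # 0.10 is an arbitrary illustrative maximum, not a value from the study.
        err = linked_error_rate(call_rate, error_at_min_call=0.10)
        observed = degrade(truth, call_rate, err, rng)
        acc = cv_imputation_accuracy(truth, observed, rng)
        print(f"call rate {call_rate:.2f}, error rate {err:.3f}, accuracy {acc:.3f}")
```

For scenario 1, `degrade` would instead be called over a grid of independently chosen call rates and error rates. Because the example genotypes are random and the imputer ignores linkage between SNPs, the accuracies it prints say nothing about the values reported in the study.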