Hereditas (Beijing) (遗传), 2022, Vol. 44, Issue 11: 1028-1043. doi: 10.16288/j.yczz.22-073

• Research Report •

  • About the author: Yongqiang Kong, master's student; field of study: biology. E-mail: kongyongqiang@tmu.edu.cn

Optimization scheme of machine learning model for genetic division between northern Han, southern Han, Korean and Japanese

Yongqiang Kong1, Jinkai Liu1, Jiaqi Gu2, Jingyi Xu1, Yunuo Zheng2, Yiliang Wei2, Shaoyuan Wu1,2

  1. Key Laboratory of Tianjin for Epigenetics, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
  2. Key Laboratory of Phylogeny and Comparative Genomics of Jiangsu Province, Jiangsu Normal University, Xuzhou 221116, China
  • Received: 2022-05-03; Revised: 2022-07-13; Published: 2022-11-20; Online: 2022-08-11
  • Contact: Yiliang Wei, Shaoyuan Wu. E-mail: kongyongqiang@tmu.edu.cn; weiyiliang.2013@tsinghua.org.cn; shaoyuan5@gmail.com
  • Supported by:
    the Open Project of the Key Laboratory of Forensic Genetics, Ministry of Public Security (No. 2020FGKFKT01); the Postgraduate Research & Practice Innovation Program of Jiangsu Province (Nos. KYCX20_2286, KYCX21_2597)

Abstract:

Han Chinese, Koreans and Japanese are the main populations of East Asia. Han Chinese show a north-to-south gradient of admixture, and the East Asian populations differ in genetic structure to varying degrees. To achieve fine-scale genetic classification of southern Han (S-Han), northern Han (N-Han), Korean and Japanese individuals, we collected and analyzed 1185 ancestry informative SNPs (AISNPs) for East Asian populations, drawn from published reports and our laboratory's earlier data. First, two machine learning algorithms, softmax and random forest (randomForest), were used to build genetic classification models. Then, phylogenetic trees, STRUCTURE and principal component analysis were used to evaluate how well the AISNP panels selected by the different models classified the populations. A 234-AISNP panel was finally selected as the optimal combination. It achieved fine-scale differentiation among the target populations in four classification schemes, and the softmax model reached an accuracy of 92%, accurately distinguishing S-Han, N-Han, Korean and Japanese individuals. The two machine learning models tested in this study provide an important reference for high-resolution discrimination of closely related populations and can serve as useful tools for optimizing marker panels in forensic DNA ancestry inference systems.

Key words: forensic genetics, ancestry informative SNPs, machine learning, East Asia, S-Han and N-Han
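
The softmax classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the genotypes are synthetic, the 20-SNP panel and the allele frequencies (0.8 vs 0.2) are assumptions chosen only to make the toy populations separable, and every function name here is hypothetical. It only shows how a softmax (multinomial logistic) model maps 0/1/2 allele counts to one of four population labels.

```python
import math
import random

POPULATIONS = ["S-Han", "N-Han", "Korean", "Japanese"]  # class labels from the abstract
N_SNPS = 20  # toy panel size (assumption); the study's final panel has 234 AISNPs


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def simulate(n_per_pop, rng):
    """Draw synthetic genotypes (0/1/2 allele counts). NOT real data."""
    data = []
    for k in range(len(POPULATIONS)):
        for _ in range(n_per_pop):
            genotype = []
            for j in range(N_SNPS):
                # Assumed frequencies: SNP j is enriched in population j % 4.
                freq = 0.8 if j % len(POPULATIONS) == k else 0.2
                genotype.append(sum(rng.random() < freq for _ in range(2)))
            data.append((genotype, k))
    return data


def predict_proba(weights, bias, genotype):
    """Class probabilities for one individual."""
    scores = [b + sum(w * x for w, x in zip(row, genotype))
              for row, b in zip(weights, bias)]
    return softmax(scores)


def train(data, epochs=50, lr=0.1):
    """Fit softmax regression by full-batch gradient descent on cross-entropy."""
    n_classes = len(POPULATIONS)
    weights = [[0.0] * N_SNPS for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        grad_w = [[0.0] * N_SNPS for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for genotype, label in data:
            probs = predict_proba(weights, bias, genotype)
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err
                for j, x in enumerate(genotype):
                    grad_w[k][j] += err * x
        for k in range(n_classes):
            bias[k] -= lr * grad_b[k] / len(data)
            for j in range(N_SNPS):
                weights[k][j] -= lr * grad_w[k][j] / len(data)
    return weights, bias


def accuracy(weights, bias, data):
    """Fraction of individuals whose most probable class is the true one."""
    hits = 0
    for genotype, label in data:
        probs = predict_proba(weights, bias, genotype)
        hits += probs.index(max(probs)) == label
    return hits / len(data)


rng = random.Random(0)
train_set = simulate(40, rng)
test_set = simulate(25, rng)
weights, bias = train(train_set)
test_accuracy = accuracy(weights, bias, test_set)
print(f"held-out accuracy on synthetic data: {test_accuracy:.2f}")
```

The accuracy printed here reflects only the artificial frequency contrast built into the simulation and says nothing about the 92% reported for the real 234-AISNP panel.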