Machine learning-based geographical ancestry inference model for the Han Chinese population

doi:10.16288/j.yczz.25-261

Hereditas (Beijing)

• Techniques and Methods •


Shuaiqi Wang^1,2, Chunnian Wang^1,2, Deqin Zhang^2,3, Linlin Lou^2,4, Yiting Ban^2, Li Jiang^2, Caixia Li^1,2

1. School of Investigation, People's Public Security University of China, Beijing 100038, China

2. Institute of Forensic Science, Ministry of Public Security & Beijing Engineering Research Center of Crime Scene Evidence Examination & National Engineering Laboratory for Forensic Science, Beijing 100038, China

3. Department of Forensic Medicine, Guizhou Medical University, Guiyang 550004, China

4. School of Forensic Medicine, Shanxi Medical University, Jinzhong 030600, China

Received: 2025-12-23 Revised: 2026-02-18 Online: 2026-03-06

Funding: Supported by the National Key R&D Program of China (No. 2022YFC3341004), the National Natural Science Foundation of China (No. 82171870), the Science and Technology Program of the Ministry of Public Security (No. 2023JSZ02), and the Beijing Nova Program of Science and Technology (No. 20220484149).

Abstract:

The Han Chinese population has a complex genetic structure with subtle but discernible regional differentiation. Characterizing this fine-scale structure and building efficient models for inferring geographical origin are important for understanding population history and for precise ancestry inference, yet ancestry inference models designed for Han Chinese within China remain scarce. In this study, we analyzed genome-wide high-density SNP data from Han Chinese individuals from eight provinces of China to examine the relationship between genetic structure and geographic distribution, and to build a machine learning (ML)-based model for inferring geographical origin. After quality control, including linkage disequilibrium pruning, 1,229 samples and 208,193 SNPs were retained. Principal component analysis (PCA) and ADMIXTURE clustering showed a measurable degree of genetic differentiation among Han Chinese from different regions, and on this basis the samples were divided into seven genetic regions. Using the principal components (PCs) obtained from PCA as input features, we compared several classifiers, including XGBoost (eXtreme gradient boosting), random forest (RF), and K-nearest neighbors (KNN), by five-fold cross-validation on the reference dataset. Performance was evaluated by first-rank prediction accuracy and by a likelihood ratio (LR)-based accuracy, and the best model was then validated on independent test sets. XGBoost performed best on the reference set, with a first-rank prediction accuracy of 87.66% and an LR accuracy of 96.87%. On the independent test sets it reached a first-rank accuracy above 85% and an LR accuracy above 95%, indicating good generalization. The resulting model for Han Chinese combines efficiency, robustness, and high accuracy, and offers a reliable methodological tool for population genetics and forensic genetics, including forensic DNA intelligence applications such as inferring the geographic origin of biological evidence.

Key words: Han Chinese, biogeographic ancestry inference, high-density SNP, principal component analysis, machine learning
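The classification step described in the abstract (principal components as input features, five-fold cross-validation, first-rank accuracy) can be illustrated with a minimal sketch. This is not the authors' pipeline. It assumes the PCs are already computed, replaces the real data with synthetic two-dimensional coordinates for three hypothetical regions (the study uses seven), and uses only a plain K-nearest-neighbors classifier written with the Python standard library. XGBoost, random forest, and the LR-based metric are not reproduced here, and all function names are illustrative.

```python
import math
import random
from collections import Counter


def make_synthetic_pcs(n_per_region=40, seed=0):
    """Simulate 2-D 'PC' coordinates for three hypothetical regions (not real data)."""
    rng = random.Random(seed)
    centers = {"north": (-3.0, 0.0), "central": (0.0, 3.0), "south": (3.0, 0.0)}
    samples = []
    for region, (cx, cy) in centers.items():
        for _ in range(n_per_region):
            samples.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), region))
    return samples


def knn_predict(train, point, k=5):
    """Return the majority label among the k nearest training samples."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def stratified_folds(samples, n_folds=5, seed=0):
    """Split samples into folds so each region is spread evenly across folds."""
    rng = random.Random(seed)
    by_label = {}
    for sample in samples:
        by_label.setdefault(sample[1], []).append(sample)
    folds = [[] for _ in range(n_folds)]
    for group in by_label.values():
        rng.shuffle(group)
        for i, sample in enumerate(group):
            folds[i % n_folds].append(sample)
    return folds


def cross_validated_accuracy(samples, n_folds=5, k=5):
    """First-rank (top-1) accuracy pooled over all held-out folds."""
    folds = stratified_folds(samples, n_folds)
    correct = 0
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        correct += sum(knn_predict(train, x, k) == y for x, y in test_fold)
    return correct / len(samples)


accuracy = cross_validated_accuracy(make_synthetic_pcs())
print(f"5-fold first-rank accuracy: {accuracy:.3f}")
```

With real genotype data, the PCA projection would need to be fitted on the training folds only and then applied to the held-out fold, so that no information from test samples leaks into the features.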

Shuaiqi Wang, Chunnian Wang, Deqin Zhang, Linlin Lou, Yiting Ban, Li Jiang, Caixia Li. Machine learning-based geographical ancestry inference model for the Han Chinese population[J]. Hereditas (Beijing), doi: 10.16288/j.yczz.25-261.

