遗传

• Techniques and Methods •

基于机器学习的汉族人群地域来源推断模型

王帅琪1,2,王春年1,2,张德琴2,3,娄琳琳2,4,班一婷2,江丽2,李彩霞1,2   

    1. 中国人民公安大学侦查学院,北京 100038

    2. 公安部鉴定中心,北京市现场物证检验工程技术研究中心,现场物证溯源技术国家工程实验室,北京 100038

    3. 贵州医科大学法医学院,贵阳 550004

    4. 山西医科大学法医学院,晋中 030600

  • Received: 2025-12-23 Revised: 2026-02-18 Published: 2026-03-06
  • Funding:
    Supported by the National Key R&D Program of China (No. 2022YFC3341004), the National Natural Science Foundation of China (No. 82171870), the Science and Technology Program of the Ministry of Public Security (No. 2023JSZ02), and the Beijing Nova Program of Science and Technology (No. 20220484149)


Machine learning-based geographical ancestry inference model for the Han Chinese population

Shuaiqi Wang1,2, Chunnian Wang1,2, Deqin Zhang2,3, Linlin Lou2,4, Yiting Ban2, Li Jiang2, Caixia Li1,2

    1. School of Investigation, People’s Public Security University of China, Beijing 100038, China

    2. Institute of Forensic Science, Ministry of Public Security & Beijing Engineering Research Center of Crime Scene Evidence Examination & National Engineering Laboratory for Forensic Science, Beijing 100038, China

    3. Department of Forensic Medicine, Guizhou Medical University, Guiyang 550004, China

    4. School of Forensic Medicine, Shanxi Medical University, Jinzhong 030600, China

  • Received: 2025-12-23 Revised: 2026-02-18 Online: 2026-03-06

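Abstract:

The Han Chinese population has a complex genetic structure, and populations from different regions show a degree of regional genetic differentiation. Exploring the fine-scale genetic structure of the Han Chinese and building an efficient model for inferring regional origin are important for revealing patterns of population evolution and for achieving precise ancestry inference. However, ancestry inference models targeting the Han Chinese within China remain scarce. This study aims to analyze high-density SNP data from the Han Chinese to explore the association between population genetic structure and geographic distribution, and to build a regional origin inference model based on machine learning algorithms, thereby improving the resolving power of ancestry inference for the Han Chinese in China. Genome-wide SNP data from Han Chinese individuals in eight provinces of China were selected. Quality control, including linkage disequilibrium testing, was applied to build the population dataset, which after quality control comprised 1,229 samples and 208,193 SNP loci. Genetic structure was first analyzed with methods including principal component analysis (PCA) and ADMIXTURE clustering. The results showed a degree of genetic structural differentiation among Han Chinese from different regions, and on this basis the Han Chinese population was divided into seven genetic regions. Machine learning (ML) algorithms were then applied, with the principal components (PCs) obtained from PCA dimensionality reduction as input features. Five-fold cross-validation on the reference population dataset was used to compare the predictive performance of different ML classification models, including XGBoost (eXtreme gradient boosting), random forest (RF), and K-nearest neighbors (KNN). The likelihood ratio (LR) method was introduced as an evaluation metric, and the best-performing prediction model was built and validated on independent test sets. In the reference set, the XGBoost model performed best, with a first-rank prediction accuracy of 87.66% and an LR accuracy of 96.87%. In the test sets, the XGBoost model achieved a first-rank prediction accuracy above 85% and an LR accuracy above 95%, indicating good generalization. The machine learning-based prediction model for the Han Chinese population developed in this study combines efficiency, robustness, and high accuracy, and provides a reliable methodological tool for research in population genetics, forensic genetics, and related fields.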

Key words: Han Chinese, regional origin inference, high-density SNP, principal component analysis, machine learning
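
The cross-validation step summarized in the abstract can be illustrated with a minimal sketch. It assumes PC scores have already been computed, replaces the real genotype-derived PCs with synthetic Gaussian clusters, and evaluates only a hand-written K-nearest-neighbors classifier in pure Python. The function names (`make_toy_pcs`, `stratified_folds`, `knn_predict`, `cross_validate`), the cluster spacing, and the number of regions are hypothetical choices for illustration and are not taken from the study. In practice, XGBoost or random forest classifiers would be evaluated in the same fold loop.

```python
import random
from collections import Counter

def make_toy_pcs(n_per_region=40, n_regions=3, n_pcs=4, seed=0):
    """Synthetic stand-in for PC scores: one Gaussian cluster per region."""
    rng = random.Random(seed)
    X, y = [], []
    for region in range(n_regions):
        for _ in range(n_per_region):
            # Cluster centers sit 3 units apart on every PC (arbitrary assumption).
            X.append([rng.gauss(3.0 * region, 1.0) for _ in range(n_pcs)])
            y.append(region)
    return X, y

def stratified_folds(y, k=5, seed=0):
    """Split sample indices into k folds with similar region proportions."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(y):
        by_label.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds

def knn_predict(train_X, train_y, query, n_neighbors=5):
    """Majority vote among the n_neighbors closest training samples."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:n_neighbors])
    return votes.most_common(1)[0][0]

def cross_validate(X, y, k=5, n_neighbors=5):
    """Return the first-rank (top-1) accuracy of KNN on each held-out fold."""
    accuracies = []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train_X = [X[i] for i in range(len(X)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        correct = sum(
            knn_predict(train_X, train_y, X[i], n_neighbors) == y[i]
            for i in held_out
        )
        accuracies.append(correct / len(held_out))
    return accuracies

X, y = make_toy_pcs()
fold_acc = cross_validate(X, y)
mean_acc = sum(fold_acc) / len(fold_acc)
print("per-fold accuracy:", [round(a, 3) for a in fold_acc])
print("mean accuracy:", round(mean_acc, 3))
```

Stratifying the folds keeps every region represented in each held-out fold, which matters when regional sample sizes are unequal.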

Abstract:

The Han Chinese population exhibits a complex genetic structure characterized by subtle yet discernible regional differentiation. Elucidating this fine-scale population structure and developing robust models for biogeographical ancestry inference are important for revealing population evolutionary patterns and achieving precise ancestry inference. However, ancestry inference models specifically tailored to the genetic diversity within the domestic Han Chinese population remain scarce. In this study, we analyzed high-density SNP data from 1,229 Han Chinese individuals across eight provinces to investigate the correlation between genetic variation and geographic distribution, and to construct a machine learning-based model for regional ancestry prediction. After stringent quality control (including linkage disequilibrium pruning), we retained 208,193 SNPs for downstream analysis. Principal component analysis (PCA) and ADMIXTURE clustering revealed measurable genetic stratification corresponding to geography, supporting the delineation of seven distinct genetic clusters within the Han population. Using the top principal components as features, we trained and compared multiple classifiers (XGBoost, random forest, and K-nearest neighbors) via five-fold cross-validation on the reference set, with model performance evaluated using both first-rank prediction accuracy and likelihood ratio (LR)-based metrics. XGBoost emerged as the optimal model, achieving a first-rank prediction accuracy of 87.66% and an LR-based accuracy of 96.87% in cross-validation. In independent test sets, the model maintained strong performance (first-rank accuracy >85%; LR accuracy >95%), demonstrating good generalizability and stability. We present a high-resolution, machine learning-driven ancestry inference framework tailored to Han Chinese populations. Its efficiency, robustness, and accuracy hold promise for applications in population genetics and forensic DNA intelligence, particularly in geographic sourcing of biological evidence.

Key words:

Han Chinese, biogeographic ancestry inference, high-density SNP, principal component analysis, machine learning
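
The abstract reports two evaluation figures, first-rank accuracy and LR accuracy, without giving the exact LR criterion. The sketch below shows one plausible reading and should be treated as an assumption rather than the authors' definition. Here the LR for a sample is the top-ranked region's predicted probability divided by the true region's predicted probability, and a sample counts as LR-correct when that ratio is below a threshold of 10. The threshold, the region labels `A`, `B`, `C`, and the probability values are all hypothetical.

```python
def top1_accuracy(probs, truth):
    """Fraction of samples whose highest-probability region is the true one."""
    hits = sum(max(p, key=p.get) == t for p, t in zip(probs, truth))
    return hits / len(truth)

def lr_accuracy(probs, truth, threshold=10.0, eps=1e-12):
    """Fraction of samples whose true region is not excluded by the LR.

    Assumed criterion (not from the source): LR = P(top region) / P(true region),
    and the true region is retained when LR < threshold.
    """
    hits = 0
    for p, t in zip(probs, truth):
        lr = max(p.values()) / max(p[t], eps)  # eps guards against division by zero
        if lr < threshold:
            hits += 1
    return hits / len(truth)

# Hypothetical per-sample class probabilities from some classifier.
probs = [
    {"A": 0.90, "B": 0.08, "C": 0.02},  # true A: top-1 hit
    {"A": 0.55, "B": 0.40, "C": 0.05},  # true B: top-1 miss, LR = 1.375
    {"A": 0.97, "B": 0.02, "C": 0.01},  # true C: top-1 miss, LR = 97
    {"A": 0.10, "B": 0.20, "C": 0.70},  # true C: top-1 hit
]
truth = ["A", "B", "C", "C"]

top1 = top1_accuracy(probs, truth)
lr_acc = lr_accuracy(probs, truth)
print(f"top-1 accuracy: {top1:.2f}")  # 0.50
print(f"LR accuracy: {lr_acc:.2f}")   # 0.75
```

Under this reading, LR accuracy is never lower than first-rank accuracy, because a correct top-ranked call always has an LR of 1. That ordering matches the figures reported in the abstract.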