
Hereditas (Beijing), 2022, Vol. 44, Issue (11): 1028-1043. doi: 10.16288/j.yczz.22-073

• Research Article •

Optimization scheme of machine learning model for genetic division between northern Han, southern Han, Korean and Japanese

Yongqiang Kong1, Jinkai Liu1, Jiaqi Gu2, Jingyi Xu1, Yunuo Zheng2, Yiliang Wei2, Shaoyuan Wu1,2

    1. Key Laboratory of Tianjin for Epigenetics, Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences, Tianjin Medical University, Tianjin 300070, China
    2. Key Laboratory of Phylogeny and Comparative Genomics of Jiangsu Province, Jiangsu Normal University, Xuzhou 221116, China
  • Received: 2022-05-03 Revised: 2022-07-13 Online: 2022-11-20 Published: 2022-08-11
  • Contact: Wei Yiliang, Wu Shaoyuan E-mail: kongyongqiang@tmu.edu.cn; weiyiliang.2013@tsinghua.org.cn; shaoyuan5@gmail.com
  • Supported by:
    the Key Laboratory of Forensic Genetics of China (No. 2020FGKFKT01); the Graduate Research and Practice Innovation Program of Jiangsu Normal University (Nos. KYCX20_2286, KYCX21_2597)

Abstract:

Han Chinese, Korean and Japanese are the main populations of East Asia, and Han Chinese show a gradient of admixture from north to south. The East Asian populations also differ in genetic structure. To achieve fine-scale genetic classification of southern (S-) and northern (N-) Han Chinese, Korean and Japanese individuals, we collected and analyzed 1185 ancestry informative SNPs (AISNPs) from previous literature reports and our own laboratory findings. First, two machine learning algorithms, softmax and randomForest, were used to build genetic classification models. Then, phylogenetic tree, STRUCTURE and principal component analyses were used to evaluate the classification performance of different AISNP panels. The 234-AISNP panel achieved fine-scale differentiation among the target populations in four classification schemes. The softmax model reached an accuracy of 92% in classifying S-Han, N-Han, Korean and Japanese individuals. The two machine learning models tested in this study provide important references for high-resolution discrimination of closely related populations and will be useful tools for optimizing marker panels when developing forensic DNA ancestry inference systems.

Key words: forensic genetics, ancestry informative SNPs, machine learning, East Asia, S-Han and N-Han
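
The abstract names softmax as one of the two classifiers but gives no implementation details. As a rough illustration of the general technique only, the sketch below fits a softmax (multinomial logistic) regression to synthetic 0/1/2 genotype counts for four population labels. Everything in it is an assumption made for illustration: the allele frequencies are randomly invented, the SNP count and sample sizes are arbitrary, and the function names are hypothetical. The simulated populations are far more distinct than closely related East Asian groups, so the accuracy it prints says nothing about the study's results.

```python
import math
import random

POPULATIONS = ["S-Han", "N-Han", "Korean", "Japanese"]


def simulate_genotypes(n_per_pop, n_snps, rng):
    """Return synthetic (genotypes, labels); allele frequencies are invented."""
    X, y = [], []
    for label in range(len(POPULATIONS)):
        freqs = [rng.uniform(0.1, 0.9) for _ in range(n_snps)]
        for _ in range(n_per_pop):
            # Genotype = number of copies (0, 1 or 2) of one allele at each SNP.
            X.append([(rng.random() < f) + (rng.random() < f) for f in freqs])
            y.append(label)
    return X, y


def softmax(scores):
    """Convert raw class scores to probabilities (max-shifted for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(W, b, x):
    """Class probabilities for one individual with genotype vector x."""
    return softmax([sum(w * xi for w, xi in zip(W[k], x)) + b[k]
                    for k in range(len(W))])


def train_softmax(X, y, n_classes, epochs=200, lr=0.05):
    """Fit multinomial logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        grad_W = [[0.0] * d for _ in range(n_classes)]
        grad_b = [0.0] * n_classes
        for x, label in zip(X, y):
            probs = predict_proba(W, b, x)
            for k in range(n_classes):
                # Cross-entropy gradient: predicted probability minus one-hot target.
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_b[k] += err
                for j in range(d):
                    grad_W[k][j] += err * x[j]
        for k in range(n_classes):
            b[k] -= lr * grad_b[k] / n
            for j in range(d):
                W[k][j] -= lr * grad_W[k][j] / n
    return W, b


def accuracy(W, b, X, y):
    """Fraction of individuals whose most probable class matches the label."""
    hits = 0
    for x, label in zip(X, y):
        probs = predict_proba(W, b, x)
        hits += probs.index(max(probs)) == label
    return hits / len(y)


# Simulate data, hold out 25% of individuals, train, and score the held-out set.
rng = random.Random(0)
X, y = simulate_genotypes(n_per_pop=30, n_snps=20, rng=rng)
order = list(range(len(y)))
rng.shuffle(order)
cut = int(0.75 * len(order))
train, test = order[:cut], order[cut:]
W, b = train_softmax([X[i] for i in train], [y[i] for i in train], len(POPULATIONS))
held_out_accuracy = accuracy(W, b, [X[i] for i in test], [y[i] for i in test])
print(f"held-out accuracy on synthetic data: {held_out_accuracy:.2f}")
```

In a panel-optimization setting, one would presumably repeat such a fit for each candidate AISNP subset and compare held-out accuracy. How the study actually selected and compared panels is not described in the abstract.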