Hereditas (Beijing) ›› 2024, Vol. 46 ›› Issue (7): 530-539. doi: 10.16288/j.yczz.24-059
Received:
2024-03-11
Revised:
2024-06-25
Published:
2024-07-20
Online:
2024-06-26
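Corresponding author:
Yi Zhang, Professor; research area: genomic breeding technology in cattle. E-mail: yizhang@cau.edu.cn
About the author:
Hui Liang, master's student; specialization: population genomics. E-mail: lhonwork6ye@163.com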
Hui Liang, Xue Wang, Jingfang Si, Yi Zhang
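Abstract:
Breed classification is the foundation for conserving and utilizing the genetic resources of livestock and poultry breeds. Traditional classification relies mainly on body conformation and appearance traits, but because these criteria are difficult to quantify, highly similar breeds are hard to tell apart. Machine learning algorithms show distinct advantages for breed classification based on genomic information. To identify the classification approach best suited to Chinese cattle breeds, this study used genome-wide SNP data from 213 cattle of 7 local breeds to compare how three SNP selection methods (FST ranking, mRMR, and Relief-F) and three machine learning algorithms (Random Forest, RF; Support Vector Machine, SVM; Naive Bayes, NB) affect breed classification accuracy. The results showed that: 1) with 1500 or more SNPs selected by the FST method, or 1000 or more SNPs selected by the mRMR algorithm, the SVM classifier reached a classification accuracy of 99.47% or higher; 2) SVM was the best-performing classification algorithm, followed by NB, while FST and mRMR were the best SNP selection methods, followed by Relief-F; 3) misclassification occurred mostly between highly similar breeds. This study shows that machine learning classification models combined with genomic data are an effective way to identify local cattle breeds, and it provides a technical basis for the rapid and accurate classification of cattle breeds in China.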
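To make the FST ranking step concrete, the following is a minimal sketch of how SNPs might be scored and ranked by between-breed differentiation before being passed to a classifier. The abstract does not state which FST estimator was used, so this sketch assumes a simple Nei-style estimator, (H_T - H_S) / H_T, with unweighted breed means. The function names `nei_fst` and `top_snps_by_fst` and the toy genotypes are invented for illustration and are not taken from the study.

```python
from typing import Dict, List, Sequence, Tuple


def nei_fst(genotypes: Sequence[int], breeds: Sequence[str]) -> float:
    """Nei-style FST for one biallelic SNP (an assumed, simplified estimator).

    genotypes: copies of the alternate allele (0, 1 or 2) for each animal.
    breeds: breed label for each animal, in the same order as genotypes.
    """
    by_breed: Dict[str, List[int]] = {}
    for g, b in zip(genotypes, breeds):
        by_breed.setdefault(b, []).append(g)

    # Alternate-allele frequency within each breed (diploid animals).
    freqs = [sum(gs) / (2 * len(gs)) for gs in by_breed.values()]

    p_bar = sum(freqs) / len(freqs)  # unweighted mean frequency across breeds
    h_t = 2 * p_bar * (1 - p_bar)    # expected heterozygosity of the pooled population
    h_s = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-breed value
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


def top_snps_by_fst(
    matrix: Sequence[Sequence[int]], breeds: Sequence[str], k: int
) -> Tuple[List[int], List[float]]:
    """Return the column indices of the k highest-FST SNPs and all FST scores."""
    n_snps = len(matrix[0])
    scores = [nei_fst([row[j] for row in matrix], breeds) for j in range(n_snps)]
    ranked = sorted(range(n_snps), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (invented): 4 animals from 2 hypothetical breeds, 3 SNPs per animal.
toy_breeds = ["A", "A", "B", "B"]
toy_genotypes = [
    [2, 1, 0],
    [2, 1, 1],
    [0, 1, 1],
    [0, 1, 0],
]
selected, fst_scores = top_snps_by_fst(toy_genotypes, toy_breeds, k=1)
print(selected, fst_scores)
```

In this toy example the first SNP is fixed for opposite alleles in the two breeds, so it receives the highest score and is the one selected. In a workflow like the one described, the selected columns would then serve as features for a classifier such as SVM, RF, or NB.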
Cite this article: Hui Liang, Xue Wang, Jingfang Si, Yi Zhang. Classification accuracy of machine learning algorithms for Chinese local cattle breeds using genomic markers[J]. Hereditas (Beijing), 2024, 46(7): 530-539.