Hereditas (Beijing)

• Techniques and Methods •

Genomic prediction of growth trait phenotypes in Large White pigs using a chromosome-encoded multi-head self-attention model

Wenxuan Zhou1,2,3, Zhenjian Zhao1,2,3, Dong Chen1,2,3, Shengdi Cui1,2,3, Junge Wang1,2,3, Ziyang Chen1,2,3, Shixin Yu1,2,3, Jiamiao Chen1,2,3, Yaoxi Zhou1,2,3, Runjie Huang1,2,3, Guoqing Tang1,2,3

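  1. State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130

  2. Key Laboratory of Livestock and Poultry Multi-omics of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130

  3. Farm Animal Genetic Resources Exploration and Innovation Key Laboratory of Sichuan Province, Sichuan Agricultural University, Chengdu 611130
  • Published: 2026-01-12
  • Supported by:
    National Pig Technology Innovation Center Pioneering Science and Technology Project (No. NCTIP-XD/B01), Sichuan Provincial Department of Science and Technology Project (Nos. 2020YFN0024, 2021ZDZX0008, 2021YFYZ0030), and Sichuan Pig Innovation Team Project (No. sccxtd-2022-08)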

Genomic prediction of growth trait phenotypes in Large White pigs using a chromosome-encoded multi-head self-attention model

Wenxuan Zhou1,2,3, Zhenjian Zhao1,2,3, Dong Chen1,2,3, Shengdi Cui1,2,3, Junge Wang1,2,3, Ziyang Chen1,2,3, Shixin Yu1,2,3, Jiamiao Chen1,2,3, Yaoxi Zhou1,2,3, Runjie Huang1,2,3, Guoqing Tang1,2,3

  1. State Key Laboratory of Swine and Poultry Breeding Industry, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China

  2. Key Laboratory of Livestock and Poultry Multi-omics of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China

  3. Farm Animal Genetic Resources Exploration and Innovation Key Laboratory of Sichuan Province, Sichuan Agricultural University, Chengdu 611130, China
  • Online: 2026-01-12
  • Supported by:
    National Pig Technology Innovation Center Pioneering Science and Technology Project (No. NCTIP-XD/B01), Sichuan Provincial Department of Science and Technology Project (Nos. 2020YFN0024, 2021ZDZX0008, 2021YFYZ0030), and Sichuan Pig Innovation Team (No. sccxtd-2022-08)

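Abstract: As genome sequencing technologies have become widespread, using genomic markers to predict complex traits has become central to breeding. However, genomic data are high-dimensional and sparse, and the genetic markers within them interact in complex nonlinear ways, which greatly increases both the difficulty of accurate data analysis and the cost of hardware deployment. This study therefore proposes ChrFormer, a chromosome-encoded multi-head self-attention model, for genomic prediction. The model uses a chromosome encoder to compress whole-genome SNP data into 20 chromosome feature vectors and 1 global feature vector, uses multi-head self-attention to dynamically capture long-range interaction effects across chromosomes, and finally uses a multilayer perceptron (MLP) to predict phenotypes from the genomic features. The study used 50K SNP genotyping data from 4,875 Large White pigs together with four important production traits (backfat thickness at 100 kg and 115 kg, and age at 100 kg and 115 kg). With ten-fold cross-validation and the Pearson correlation coefficient as the evaluation metric, the predictive performance of ChrFormer was systematically compared with genomic best linear unbiased prediction (GBLUP), Bayesian method A (BayesA), and two typical deep learning methods, the visual geometry group (VGG) network and the feedforward neural network (FNN). The deep learning models were also compared in terms of parameter count, training time, and degree of overfitting. The results show that ChrFormer predicted significantly more accurately than the VGG and FNN deep learning models on all tested traits. For three traits (backfat thickness at 100 kg, backfat thickness at 115 kg, and age at 115 kg), its prediction accuracy exceeded that of the traditional GBLUP and BayesA methods. Although a single training iteration of ChrFormer takes longer (54.88 s), the model has only about one-tenth as many parameters as VGG and FNN and shows more stable resistance to overfitting. This study confirms the practicality of the self-attention-based ChrFormer model for genomic prediction of growth trait phenotypes in pigs; its lightweight architecture and stable predictive performance provide a feasible technical approach for accurate phenotype prediction on breeding farms with limited computing resources.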

Key words: multi-head self-attention network, genomic prediction, deep learning, Large White pigs
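
The data flow described in the abstract (per-chromosome encoding, self-attention across 20 chromosome tokens plus 1 global token, then an MLP head) can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the linear chromosome encoders, the mean-pooled global token, feeding only the global token to the MLP, the layer sizes, the toy marker counts, and all names (`rand_matrix`, `self_attention`, `predict`) are assumptions, and the weights are random rather than trained.

```python
import math
import random

rng = random.Random(0)

def rand_matrix(n_out, n_in):
    """Random weights standing in for trained parameters (assumption)."""
    scale = 1.0 / math.sqrt(n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]

def linear(x, w):
    """Matrix-vector product: one output value per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, wq, wk, wv, n_heads):
    """Multi-head scaled dot-product self-attention over token vectors."""
    n, d_head = len(tokens), len(tokens[0]) // n_heads
    q = [linear(t, wq) for t in tokens]
    k = [linear(t, wk) for t in tokens]
    v = [linear(t, wv) for t in tokens]
    out = [[] for _ in tokens]   # concatenated head outputs per token
    weights = []                 # weights[head][query][key]
    for h in range(n_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        head_weights = []
        for i in range(n):
            scores = [
                sum(a * b for a, b in zip(q[i][lo:hi], k[j][lo:hi])) / math.sqrt(d_head)
                for j in range(n)
            ]
            attn = softmax(scores)
            head_weights.append(attn)
            out[i].extend(sum(attn[j] * v[j][c] for j in range(n)) for c in range(lo, hi))
        weights.append(head_weights)
    return out, weights

# Hypothetical sizes chosen only to keep the example small.
N_CHR, D_MODEL, N_HEADS, D_HIDDEN = 20, 8, 2, 16
snps_per_chr = [rng.randint(30, 60) for _ in range(N_CHR)]  # toy marker counts
encoders = [rand_matrix(D_MODEL, n) for n in snps_per_chr]  # one encoder per chromosome
wq, wk, wv = (rand_matrix(D_MODEL, D_MODEL) for _ in range(3))
w1, w2 = rand_matrix(D_HIDDEN, D_MODEL), rand_matrix(1, D_HIDDEN)

def predict(genotype):
    """genotype: list of 20 lists of SNP codes (0/1/2), one list per chromosome."""
    chr_tokens = [linear(g, enc) for g, enc in zip(genotype, encoders)]
    # Assumption: the global token is the mean of the chromosome tokens.
    global_token = [sum(col) / N_CHR for col in zip(*chr_tokens)]
    tokens = chr_tokens + [global_token]              # 21 tokens in total
    attended, weights = self_attention(tokens, wq, wk, wv, N_HEADS)
    hidden = [max(0.0, z) for z in linear(attended[-1], w1)]  # ReLU layer of the MLP
    return linear(hidden, w2)[0], weights

genotype = [[rng.choice((0, 1, 2)) for _ in range(n)] for n in snps_per_chr]
prediction, attn_weights = predict(genotype)
```

Because attention runs over 21 tokens rather than over every SNP, the attention matrix stays 21 × 21 per head regardless of marker density, which is one plausible reading of why such a design could keep the parameter count low.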

Abstract: With the widespread adoption of genome sequencing technologies, predicting complex traits from genomic markers has become a key component of breeding programs. However, the high dimensionality and sparsity of genomic data, together with the complex nonlinear interactions among genetic markers, substantially increase both the difficulty of accurate data analysis and the cost of hardware deployment. This study therefore proposes ChrFormer, a chromosome-encoded multi-head self-attention model for genomic prediction. The model uses a chromosome encoder to compress whole-genome SNP data into 20 chromosome-specific feature vectors and one global feature vector, applies multi-head self-attention to dynamically capture long-range interaction effects across chromosomes, and uses a multilayer perceptron (MLP) to predict phenotypes from the resulting genomic features. The study used 50K SNP genotyping data from 4,875 Large White pigs and four key production traits: backfat thickness at 100 kg and 115 kg, and age at 100 kg and 115 kg. Using ten-fold cross-validation with the Pearson correlation coefficient as the evaluation metric, the predictive performance of ChrFormer was systematically compared with genomic best linear unbiased prediction (GBLUP), Bayesian method A (BayesA), and two representative deep learning methods, the visual geometry group (VGG) network and the feedforward neural network (FNN). The deep learning models were also compared in terms of parameter count, training time, and degree of overfitting. The results show that ChrFormer significantly outperformed the VGG and FNN deep learning models in predictive accuracy on all tested traits. For three of the traits (backfat thickness at 100 kg and 115 kg, and age at 115 kg), its prediction accuracy also surpassed that of the traditional GBLUP and BayesA methods. Although ChrFormer requires a longer training time per iteration (54.88 s), it has only about one-tenth as many parameters as VGG and FNN and shows more stable resistance to overfitting. These results demonstrate that the self-attention-based ChrFormer is practical for genomic prediction of growth trait phenotypes in pigs; its lightweight architecture and stable performance offer a feasible approach for breeding farms with limited computational resources.

Key words: multi-head self-attention network, genomic prediction, deep learning, Large White pigs
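
The evaluation protocol named in the abstract, ten-fold cross-validation scored by the Pearson correlation between predicted and observed phenotypes, can be illustrated with a small standard-library example. This is a sketch under stated assumptions: the single-predictor least-squares model (`fit_line`) is only a placeholder for GBLUP, BayesA, or a neural network, the simulated data are not from the study, and averaging the correlation over folds (rather than pooling predictions) is an assumption about how accuracy is summarized.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Placeholder model (assumption): least squares on one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / sum((a - mx) ** 2 for a in xs)
    return slope, my - slope * mx

def cross_validated_accuracy(xs, ys, k=10, seed=0):
    """Return the mean and per-fold Pearson correlations on held-out folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        slope, intercept = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [slope * xs[i] + intercept for i in test_idx]
        accuracies.append(pearson(preds, [ys[i] for i in test_idx]))
    return sum(accuracies) / k, accuracies

# Simulated data for illustration only: one predictor with additive noise.
rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
mean_r, fold_r = cross_validated_accuracy(xs, ys)
```

Comparing the same correlation on the training folds against the held-out folds would give a simple measure of the overfitting gap that the abstract mentions.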