遗传

• 技术与方法 (Techniques and Methods) •

融合通道与空间注意力机制的转录因子结合位点预测方法

丰继华¹,²、陈忠兴¹,²、康琦林¹,²、李龙飞¹,²、杨佳慧¹,²、张雨亭¹,²

  1. 云南民族大学电气信息工程学院信息工程系,昆明 650504

  2. 云南省无人自主系统重点实验室,昆明 650504

  • 收稿日期:2025-10-16 修回日期:2026-01-04 发布日期:2026-01-13
  • 基金资助:
    国家自然科学基金项目(编号:31160234)资助

Prediction method for transcription factor binding sites integrating channel and spatial attention mechanisms

Jihua Feng¹,², Zhongxing Chen¹,², Qilin Kang¹,², Longfei Li¹,², Jiahui Yang¹,², Yuting Zhang¹,²

  1. School of Electrical and Information Engineering, Yunnan Minzu University, Kunming 650504, China

  2. Yunnan Key Laboratory of Unmanned Autonomous System, Kunming 650504, China

  • Received: 2025-10-16  Revised: 2026-01-04  Online: 2026-01-13
  • Supported by:
    National Natural Science Foundation of China (No. 31160234)

摘要:

精准识别单核苷酸分辨率下的转录因子结合位点(transcription factor binding sites, TFBSs)是解析基因表达调控网络的核心科学问题。为改进现有计算模型在跨细胞类型预测中的性能,本研究提出一种融合通道与空间注意力机制的深度学习模型。通过系统整合10个核心转录调控因子(包括CTCF、EGR1、FOXA1等)在13种典型人类细胞系(涵盖A549、GM12878、H1-hESC等)的51组染色质免疫沉淀测序(chromatin immunoprecipitation sequencing, ChIP-seq)数据和13组脱氧核糖核酸酶I高敏感位点测序(deoxyribonuclease I hypersensitive site sequencing, DNase-seq)数据对模型进行训练与测试。结果表明,模型在23个测试的TF-细胞类型组合中表现出优异性能,平均受试者工作特征曲线下面积(area under receiver operating characteristic curve, AUROC)达到0.986,其中91%样本的AUROC超过0.970;平均精确率-召回率曲线下面积(area under precision recall curve, AUPRC)为0.169,较随机预测基线(0.000156)提升超1,000倍。相较于FactorNet、Leopard、DeepGRN等当前领域内具有代表性的模型,本模型在9个共有的TF-细胞类型数据集上,其AUROC均值展现出优势。可视化分析表明,模型能精准识别TF在不同细胞类型中的特异性结合位点。上述结果表明,模型为跨细胞类型的TFBSs精准预测提供了高效计算工具,有望为基因表达调控机制的深入解析及相关疾病分子机理研究提供重要支撑。

关键词:

转录因子结合位点, 注意力机制, 深度学习, 单核苷酸分辨率, 跨细胞预测

Abstract:

Accurate identification of transcription factor binding sites (TFBSs) at single-nucleotide resolution remains a central challenge in deciphering gene expression regulatory networks. To improve on existing computational models for cross-cell-type TFBS prediction, we present a deep learning model that integrates channel and spatial attention mechanisms. The model was trained and tested on 51 chromatin immunoprecipitation sequencing (ChIP-seq) datasets covering 10 core transcription factors (e.g., CTCF, EGR1, FOXA1) across 13 human cell lines (e.g., A549, GM12878, H1-hESC), together with 13 deoxyribonuclease I hypersensitive site sequencing (DNase-seq) datasets. Across the 23 TF-cell type combinations tested, the model achieved a mean area under the receiver operating characteristic curve (AUROC) of 0.986, with 91% of samples exceeding an AUROC of 0.970. The mean area under the precision-recall curve (AUPRC) was 0.169, more than 1,000-fold above the random-prediction baseline (0.000156). Compared with representative models in the field, such as FactorNet, Leopard, and DeepGRN, our model achieved a higher mean AUROC on the 9 shared TF-cell type datasets. Visualization analyses indicated that the model accurately identifies cell-type-specific TFBSs. This study provides an efficient computational tool for precise cross-cell-type TFBS prediction and may support in-depth investigation of gene expression regulatory mechanisms and the molecular pathogenesis of related diseases.

Key words:

transcription factor binding sites, attention mechanism, deep learning, single-nucleotide resolution, cross-cell prediction
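
The two metrics reported in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' evaluation code: the function names, the toy labels and scores, and the use of average precision as the AUPRC estimate are all assumptions made for illustration. The sketch also shows why a random predictor's AUPRC is roughly the fraction of positive positions, which is why an AUPRC of 0.169 can be strong when binding sites are extremely rare.

```python
def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random negative.

    Ties contribute 0.5. Quadratic in the number of samples, so toy-sized only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision, a common step-wise estimate of AUPRC.

    Assumes at least one positive label and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / hits


def random_auprc_baseline(labels):
    """Expected AUPRC of a random predictor: the fraction of positives."""
    return sum(labels) / len(labels)


# Hypothetical toy data: 1 = bound position, 0 = unbound position.
labels = [0, 0, 1, 0, 1, 0, 0, 0]
scores = [0.1, 0.4, 0.8, 0.3, 0.35, 0.2, 0.05, 0.6]

toy_auroc = auroc(labels, scores)
toy_ap = average_precision(labels, scores)
toy_baseline = random_auprc_baseline(labels)

# Fold improvement implied by the two figures quoted in the abstract.
fold_over_random = 0.169 / 0.000156

print(f"toy AUROC={toy_auroc:.3f}, toy AP={toy_ap:.3f}, toy baseline={toy_baseline:.3f}")
print(f"abstract figures: {fold_over_random:.0f}-fold over random")
```

With the abstract's figures, 0.169 / 0.000156 is about 1,083, consistent with the stated "more than 1,000-fold" improvement.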