Hereditas (Beijing) ›› 2024, Vol. 46 ›› Issue (8): 661-669. doi: 10.16288/j.yczz.24-102

• Techniques and Methods •

EC number prediction of protein sequences based on combination of hierarchical and global features

Fan Yang1,2,3,4, Qiaoling Han1,2,3,4, Wendi Zhao1,2,3,4, Yue Zhao1,2,3,4

  1. School of Technology, Beijing Forestry University, Beijing 100083, China
  2. Key Lab of State Forestry Administration for Forestry Equipment and Automation, Beijing 100083, China
  3. Beijing Laboratory of Urban and Rural Ecological Environment, Beijing 100083, China
  4. Research Center for Intelligent Forestry, Beijing Forestry University, Beijing 100083, China
  • Received: 2024-04-12 Revised: 2024-06-18 Published: 2024-08-20 Online: 2024-06-19
  • Corresponding author: Yue Zhao, PhD, professor; research interests: artificial intelligence, image processing, and pattern recognition. E-mail: zhaoyue0609@126.com
  • About the author: Fan Yang, master's student; specialization: machine learning and data processing. E-mail: yangfan_muyi@163.com
  • Supported by:
    National Natural Science Foundation of China, General Program (32071838); National Natural Science Foundation of China, Young Scientists Fund (32101590)

Abstract:

Identifying enzyme functions is crucial for understanding the mechanisms of biological activity and for advancing the life sciences. However, existing methods for predicting enzyme EC numbers do not fully exploit protein sequence information and still fall short in accuracy. To address this, we propose an EC number prediction network using hierarchical features and global features (ECPN-HFGF). The method first uses a residual network to extract generic features from protein sequences, then applies a hierarchical feature extraction module and a global feature extraction module to further extract the hierarchical and global features of the sequences. The predictions derived from the two feature types are then combined within a multitask learning framework to predict enzyme EC numbers accurately. Computational experiments show that ECPN-HFGF achieves the best performance on EC number prediction for protein sequences, with macro F1 and micro F1 scores of 95.5% and 99.0%, respectively. ECPN-HFGF effectively combines the hierarchical and global features of protein sequences to predict EC numbers quickly and accurately, with higher prediction accuracy than commonly used current methods, and provides an efficient approach for enzymology research and enzyme engineering applications.

Key words: enzyme function prediction, protein sequence, deep learning, hierarchical multi-label classification, global feature
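The abstract reports both a macro F1 and a micro F1 score. As a minimal sketch of how these two averages differ on a multi-label task, the following computes both from per-label counts. The function names `f1` and `macro_micro_f1` and the toy predictions are illustrative assumptions, not the authors' evaluation code or data; the two EC numbers are used only as example labels.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro F1, micro F1) for multi-label data.

    y_true, y_pred: equal-length lists of label sets, one set per sequence.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        for label in pred & true:
            tp[label] += 1
        for label in pred - true:
            fp[label] += 1
        for label in true - pred:
            fn[label] += 1

    labels = set(tp) | set(fp) | set(fn)
    if not labels:
        return 0.0, 0.0
    # Macro: average the per-label F1, so every label counts equally.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    # Micro: pool the counts over all labels, so frequent labels dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


if __name__ == "__main__":
    # Hypothetical toy data: one frequent label and one rare label.
    y_true = [{"1.1.1.1"}] * 8 + [{"3.4.21.4"}] * 2
    y_pred = [{"1.1.1.1"}] * 8 + [{"3.4.21.4"}, {"1.1.1.1"}]
    macro, micro = macro_micro_f1(y_true, y_pred)
    print(f"macro F1 = {macro:.3f}, micro F1 = {micro:.3f}")
```

In this toy case a single error on the rare label lowers macro F1 much more than micro F1. That is the usual reason a macro score sits below a micro score when label frequencies are imbalanced.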