Hereditas (Beijing) (遗传)

• Research Report •

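Geographical inference of dust from typical Chinese cities based on metagenomic shotgun sequencing

Qi Yang (杨琪)1,2, Kelai Kang (康克莱)2, Bo Zhao (赵博)3,4, Kai Feng (冯凯)3,4, Yaosen Feng (冯耀森)2, Jian Ye (叶健)1,2, Ye Deng (邓晔)3,4, Le Wang (王乐)2

  1. People's Public Security University of China, Beijing 100038

  2. Key Laboratory of Forensic Genetics of the Ministry of Public Security, Institute of Forensic Science, Ministry of Public Security, Beijing 100038

  3. CAS Key Laboratory of Environmental Biotechnology, Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085

  4. College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100049


  • Received: 2025-02-12 Revised: 2025-03-20 Published: 2025-03-21 Online: 2025-03-21
  • Funding:
    Science and Technology Strengthening Police Basic Work Special Project of the Ministry of Public Security; Basic Research Funds of the Institute of Forensic Science, Ministry of Public Security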

Abstract:

Microbial profiles in dust are closely correlated with geographical location and can provide valuable clues for criminal investigations, giving them significant potential in forensic science. However, the feasibility of using microbial community features from metagenomic datasets to infer geographical location remains underexplored. In this study, we collect 170 environmental dust samples from urban residential communities in four cities with distinct geography and climate, located in northern, eastern, southwestern, and northwestern China. All samples are subjected to shotgun metagenomic sequencing to reveal variations in microbial composition. In total, 41,029 species are annotated, of which 93.39% are bacteria, 6.37% eukaryotes, 0.21% viruses, and 0.03% archaea. Microbial community composition differs significantly among the four cities, and these differences clearly separate the samples by city (R² = 0.870, P < 0.001). Filtering out species with detection rates below 10% across all samples further strengthens city-level separation (R² = 0.948, P < 0.001). Additionally, 127 city-representative differential species (biomarkers) are identified using linear discriminant analysis effect size (LEfSe). Each city harbors a distinct microbial community, with unique species and relatively abundant taxa that together form a city-specific microbial profile. All samples are randomly split into training and testing sets at a 7:3 ratio. Five machine learning models, SourceTracker, FEAST, LightGBM, random forest, and support vector machine (SVM), are applied to the 51 randomly selected test samples to simulate predicting the geographical origin of environmental samples of unknown source, achieving average accuracies of 88.89%, 92.16%, 98.04%, 99.35%, and 69.28%, respectively. These results constitute a microbial genetic map of four cities in China, highlight the distinct microbial taxonomic signatures of different cities, and provide an approach for city-scale source tracking of dust samples.

Keywords: dust, metagenomic shotgun sequencing, microbial composition, geographical inference
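Two preprocessing steps named in the abstract, dropping species detected in fewer than 10% of samples and randomly splitting the samples 7:3, can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The function names `filter_by_prevalence` and `random_split_7_3`, the toy abundance table, the sample ids, and the choice to count a species as "detected" when its abundance is above zero are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def filter_by_prevalence(abundance, min_prevalence=0.10):
    """Keep species detected in at least `min_prevalence` of all samples.

    abundance: dict mapping sample id -> {species name: relative abundance}.
    Assumption: a species counts as detected when its abundance is > 0.
    """
    n_samples = len(abundance)
    detections = defaultdict(int)
    for profile in abundance.values():
        for species, value in profile.items():
            if value > 0:
                detections[species] += 1
    kept = {s for s, n in detections.items() if n / n_samples >= min_prevalence}
    return {
        sample: {s: v for s, v in profile.items() if s in kept}
        for sample, profile in abundance.items()
    }


def random_split_7_3(sample_ids, seed=0):
    """Shuffle sample ids and split them 7:3 into (training, testing) lists."""
    ids = sorted(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = (len(ids) * 3 + 5) // 10  # 30% of the samples, rounded half up
    return ids[n_test:], ids[:n_test]


# Toy data (invented for illustration): 20 samples, three species.
toy = {}
for i in range(20):
    profile = {"common_sp": 0.9}
    if i < 2:
        profile["edge_sp"] = 0.05  # detected in 2/20 = 10% of samples: kept
    if i == 0:
        profile["rare_sp"] = 0.05  # detected in 1/20 = 5% of samples: dropped
    toy[f"sample_{i:02d}"] = profile

filtered = filter_by_prevalence(toy)
print(sorted({s for p in filtered.values() for s in p}))

# 170 hypothetical sample ids, matching the number of dust samples collected.
train_ids, test_ids = random_split_7_3([f"dust_{i:03d}" for i in range(170)])
print(len(train_ids), len(test_ids))
```

With 170 sample ids, the 7:3 split gives 119 training and 51 testing samples, which agrees with the 51 test samples reported above. The split here is a plain random shuffle; whether the study stratified the split by city is not stated in the abstract.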