Another problem uncovered in PICRUSt functional prediction! / 开普饭


Inference based PICRUSt accuracy varies across sample types and functional categories

https://www.biorxiv.org/content/10.1101/655746v1.full

The paper was posted as a preprint on bioRxiv on 30 May 2019.

Authors: Shan Sun, Roshonda B. Jones, Anthony A. Fodor

doi: https://doi.org/10.1101/655746

Department of Bioinformatics and Genomics, University of North Carolina, North Carolina, USA

Abstract: Background Despite recent decreases in the cost of sequencing, shotgun metagenome sequencing remains more expensive compared with 16S rRNA amplicon sequencing. Methods have been developed to predict the functional profiles of microbial communities based on their taxonomic composition, and PICRUSt is the most widely used of these techniques. In this study, we evaluated the performance of PICRUSt by comparing the significance of the differential abundance of functional gene profiles predicted with PICRUSt to those from shotgun metagenome sequencing across different environments.

Results We selected 7 datasets of human, non-human animal and environmental (soil) samples that have publicly available 16S rRNA and shotgun metagenome sequences. As we would expect based on previous literature, strong Spearman correlations were observed between gene compositions predicted with PICRUSt and measured with shotgun metagenome sequencing. However, these strong correlations were preserved even when the sample labels were shuffled. This suggests that simple correlation coefficient is a highly unreliable measure for the performance of algorithms like PICRUSt. As an alternative, we compared the performance of PICRUSt predicted genes to metagenome genes in inference models associated with metadata within each dataset. With this method, we found reasonable performance for human datasets, with PICRUSt performing better for inference on genes related to “house-keeping” functions. However, the performance of PICRUSt degraded sharply outside of human datasets when used for inference.

Conclusion We conclude that the utility of PICRUSt for inference with the default database is likely limited outside of human samples and that development of tools for gene prediction specific to different non-human and environmental samples is warranted.

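Background

Although the cost of high-throughput sequencing keeps falling, shotgun metagenome sequencing is still expensive compared with amplicon sequencing. In recent years, methods that predict functional profiles from microbial community composition have been developing steadily. PICRUSt is the most representative of these methods and is widely used. In this study, we evaluate PICRUSt by running differential abundance analysis on functional abundances measured by shotgun metagenome sequencing and on those predicted by PICRUSt from amplicon data, then comparing the two.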

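Results

From public databases we selected seven datasets that have both amplicon and shotgun metagenome data, covering three broad categories: human, non-human animal, and environmental samples. Earlier papers reported strong Spearman correlations between predicted functions and metagenome data, and we see the same here. However, the correlations stayed strong even after we shuffled the sample labels and recomputed them. This shows that a simple correlation coefficient is not a reliable measure of how well functional prediction tools such as PICRUSt perform. As an alternative, we compared PICRUSt-predicted genes with metagenome genes in inference models associated with the metadata within each dataset. By this measure, performance was reasonable for human samples, especially for genes related to housekeeping functions, but it dropped sharply for the non-human datasets.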
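To make the inference-based comparison concrete, here is a minimal sketch on synthetic data. It is not the authors' R code. The abstract describes comparing significance in inference models; this sketch swaps in a per-gene Welch t statistic as a simple stand-in, and every name and parameter (`make_table`, `tracks_effect`, `ignores_effect`, the noise levels) is a hypothetical choice made for illustration.

```python
import random
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two groups of values."""
    return (mean(a) - mean(b)) / (variance(a) / len(a) + variance(b) / len(b)) ** 0.5

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    return cov / (sum((p - mx) ** 2 for p in x) * sum((q - my) ** 2 for q in y)) ** 0.5

def gene_t_stats(table, is_case):
    """table[g][s] is the abundance of gene g in sample s."""
    stats = []
    for row in table:
        case = [v for v, c in zip(row, is_case) if c]
        ctrl = [v for v, c in zip(row, is_case) if not c]
        stats.append(welch_t(case, ctrl))
    return stats

rng = random.Random(2)
n_genes, n_samples = 200, 20
is_case = [s < n_samples // 2 for s in range(n_samples)]  # hypothetical metadata
baseline = [rng.gauss(0, 2) for _ in range(n_genes)]
effect = [rng.gauss(0, 1) for _ in range(n_genes)]  # assumed case-vs-control shift per gene

def make_table(effect_scale):
    """Synthetic gene-by-sample table; effect_scale=0 removes the group shift."""
    return [[baseline[g] + effect_scale * effect[g] * is_case[s] + rng.gauss(0, 0.5)
             for s in range(n_samples)] for g in range(n_genes)]

measured = make_table(1.0)         # stands in for the shotgun metagenome
tracks_effect = make_table(1.0)    # a prediction that captures the group shift
ignores_effect = make_table(0.0)   # a prediction that only captures the baseline

t_measured = gene_t_stats(measured, is_case)
agree_good = pearson(t_measured, gene_t_stats(tracks_effect, is_case))
agree_poor = pearson(t_measured, gene_t_stats(ignores_effect, is_case))
print(f"agreement when the prediction tracks the effect:  {agree_good:.2f}")
print(f"agreement when the prediction ignores the effect: {agree_poor:.2f}")
```

In this toy setup both "predictions" share the same gene-level baseline as the measured table, yet only the first agrees with it on which genes differ between groups. That is the kind of difference an inference-based evaluation is meant to expose.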

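Conclusion

With the default database, PICRUSt gives reasonable inference for human samples, but its usefulness is likely limited outside them. Prediction tools tailored to other sample types, environmental samples in particular, are well worth developing.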

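Availability of data and materials

The datasets analyzed in this study are publicly available; the repositories and accession numbers are listed in Table S1. The R scripts used in this study are available on GitHub (https://github.com/ssun6/Inference_picrust).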

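Main results

The figure below shows that functional prediction lacks the resolution to distinguish different samples of the same type.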

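Suppose we run functional prediction on soil samples. If shotgun metagenomes are available for the same samples, the Spearman correlation between predicted and measured gene abundances will be strong. What this paper shows is that such a strong correlation cannot be trusted: if the predicted gene data come from a different batch of soil samples, the correlation is just as high.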
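The effect is easy to reproduce on synthetic data. The sketch below is an illustration under stated assumptions, not the authors' analysis: it assumes that all samples of one type share a gene-level baseline that is large relative to sample-to-sample variation, and the sizes and noise levels are hypothetical.

```python
import random

def ranks(values):
    """Return 1-based ranks; tied values get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

rng = random.Random(0)
n_genes, n_samples = 300, 12

# Assumed structure: one shared gene-level baseline (log scale) per sample type,
# plus smaller sample-specific variation.
baseline = [rng.gauss(0.0, 2.0) for _ in range(n_genes)]
measured = [[b + rng.gauss(0.0, 0.5) for b in baseline] for _ in range(n_samples)]
# "Predicted" profile = that sample's measured profile plus prediction noise.
predicted = [[m + rng.gauss(0.0, 0.5) for m in sample] for sample in measured]

matched = [spearman(predicted[i], measured[i]) for i in range(n_samples)]

# Shuffle the labels with a cyclic shift so no prediction keeps its own metagenome.
shuffled = [spearman(predicted[i], measured[(i + 1) % n_samples])
            for i in range(n_samples)]

mean_matched = sum(matched) / n_samples
mean_shuffled = sum(shuffled) / n_samples
print(f"matched pairs:  mean rho = {mean_matched:.2f}")
print(f"shuffled pairs: mean rho = {mean_shuffled:.2f}")
```

Under these assumptions the shuffled pairs still correlate strongly, because the ranks are dominated by which genes are abundant in every sample of that type rather than by anything specific to one sample.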

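My reading is that this probably comes down to insufficient resolution in the reference database. One could take the authors' data, permute across different sample types, and recompute the correlations to see how they change between types. The doubt now is not only whether samples of the same type can be told apart, but whether predictions for different sample types can be distinguished at all. The datasets the authors provide could be used for this permutation. I will try it when I have time.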
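One way such a cross-type permutation could be set up is sketched below. It runs on synthetic data only, so the numbers say nothing about PICRUSt itself; the two sample types, the partly shared baseline, and all names (`simulate_type`, `mean_rho`, `soil_base`, `gut_base`) are hypothetical. With real data, the simulated tables would be replaced by the predicted and measured gene tables for each sample type.

```python
import random

def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2)/(n*(n^2-1)); valid only without ties."""
    def rank(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for position, index in enumerate(order):
            r[index] = position
        return r
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

rng = random.Random(1)
n_genes, n_per_type = 300, 6

def simulate_type(baseline):
    """Return (predicted, measured) tables for one sample type."""
    measured = [[b + rng.gauss(0, 0.5) for b in baseline] for _ in range(n_per_type)]
    predicted = [[m + rng.gauss(0, 0.5) for m in s] for s in measured]
    return predicted, measured

# Assumption: the two types share part of their gene-level baseline.
shared = [rng.gauss(0, 1.5) for _ in range(n_genes)]
soil_base = [s + rng.gauss(0, 1.5) for s in shared]
gut_base = [s + rng.gauss(0, 1.5) for s in shared]
soil_pred, soil_meas = simulate_type(soil_base)
gut_pred, gut_meas = simulate_type(gut_base)

def mean_rho(pred, meas, offset):
    """Mean rho when prediction i is paired with measurement (i + offset) mod n."""
    n = len(pred)
    rhos = [spearman_no_ties(pred[i], meas[(i + offset) % n]) for i in range(n)]
    return sum(rhos) / n

matched = mean_rho(soil_pred, soil_meas, 0)      # each sample with itself
within_type = mean_rho(soil_pred, soil_meas, 1)  # soil prediction vs another soil sample
cross_type = mean_rho(soil_pred, gut_meas, 0)    # soil prediction vs a gut sample
print(f"matched {matched:.2f}, within type {within_type:.2f}, cross type {cross_type:.2f}")
```

How far the cross-type value falls below the within-type value depends entirely on how much baseline the two types share, which is exactly the quantity the real permutation would need to measure.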

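Comparing predicted genes with the metagenome data shows high overlap only for the human data; for the other datasets only part of the genes overlap

Soil is the most striking case. More than 30% of the functional genes could not be predicted, and more than 60% of the genes were not detected in the original samples yet were produced by the prediction, which points to a serious false-positive problem.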
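The two percentages can be computed from gene presence lists with plain set operations. The sketch below is an assumption about how such numbers might be derived, since the post does not state the denominators: here the missed fraction is taken relative to the genes measured in the metagenome, and the extra fraction relative to the genes predicted. The function name `overlap_summary` and the gene IDs are hypothetical.

```python
def overlap_summary(predicted_genes, measured_genes):
    """Compare predicted and measured gene sets for one sample or dataset."""
    predicted, measured = set(predicted_genes), set(measured_genes)
    shared = predicted & measured
    missed = measured - predicted   # seen in the metagenome, not predicted
    extra = predicted - measured    # predicted, not seen in the metagenome
    return {
        "shared": len(shared),
        # Assumed denominator: genes detected in the metagenome.
        "missed_fraction": len(missed) / len(measured) if measured else 0.0,
        # Assumed denominator: genes produced by the prediction.
        "extra_fraction": len(extra) / len(predicted) if predicted else 0.0,
    }

# Made-up gene identifiers, for illustration only.
predicted = ["K00001", "K00002", "K00003", "K00004", "K00005"]
measured = ["K00002", "K00003", "K00006"]
summary = overlap_summary(predicted, measured)
print(summary)
```

A gene that is predicted but absent from the metagenome is not necessarily a false positive, because limited sequencing depth can also leave rare genes undetected, so the extra fraction is best read as an upper bound.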

The correlation between predicted and metagenome gene abundances differs by functional category

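Thoughts: if PICRUSt's predictions have major problems in soil samples, could a soil-specific database linking taxa to functions be developed? That may be one way to improve prediction accuracy. In any case, functional prediction for complex environmental samples still has a long way to go.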
