phyloseq入门
翻译:文涛
写在前面: 最近一段时间面临着各种各样的问题和挑战,总在寻求一种可以权衡,理解,解释,解决之道。
phyloseq:使用R语言分析微生物群落(microbiome census data) 目前对微生物群落的分析有许多挑战:使用生态学,遗传学,系统发育学,网络分析等方法对不同类型的微生物群落数据进行整合,可视化分析(visualization and testing)。微生物群落数据本身可能来源广泛,例如:人体微生物,土壤,海面及其海水,污水处理厂,工业设施等; 因此,不同来源微生物群落就其实验设计和科学问题可能有非常大的差异。phyloseq软件包是一个对在聚类OTU后的微生物群落(数据包括:OTU表格,系统发育树,OTU注释信息)进行下游系统发育等分析的综合性工具,集成了微生物群落数据导入,存储,分析和出图等功能。该软件包利用R中的许多工具进行生态学和系统发育分析(vegan, ade4, ape, picante),同时还使用先进/灵活的图形系统(ggplot2)轻松的对系统发育数据绘制发表级别的图表。phyloseq使用S4类的系统将所有相关的系统发育测序数据存储为单个实验级对象,从而更容易共享数据并重现分析。通常,phyloseq寻求促进使用R进行OTU聚类的高通量系统发育测序数据的有效探索和可重复分析。
具体地说,phyloseq具体功能如下:
从流行的Denoising / OTU集群管道导入丰度和相来自去噪 或者一下OTU聚类流程的结果(DADA2,UPARSE,QIIME,mothur,BIOM,PyroTagger,RDP等);
打包了了微生物群落常规分析;
支持44种距离的方法(UniFrac,Jensen-Shannon等);
排序->支持多种方法,包括约束排序方法;
基于群落数据发的相关绘图使用ggplot2进行强大,灵活的探索性分析;
模块化,可定制的分析过程,完全支持可重复的工作。
具有基于OTU /样本进行数据合并,以及支持手动导入的数据的功能。
UniFrac距离的多线程并行计算。
针对高通量扩增子测序数据的多种方法的尝试。
使用真实发表数据进行分析和绘图的案例。
phyloseq包在GitHub上是开源的: https://github.com/joey711/phyloseq
作者尽力在官方文档引用以及其他任何合适的地方突出为此包贡献力量的作者。所以请随意分享并做出贡献!
phyloseq包安装
#可选
source('http://bioconductor.org/biocLite.R')
biocLite('phyloseq')
#开发
library("devtools")
install_github("phyloseq/joey711")
#版本信息
packageVersion('phyloseq')
library("phyloseq")
library("phyloseq"); packageVersion("phyloseq")
library("ggplot2"); packageVersion("ggplot2")
#设置默认绘图主题,ggplot默认不是这个
theme_set(theme_bw())
导入文件
使用?import便可查看全部数据导入说明及其案例。目前各大主流格式的数据均可导入phyloseq;
BIOM: http://www.biom-format.org/
mothur: http://www.mothur.org/wiki/Main_Page
PyroTagger: http://pyrotagger.jgi-psf.org/
QIIME: http://qiime.org/
RDP pipeline: http://pyro.cme.msu.edu/index.jsp
使用对应的工具
BIOM: import_biom;
mothur: import_mothur和辅助工具 show_mothur_cutoffs;import_mothur_dist export_mothur_dist;
PyroTagger: import_pyrotagger_tab;
QIIME : import_qiime;
RDP: import_RDP_cluster。 如下是phyloseq数据存储和操作的原理示意图:
用官网的话说就是:如果你可以将数据导入R中,就可以将其导入phyloseq。主要有五种数据类型,分别是以矩阵形式的otu table和tax(物种注释)文件;数据框形式的mapping文件;另外还有树文件和代表序列文件
数据导入(implot)
构造otu_table
# Create a pretend OTU table that you read from a file, called otumat
otumat = matrix(sample(1:100, 100, replace = TRUE), nrow = 10, ncol = 10)
otumat
rownames(otumat) <- paste0("OTU", 1:nrow(otumat))
colnames(otumat) <- paste0("Sample", 1:ncol(otumat))
otumatSample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10
OTU1 58 41 93 2 69 4 30 18 81 37
OTU2 45 98 80 96 41 87 74 13 64 41
OTU3 86 11 70 70 82 3 57 43 87 51
OTU4 4 23 22 59 82 82 51 83 40 38
OTU5 64 10 3 58 12 68 25 49 75 78
OTU6 3 85 19 26 19 41 58 21 97 97
OTU7 42 7 17 84 43 22 15 89 4 73
OTU8 13 78 55 65 20 91 5 64 35 43
OTU9 55 89 65 86 44 55 8 82 80 2
OTU10 59 35 8 7 89 62 100 17 34 58
构造tax文件
taxmat = matrix(sample(letters, 70, replace = TRUE), nrow = nrow(otumat), ncol = 7)
rownames(taxmat) <- rownames(otumat)
colnames(taxmat) <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")
taxmatDomain Phylum Class Order Family Genus Species
OTU1 "z" "b" "p" "p" "o" "o" "k"
OTU2 "e" "q" "c" "s" "g" "z" "m"
OTU3 "j" "g" "p" "g" "w" "r" "s"
OTU4 "o" "w" "h" "x" "p" "p" "k"
OTU5 "m" "d" "i" "c" "g" "r" "l"
OTU6 "z" "v" "r" "f" "c" "x" "r"
OTU7 "r" "k" "g" "j" "i" "g" "i"
OTU8 "e" "h" "m" "m" "q" "e" "m"
OTU9 "t" "u" "r" "f" "p" "x" "d"
OTU10 "b" "o" "w" "b" "b" "k" "e"
otu_table和tax_table函数构造phyloseq对象并加入phyloseq
OTU = otu_table(otumat, taxa_are_rows = TRUE)
TAX = tax_table(taxmat)
OTU
TAX
#加入phyloseq体系
physeq = phyloseq(OTU, TAX)
physeq
格式文件
OTU Table: [10 taxa and 10 samples]
taxa are rows
Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10
OTU1 58 41 93 2 69 4 30 18 81 37
· · · · · · ·
Taxonomy Table: [10 taxa by 7 taxonomic ranks]:
Domain Phylum Class Order Family Genus Species
OTU1 "z" "b" "p" "p" "o" "o" "k"
· · · · · · ·
phyloseq-class experiment-level object
otu_table() OTU Table: [ 10 taxa and 10 samples ]
tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
· · · · · · ·
以科水平做柱状图
默认展示的是count数量
plot_bar(physeq, fill = "Family")
构建样品及其分组文件(mapping)
sampledata = sample_data(data.frame(
Location = sample(LETTERS[1:4], size=nsamples(physeq), replace=TRUE),
Depth = sample(50:1000, size=nsamples(physeq), replace=TRUE),
row.names=sample_names(physeq),
stringsAsFactors=FALSE
))
sampledataSample Data: [10 samples by 2 sample variables]:
Location Depth
Sample1 C 137
Sample2 B 901
Sample3 D 307
Sample4 A 200
Sample5 B 503
Sample6 B 977
Sample7 C 506
Sample8 B 226
Sample9 B 849
Sample10 A 220
构建随机树合并phyloseq
library("ape")
random_tree = rtree(ntaxa(physeq), rooted=TRUE, tip.label=taxa_names(physeq))
plot(random_tree)
physeq1 = merge_phyloseq(physeq, sampledata, random_tree)
physeq1
#可重新构建元素
physeq2 = phyloseq(OTU, TAX, sampledata, random_tree)
physeq2
#两者是否相同,很明显是一样的
identical(physeq1, physeq2)
使用phyloseq展示进化树和count丰度热图
lot_tree(physeq1, color="Location", label.tips="taxa_names", ladderize="left", plot.margin=0.3)
plot_tree(physeq1, color="Depth", shape="Location", label.tips="taxa_names", ladderize="right", plot.margin=0.3)
plot_heatmap(physeq1)
plot_heatmap(physeq1, taxa.label="Phylum")
以上部分我们对phyloseq对象及其函数作了一个初步的学习,数据导入部分还有biom格式的较为常见,如有需求,请查看:http://joey711.github.io/phyloseq/import-data.html
boim格式文件导入
#基于biom格式数据导入
rich_dense_biom = system.file("extdata", "rich_dense_otu_table.biom", package="phyloseq")
rich_sparse_biom = system.file("extdata", "rich_sparse_otu_table.biom", package="phyloseq")
min_dense_biom = system.file("extdata", "min_dense_otu_table.biom", package="phyloseq")
min_sparse_biom = system.file("extdata", "min_sparse_otu_table.biom", package="phyloseq")
treefilename = system.file("extdata", "biom-tree.phy", package="phyloseq")
refseqfilename = system.file("extdata", "biom-refseq.fasta", package="phyloseq")
import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
myData = import_biom(rich_dense_biom, treefilename, refseqfilename, parseFunction=parse_taxonomy_greengenes)
myData
sample_data(myData)
plot_tree(myData, color="Genus", shape="BODY_SITE", size="abundance")
plot_richness(myData, x="BODY_SITE", color="Description")plot_bar(myData, fill="Genus")
refseq(myData)
A DNAStringSet instance of length 5
width seq names
[1] 334 AACGTAGGTCACAAGCGTTGTCCGGAATTACTGGGTGTAAAGG...ACTAGGTGTGGGGGTCTGACCCCTTCCGTGCCGGAGTTAACAC GG_OTU_1
[2] 465 TACGTAGGGAGCAAGCGTTATCCGGATTTATTGGGTGTAAAGG...TTAATTCGAAGCAACGCGAAGAACCTTACCAGGGCTTGACATA GG_OTU_2
[3] 249 TACGTAGGGGGCAAGCGTTATCCGGATTTACTGGGTGTAAAGG...TACTGGACTGTAACTGACGTTGAGGCTCGAAAGCGTGGGGAGC GG_OTU_3
[4] 453 TACGTATGGTGCAAGCGTTATCCGGATTTACTGGGTGTAAAGG...AGCGGTGAGCATGTGGTTAATCGAAGCAACGCGAAGAACCTTA GG_OTU_4
[5] 178 AACGTAGGGTGCAAGCGTTGTCCGGAATTACTGGGTGTAAAGG...GATGGAATTCCCGGTGTAGCGGTGGAATGCGTAGATATCGGGA GG_OTU_5
基于qiime输出文件导入
otufile = system.file("extdata", "GP_otu_table_rand_short.txt.gz", package="phyloseq")
mapfile = system.file("extdata", "master_map.txt", package="phyloseq")
trefile = system.file("extdata", "GP_tree_rand_short.newick.gz", package="phyloseq")
rs_file = system.file("extdata", "qiime500-refseq.fasta", package="phyloseq")
qiimedata = import_qiime(otufile, mapfile, trefile, rs_file)
qiimedataplot_bar(qiimedata, x="SampleType", fill="Phylum")
基于其他格式的文件导入参见官网教程:https://joey711.github.io/phyloseq/import-data.html
使用phyloseq的例子
#使用phyloseq
data(GlobalPatterns)
data(esophagus)
data(enterotype)
data(soilrep)?GlobalPatterns
example(enterotype, ask=FALSE)
?make_network
ig <- make_network(enterotype, "samples", max.dist=0.3)
plot_network(ig, enterotype, color="SeqTech", shape="Enterotype", line_weight=0.3, label=NULL)
phyloseq按照一定要求合并数据
个人认为phyloseq可以很好的对数据进行清洗,并且连带关系处理的不错 otu丰度信息在合并的OTU表格中是相加的关系
##phyloseq合并数据
data(GlobalPatterns)
GP = GlobalPatterns
#数据过滤,保留count大于0的OTU
GP = prune_taxa(taxa_sums(GlobalPatterns) > 0, GlobalPatterns)
humantypes = c("Feces", "Mock", "Skin", "Tongue")
sample_data(GP)$human <- get_variable(GP, "SampleType") %in% humantypes
#按照SampleType合并样品
mergedGP = merge_samples(GP, "SampleType")
SD = merge_samples(sample_data(GP), "SampleType")
print(SD[, "SampleType"])
sample_data(mergedGP)OTUnames10 = names(sort(taxa_sums(GP), TRUE)[1:10])
GP10 = prune_taxa(OTUnames10, GP)
mGP10 = prune_taxa(OTUnames10, mergedGP)
ocean_samples = sample_names(subset(sample_data(GP), SampleType=="Ocean"))
print(ocean_samples)
otu_table(GP10)[, ocean_samples]> otu_table(GP10)[, ocean_samples]
OTU Table: [10 taxa and 3 samples]
taxa are rows
NP2 NP3 NP5
329744 91 126 120
317182 3148 12370 63084
549656 5045 10713 1784
279599 113 114 126
360229 16 83 786
94166 49 128 709
550960 11 86 65
158660 13 39 28
331820 24 101 105
189047 4 33 29rowSums(otu_table(GP10)[, ocean_samples])
otu_table(mGP10)["Ocean", ]> otu_table(mGP10)["Ocean", ]
OTU Table: [10 taxa and 1 samples]
taxa are columns
329744 317182 549656 279599 360229 94166 550960 158660 331820 189047
Ocean 337 78602 17542 353 885 886 162 80 230 66plot_richness(GP, "human", "SampleType", title="unmerged")
sample_data(mergedGP)$SampleType = sample_names(mergedGP)
sample_data(mergedGP)$human = sample_names(mergedGP) %in% humantypes
plot_richness(mergedGP, "human", "SampleType", title="merged")
根据进化关系对OTU进行合并
load("example-data.RData")
plot_tree(closedps, color="Treatment", size="abundance", sizebase=2, label.tips="taxa_names")
x1 = merge_taxa(closedps, taxa_names(closedps)[3:27], 2)
plot_tree(x1, color="Treatment", size="abundance", sizebase=2, label.tips="taxa_names")
基于phyloseq数据合并
#合并phyloseq对象
data(GlobalPatterns)
tree = phy_tree(GlobalPatterns)
tax = tax_table(GlobalPatterns)
otu = otu_table(GlobalPatterns)
sam = sample_data(GlobalPatterns)
otutax = phyloseq(otu, tax)
otutax
GP2 = merge_phyloseq(otutax, sam, tree)
identical(GP2, GlobalPatterns)
##更加复杂的方式
otusamtree = phyloseq(otu, sam, tree)
GP3 = merge_phyloseq(otusamtree, otutax)
GP3
identical(GP3, GlobalPatterns)
GP4 = merge_phyloseq(otusamtree, tax_table(otutax))
GP4
identical(GP4, GlobalPatterns)
第三部分 访问数据和数据预处理
phyloseq系统内数据查看
##访问数据及其数据处理
data("GlobalPatterns")
GlobalPatterns
#查看OTU数量
ntaxa(GlobalPatterns)
#查看样品数量
nsamples(GlobalPatterns)
#样品名查看,此处查看前五个样品名称
sample_names(GlobalPatterns)[1:5]
#查看tax的分类等级信息
rank_names(GlobalPatterns)
#查看mapping文件表头(样品信息文件列名)
sample_variables(GlobalPatterns)
#查看部分OTU表格矩阵
otu_table(GlobalPatterns)[1:5, 1:5]
#查看部分注释(tax)文件矩阵
tax_table(GlobalPatterns)[1:5, 1:4]
#进化树查看,注意不是可视化
phy_tree(GlobalPatterns)
#查看OTU名称,此处查看前十个
taxa_names(GlobalPatterns)[1:10]
#按照丰度提取前十个OTU,并可视化进化树
myTaxa = names(sort(taxa_sums(GlobalPatterns), decreasing = TRUE)[1:10])
ex1 = prune_taxa(myTaxa, GlobalPatterns)
plot(phy_tree(ex1), show.node.label = TRUE)
数据的预处理
#数据预处理
#转化OTUcount数为相对丰度
GPr = transform_sample_counts(GlobalPatterns, function(x) x / sum(x) )
otu_table(GPr)[1:5][1:5]
#提取丰度大于十万份之一的OTU
GPfr = filter_taxa(GPr, function(x) mean(x) > 1e-5, TRUE)
#提取指定分类的OTU
GP.chl = subset_taxa(GlobalPatterns, Phylum=="Chlamydiae")
#提取总count数量大于20的样品
GP.chl = prune_samples(sample_sums(GP.chl)>=20, GP.chl)
#合并指定OTU: taxa_names(GP.chl)[1:5]为一个OTU
GP.chl.merged = merge_taxa(GP.chl, taxa_names(GP.chl)[1:5])
# gpsfbg = tax_glom(gpsfb, "Family")
# plot_tree(gpsfbg, color="SampleType", shape="Class", size="abundance")
transform_sample_counts(GP.chl, function(OTU) OTU/sum(OTU) )
#去除至少20%样本中未见过3次以上的OTU,注意编写函数的形式
GP = filter_taxa(GlobalPatterns, function(x) sum(x > 3) > (0.2*length(x)), TRUE)
#对样品分组文件mapping添加一列
sample_data(GP)$human = factor( get_variable(GP, "SampleType") %in% c("Feces", "Mock", "Skin", "Tongue") )
#测序深度进行统一
total = median(sample_sums(GP))
standf = function(x, t=total) round(t * (x / sum(x)))
gps = transform_sample_counts(GP, standf)
sample_sums(gps)
#变异系数过滤OTU,减少一些意外OTU误差
gpsf = filter_taxa(gps, function(x) sd(x)/mean(x) > 3.0, TRUE)
#提取指定门类OTU
gpsfb = subset_taxa(gpsf, Phylum=="Bacteroidetes")
title = "plot_bar; Bacteroidetes-only"
plot_bar(gpsfb, "SampleType", "Abundance", title=title)
基于微生物群落数据距离的计算
##基于phyloseq的微生物群落聚类的计算
library("plyr"); packageVersion("plyr")
data(enterotype)
enterotype <- subset_taxa(enterotype, Genus != "-1")
#我们进行查看发现有很多的距离算法
dist_methods <- unlist(distanceMethodList)
print(dist_methods)
#删除需要tree的距离算法
# Remove them from the vector
dist_methods <- dist_methods[-(1:3)]
# 删除需要用户自定义的距离算法
dist_methods["designdist"]
dist_methods = dist_methods[-which(dist_methods=="ANY")]
plist <- vector("list", length(dist_methods))
names(plist) = dist_methods
for( i in dist_methods ){
# Calculate distance matrix
iDist <- distance(enterotype, method=i)
# Calculate ordination
iMDS <- ordinate(enterotype, "MDS", distance=iDist)
## Make plot
# Don't carry over previous plot (if error, p will be blank)
p <- NULL
# Create plot, store as temp variable, p
p <- plot_ordination(enterotype, iMDS, color="SeqTech", shape="Enterotype")
# Add title to each plot
p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
# Save the graphic to file.
plist[[i]] = p
}
df = ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=SeqTech, shape=Enterotype))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p
df = ldply(plist, function(x) x$data)
names(df)[1] <- "distance"
p = ggplot(df, aes(Axis.1, Axis.2, color=Enterotype, shape=SeqTech))
p = p + geom_point(size=3, alpha=0.5)
p = p + facet_wrap(~distance, scales="free")
p = p + ggtitle("MDS on various distance metrics for Enterotype dataset")
p
#jsd距离查看效果
print(plist[["jsd"]])
print(plist[["jaccard"]])
print(plist[["bray"]])
差异统计
library("phyloseq"); packageVersion("phyloseq")
library("cluster"); packageVersion("cluster")
library("ggplot2"); packageVersion("ggplot2")
##统计差异
theme_set(theme_bw())
# Load data
data(enterotype)
#jsd距离进行MDS排序
exord = ordinate(enterotype, method="MDS", distance="jsd")
pam1 = function(x, k){list(cluster = pam(x,k, cluster.only=TRUE))}
x = phyloseq:::scores.pcoa(exord, display="sites")
# gskmn = clusGap(x[, 1:2], FUN=kmeans, nstart=20, K.max = 6, B = 500)
gskmn = clusGap(x[, 1:2], FUN=pam1, K.max = 6, B = 50)
gap_statistic_ordination = function(ord, FUNcluster, type="sites", K.max=6, axes=c(1:2), B=500, verbose=interactive(), ...){
require("cluster")
# If "pam1" was chosen, use this internally defined call to pam
if(FUNcluster == "pam1"){
FUNcluster = function(x,k) list(cluster = pam(x, k, cluster.only=TRUE))
}
# Use the scores function to get the ordination coordinates
x = phyloseq:::scores.pcoa(ord, display=type)
# If axes not explicitly defined (NULL), then use all of them
if(is.null(axes)){axes = 1:ncol(x)}
# Finally, perform, and return, the gap statistic calculation using cluster::clusGap
clusGap(x[, axes], FUN=FUNcluster, K.max=K.max, B=B, verbose=verbose, ...)
}
plot_clusgap = function(clusgap, title="Gap Statistic calculation results"){
require("ggplot2")
gstab = data.frame(clusgap$Tab, k=1:nrow(clusgap$Tab))
p = ggplot(gstab, aes(k, gap)) + geom_line() + geom_point(size=5)
p = p + geom_errorbar(aes(ymax=gap+SE.sim, ymin=gap-SE.sim))
p = p + ggtitle(title)
return(p)
}
gs = gap_statistic_ordination(exord, "pam1", B=50, verbose=FALSE)
print(gs, method="Tibs2001SEmax")
plot_clusgap(gs)
plot(gs, main = "Gap statistic for the 'Enterotypes' data")
mtext("Looks like 4 clusters is best, with 3 and 5 close runners up.")