差异分析及可视化 / 开普饭

课程笔记

粉丝：有单细胞线上课程吗？

小编：什么？我们的单细胞转录组分析线上课程已经上线好久了，你们竟然都不知道吗，每篇推文后面的课程推荐没人看的吗，小编已哭晕在厕所

好了，戏演完了，下面郑重介绍下我们的单细胞线上课程：（详情戳下方链接）

全网第二个单细胞视频课程预售

这个课程笔记栏目记录了学员们学习单细胞转录组课程的学习笔记

希望大家能有所收获！

作者 | 单细胞天地小编刘小泽

课程链接在：http://jm.grazy.cn/index/mulitcourse/detail.html?cid=55
第二单元第9讲：细胞亚群之间的差异分析

前言

这次的任务是模仿原文的：

第五张：比较两种CD8+细胞差异

在分群结果可以看到，CD8+主要分成了两群，一个是红色的（170个CD8+ cytotoxic T cells，即细胞毒性T细胞），一个是浅蓝色的（429个CD8+ effector T cells，即效应T细胞）

【图片注释：Heat map of selected significantly differentially expressed genes comparing CD8+ T cells in the red activated cluster (n = 170) to those in the blue effector/EM cluster (n = 429) at response (day + 376) 】

首先进行差异分析

准备数据

=> SubsetData()取子集

既然是分析的day +376数据，那么就先把这一小部分数据提取出来：

1rm(list = ls()) 2options(warn=-1) 3suppressMessages(library(Seurat)) 4load('./patient1.PBMC.output.Rdata') 5 6PBMC_RespD376 = SubsetData(PBMC,TimePoints =='PBMC_RespD376') 7> table(PBMC_RespD376@ident) 8 9 0 1 2 3 4 5 6 7 8 9 10 11 12 10800 433 555 677 636 516 119 324 204 200 170 11 39

然后这个图是分析了红色和浅蓝色的两群，结合之前得到的分群结果，红色是第10群，浅蓝色是第4群

于是再用这个函数提取出来第4、10群

1PBMC_RespD376_for_DEG = SubsetData(PBMC_RespD376, 2 PBMC_RespD376@ident %in% c(4,10))

利用monocle V2构建对象

=> newCellDataSet()

我们需要三样东西：表达矩阵、细胞信息、基因信息

首先来看表达矩阵：上面取到的PBMC_RespD376_for_DEG中包含了多个数据接口，单是表达矩阵相关就有三个

关于这三者的不同：

We view object@raw.data as a storage slot for the non-normalized data, upon which we perform normalization and return object@data.（来自：https://github.com/satijalab/seurat/issues/351）

也就是说，raw.data是最原始的矩阵，data是归一化之后的，scale.data是标准化后的（一般是z-score处理）

data：The normalized expression matrix (log-scale)

scale.data ：scaled (default is z-scoring each gene) expression matrix; used for dimmensional reduction and heatmap visualization

（来自：seurat的各个接口含义）

我们使用log处理过的归一化表达矩阵

1count_matrix=PBMC_RespD376_for_DEG@data 2> dim(count_matrix) 3[1] 17712 806

然后，看细胞信息

1# 细胞分群信息 2cluster=PBMC_RespD376_for_DEG@ident 3> table(cluster) 4cluster 5 4 10 6636 170 7# 细胞名称（barcode名称） 8names(count_matrix)

最后是基因信息

1gene_annotation <- as.data.frame(rownames(count_matrix))

万事俱备，开始monocle

1library(monocle) 2> packageVersion('monocle') 3[1] '2.12.0’ 4 5# 1.表达矩阵 6expr_matrix <- as.matrix(count_matrix) 7# 2.细胞信息 8sample_ann <- data.frame(cells=names(count_matrix), 9 cellType=cluster) 10rownames(sample_ann)<- names(count_matrix) 11# 3.基因信息 12gene_ann <- as.data.frame(rownames(count_matrix)) 13rownames(gene_ann)<- rownames(count_matrix) 14colnames(gene_ann)<- "genes" 15 16# 然后转换为AnnotatedDataFrame对象 17pd <- new("AnnotatedDataFrame", 18 data=sample_ann) 19fd <- new("AnnotatedDataFrame", 20 data=gene_ann) 21 22# 最后构建CDS对象 23sc_cds <- newCellDataSet( 24 expr_matrix, 25 phenoData = pd, 26 featureData =fd, 27 expressionFamily = negbinomial.size(), 28 lowerDetectionLimit=1)

关于其中的参数expressionFamily ：负二项分布有两种方法，这里选用了negbinomial.size ；另外一种negbinomial稍微更准确一点，但速度大打折扣，它主要针对非常小的数据集

monocle V2质控过滤

=> detectGenes() + subset()

1cds=sc_cds 2cds <- detectGenes(cds, min_expr = 0.1) 3# 结果保存在cds@featureData@data 4> print(head(cds@featureData@data)) 5 genes num_cells_expressed 6RP11-34P13.7 RP11-34P13.7 0 7FO538757.2 FO538757.2 20 8AP006222.2 AP006222.2 11 9RP4-669L17.10 RP4-669L17.10 0 10RP11-206L10.9 RP11-206L10.9 5 11LINC00115 LINC00115 0

在monocle版本2.12.0中，取消了fData函数（此前在2.10版本中还存在）。如果遇到不能使用fData的情况，就可以采用备选方案：cds@featureData@data

然后进行基因过滤 =>subset()

1expressed_genes <- row.names(subset(cds@featureData@data, 2 num_cells_expressed >= 5)) 3 4> length(expressed_genes) 5[1] 12273 6 7cds <- cds[expressed_genes,]

monocle V2 聚类

step1：dispersionTable() 目的是判断使用哪些基因进行细胞分群

当然可以使用全部基因，但这会掺杂很多表达量不高而检测不出来的基因，反而会增加噪音。最好是挑有差异的，挑表达量不太低的

1cds <- estimateSizeFactors(cds) 2cds <- estimateDispersions(cds) 3disp_table <- dispersionTable(cds) # 挑有差异的 4unsup_clustering_genes <- subset(disp_table, mean_expression >= 0.1) # 挑表达量不太低的 5cds <- setOrderingFilter(cds, unsup_clustering_genes$gene_id) # 准备聚类基因名单 6plot_ordering_genes(cds) 7# 图中黑色的点就是被标记出来一会要进行聚类的基因

[可省略]step2：plot_pc_variance_explained() 选一下主成分

1plot_pc_variance_explained(cds, return_all = F) # norm_method='log'

step3: 聚类

1# 进行降维 2cds <- reduceDimension(cds, max_components = 2, num_dim = 6, 3 reduction_method = 'tSNE', verbose = T) 4# 进行聚类 5cds <- clusterCells(cds, num_clusters = 4) 6# Distance cutoff calculated to 1.812595 7plot_cell_clusters(cds, 1, 2, color = "cellType")

monocle V2 差异分析

=> differentialGeneTest()

1# 这个过程比较慢！ 2start=Sys.time() 3diff_test_res <- differentialGeneTest(cds, 4 fullModelFormulaStr = "~cellType") 5end=Sys.time() 6end-start 7# Time difference of 5.729718 mins

然后得到差异基因

1sig_genes <- subset(diff_test_res, qval < 0.1) 2> nrow(sig_genes) 3[1] 635 4# 最后会得到635个差异基因 5 6> head(sig_genes[,c("genes", "pval", "qval")] ) 7 genes pval qval 8ISG15 ISG15 1.493486e-03 0.040823046 9CCNL2 CCNL2 2.228521e-03 0.055590716 10MIB2 MIB2 4.954659e-05 0.002523176 11MMP23B MMP23B 1.009464e-04 0.004657577 12TNFRSF25 TNFRSF25 1.608900e-04 0.006785583 13CAMTA1 CAMTA1 4.339489e-03 0.090162380

进行热图可视化

文章使用的基因如下（也就是第一张图中显示的）

1htmapGenes=c( 2 'GAPDH','CD52','TRAC','IL32','ACTB','ACTG1','COTL1', 3 'GZMA','GZMB','GZMH','GNLY' 4) 5# 看到作者挑选的这些基因也都出现在我们自己分析的结果中 6> htmapGenes %in% rownames(sig_genes) 7 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

下面开始画图

先画一个最最最原始的

1library(pheatmap) 2dat=count_matrix[htmapGenes,] 3pheatmap(dat)

需要修改的地方：列名（barcode）不需要、聚类不需要、数据需要z-score

再来一版

1n=t(scale(t(dat))) 2# 规定上下限（和原文保持一致） 3n[n>2]=2 4n[n< -1]= -1 5> n[1:4,1:4] 6 AAACCTGCAACGATGG.3 AAACCTGGTCTCCATC.3 AAACGGGAGCTCCTTC.3 7GAPDH -1.0000000 -0.06057709 -0.37324949 8CD52 0.6676167 0.31653833 0.04253204 9TRAC 0.7272507 0.63277670 -0.51581312 10IL32 -1.0000000 0.42064114 -0.16143114 11 AAAGCAATCATATCGG.3 12GAPDH -1.0000000 13CD52 -1.0000000 14TRAC -0.5158131 15IL32 0.1548693 16 17# 加上细胞归属注释（ac = annotation column），去掉列名和聚类 18ac=data.frame(group=cluster) 19rownames(ac)=colnames(n) 20 21pheatmap(n,annotation_col = ac, 22 show_colnames =F, 23 show_rownames = T, 24 cluster_cols = F, 25 cluster_rows = F)