实战出真知——单细胞基础流程 / 开普饭

前言

最近在学单细胞测序，学习了B站生信技能树的单细胞教程，https://www.bilibili.com/video/BV1dt411Y7nn结合jimmy老师的代码，用GEO上的单细胞测序数据，做了一下分析。

我们选择的数据已经发表的文章题目是“Single cell RNA sequencing of human liver reveals distinct intrahepatic macrophage populations”，2018年发表在nature communications上，数据存在GSE115469。

1.下载并读入数据

在https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE115469下载数据文件”GSE115469_Data.csv.gz“

#一键清空 rm(list=ls()) options(stringsAsFactors = F) #读入并处理数据 a=read.csv('GSE115469_Data.csv') rownames(a)=a[,1] #将第一列基因名字作为列名 a=a[,-1]#去除第一列 dim(a) [1] 20007 8445 #8445个细胞，总共有20007个基因

2.载入seurat包，创建对象

注意，因为这篇文章作者已经上传了过滤后的矩阵，所以在 “CreateSeuratObject”函数中我们不需要用“min.cells =” 和“min.features = ”来过滤基因和细胞，关于数据如何处理的，详见https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM3178782%5D

library(Seurat) raw_sce <- CreateSeuratObject(counts = a)# save(raw_sce,file="raw_sce.Rdata") load("raw_sce.Rdata") raw_sce #An object of class Seurat #20007 features across 8444 samples within 1 assay #Active assay: RNA (20007 features, 0 variable features)

3.过滤基因和样本（这一步骤我们不做）

首先我们要找出线粒体基因和线粒体核糖体基因。

rownames(raw_sce)[grepl('^mt-',rownames(raw_sce),ignore.case = T)] rownames(raw_sce)[grepl('^Rp[sl]',rownames(raw_sce),ignore.case = T)] #其实在Seurat包中，PercentageFeatureSet函数可以用来提取并计算线粒体基因比例。 raw_sce[["percent.mt"]] <- PercentageFeatureSet(raw_sce, pattern = "^MT-") #提取并计算线粒体核糖体基因比例，为raw_sce添加线粒体核糖体基因注释信息。 rb.genes <- rownames(raw_sce)[grep("^RP[SL]",rownames(raw_sce),ignore.case = T)] C<-GetAssayData(object = raw_sce, slot = "counts") percent.ribo <- Matrix::colSums(C[rb.genes,])/Matrix::colSums(C)*100 raw_sce <- AddMetaData(raw_sce, percent.ribo, col.name = "percent.ribo")

作图

plot1 <- FeatureScatter(raw_sce, feature1 = "nCount_RNA", feature2 = "percent.mt") plot2 <- FeatureScatter(raw_sce, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") CombinePlots(plots = list(plot1, plot2))

继续作图，根据结果选择过滤标准

VlnPlot(raw_sce, features = c("percent.ribo", "percent.mt"), ncol = 2)

VlnPlot(raw_sce, features = c("nFeature_RNA", "nCount_RNA"), ncol = 2)

VlnPlot(raw_sce, features = c("percent.ribo", "nCount_RNA"), ncol = 2)

接下来我们可以通过subset函数过滤基因和样本，但是由于作者已经将过滤好了，所以我们不进行下一步subset操作。按照一般情况，都是设置一个阈值（10%）来过滤线粒体基因，但是由于肝细胞的特殊性，作者将阈值设置为50%，关于肝细胞阈值的选择，作者是这么解释的。

#过滤的代码如下： raw_sce <- subset(raw_sce, subset = nFeature_RNA >100 & nCount_RNA > 1000 & percent.mt < 50) #100，1000，50这3个数值要根据实际情况调整

4.标准化数据

sce=raw_sce sce<- NormalizeData(sce, normalization.method = "LogNormalize", scale.factor = 10000 ) GetAssay(sce,assay = "RNA") #Assay data with 20007 features for 8444 cells #First 10 features:RP11-34P13.7, FO538757.2, AP006222.2, RP4-669L17.10, RP5-857K21.4, RP11-206L10.9,LINC00115, FAM41C, RP11-54O7.1, RP11-54O7.3

5.找出变化较明显的基因

nfeatures选择多少取决于实际情况，这里我们选择5000

sce <- FindVariableFeatures(sce, selection.method = "vst", nfeatures = 5000)

6.对矩阵进行scale

由于保留的线粒体基因较多，因此我这里选择矫正的参数为"percent.mt","nCount_RNA"和"percent.ribo"试试看

sce <- ScaleData(sce,vars.to.regress=c("percent.mt","nCount_RNA","percent.ribo")) #运行scaleData之前要进行NormalizeData

6.PCA主成分

sce <- RunPCA(object = sce, pc.genes = VariableFeatures(sce)) DimHeatmap(sce, dims = 1:12, cells = 100, balanced = TRUE) #展示前12个主成分热图

判断选择的主成分数目，我选择19个

ElbowPlot(sce)

7.tSNE降维

#调整参数resolution = 0.55，得到20个cluster sce <- FindNeighbors(sce, dims = 1:19) sce <- FindClusters(sce, resolution = 0.55) #看看每个cluster的细胞数目 table(sce@meta.data$RNA_snn_res.0.55) > 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 1313 1047 768 682 634 517 517 514 489 484 349 286 167 137 118 107 97 93 18 19 90 35 set.seed(123) sce <- RunTSNE(object = sce, dims = 1:19, do.fast = TRUE) DimPlot(sce,reduction = "tsne",label=T)

看看文献的tSNE结果

给图片添加meta信息，结果貌似跟文献的图片结果差不多。P3TLH样本的细胞（绿色部分）基本独立存在，由3个cluster组成

phe=data.frame(cell=rownames(sce@meta.data), cluster =sce@meta.data$seurat_clusters) head(phe) table(phe$cluster) tsne_pos=Embeddings(sce,'tsne') DimPlot(sce,reduction = "tsne",group.by ='orig.ident')

再看看文献的图

DimPlot(sce,reduction = "tsne",label=T,split.by ='orig.ident')

#ggplot可视化 dat=cbind(tsne_pos,phe) library(ggplot2) p=ggplot(dat,aes(x=tSNE_1,y=tSNE_2,color=cluster))+geom_point(size=0.95) p=p+stat_ellipse(data=dat,aes(x=tSNE_1,y=tSNE_2,fill=cluster,color=cluster), geom = "polygon",alpha=0.2,level=0.9,type="t",linetype = 2,show.legend = F)+coord_fixed() theme= theme(panel.grid =element_blank()) + ##删去网格 theme(panel.border = element_blank(),panel.background = element_blank()) + ##删去外层边框 theme(axis.line = element_line(size=1, colour = "black")) p=p+theme+guides(colour = guide_legend(override.aes = list(size=5))) print(p)

#对marker基因进行可视化，这一步会把所有cluster的marker基因展示出来。这步骤运行时间较久 pro='first' table(sce@meta.data$seurat_clusters) for( i in unique(sce@meta.data$seurat_clusters) ){ markers_df <- FindMarkers(object = sce, ident.1 = i, min.pct = 0.25) print(x = head(markers_df)) markers_genes = rownames(head(x = markers_df, n = 5)) VlnPlot(object = sce, features =markers_genes,log =T ) ggsave(filename=paste0(pro,'_VlnPlot_subcluster_',i,'_sce.markers_heatmap.pdf')) FeaturePlot(object = sce, features=markers_genes ) ggsave(filename=paste0(pro,'_FeaturePlot_subcluster_',i,'_sce.markers_heatmap.pdf')) }

#提取marke基因，min.ct：设定研究的基因在两组细胞中的最小占比。 sce.markers <- FindAllMarkers(object = sce, only.pos = TRUE, min.pct = 0.25, thresh.use = 0.25)#only.pos = TRUE：只返回positive的marker基因 print(x = head(sce.markers)) > p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene GSTA1 0 1.283148 0.865 0.199 0 0 GSTA1 APOM 0 1.222117 0.854 0.185 0 0 APOM ANG 0 1.154127 0.986 0.374 0 0 ANG TAT-AS1 0 1.143226 0.953 0.237 0 0 TAT-AS1 UGT2B7 0 1.130114 0.841 0.182 0 0 UGT2B7 CYP3A4 0 1.126983 0.776 0.225 0 0 CYP3A4


DT::datatable(sce.markers)

write.csv(sce.markers,file=paste0('_sce.markers_tsne.csv'))

#可视化marker基因,我们选择ALB基因可视化看看 markers_genes = rownames(sce.markers[c("ALB"),]) VlnPlot(object = sce, features =markers_genes,log=T,ncol=2)

FeaturePlot(object = sce, features =markers_genes, ncol=2)

#热图可视化marker基因的表达差异 library(dplyr) top5 <- sce.markers %>% group_by(cluster) %>% top_n(5, avg_logFC) DoHeatmap(sce,top5$gene,size=2) ggsave(filename=paste0('_sce.markers_——stne_heatmap.pdf')) save(sce,sce.markers,file = 'last_sce_tsne.Rdata')

8.细胞周期注释

sce_cc <- CellCycleScoring(sce, s.features = cc.genes$s.genes, g2m.features =cc.genes$ g2m.genes, set.ident = F) head(sce_cc[[]]) DimPlot(sce_cc,reduction = "tsne",group.by ='Phase',cols=c("yellow","#458B74","#483D8B"))

再跟文献的图对比一下,可以看到中间那团大细胞群体基本都在G1期，跟我们的结果还是相似的。

体会

不同病种细胞的过滤标准是不一样的，本篇文献在过滤肝细胞的线粒体基因阈值选择50%,这个值是比较大的，原因作者在文献中也给出了解释。其实想想也是，肝脏是一个代谢旺盛的器官，本身需要较多的能量，线粒体是细胞中制造能量的结构，要是阈值设置太低，会把一些重要的基因忽略掉。
关于在tSNE中resolution 和PCA主成分选择等参数的调整，jimmy老师在公众号已经说得很清楚（CNS图表复现04—单细胞聚类分群的resolution参数问题）我就不再阐述了。记住：Don't Let the Tail Wag the Dog .

实战出真知——单细胞基础流程

前言