5个肺癌细胞系单细胞混合测序也可以轻松拆分 / 开普饭

考虑到咱们生信技能树粉丝对单细胞数据挖掘的需求，我开通了一个专栏《100个单细胞转录组数据降维聚类分群图表复现》，也亲自示范了几个，不过自己带娃，读博，时间精力有限，所以把剩余的90多个任务安排了学徒，实习生，学员。真的是太棒了，群策群力！

下面七月份学徒的投稿

今天介绍的单细胞转录组文章是2018年8月发表在《PLoS Comput Biol》杂志的R包：《scPipe: A flexible R/Bioconductor preprocessing pipeline for single-cell RNA-sequencing data》，链接是：https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1006361

scPipe是处理单细胞RNA-seq数据的流程，从fastq文件开始，产生带有相关质量控制信息的基因表达矩阵。可处理由CEL-seq、MARS-seq、Drop-seq、Chromium 10x和SMART-seq的fastq数据。

有意思的是，听起来如此强大的工具似乎并没有什么存在感，大家通常是5个R包，分别是: scater,monocle,Seurat,scran,M3Drop，需要熟练掌握它们的对象，：一些单细胞转录组R包的对象而且分析流程也大同小异：

step1: 创建对象
step2: 质量控制
step3: 表达量的标准化和归一化
step4: 去除干扰因素(多个样本整合)
step5: 判断重要的基因
step6: 多种降维算法
step7: 可视化降维结果
step8: 多种聚类算法
step9: 聚类后找每个细胞亚群的标志基因
step10: 继续分类

不过这个工具是否流行并不是本文要讨论发范围，本次复现的目标图表是该工具作者的作者博士论文《New protocols and computational tools for scRNAseq analysis》中的：

作者比较了2代和3代测序数据的单细胞转录组，希望可以区分不同的肺癌细胞系（但是对于三代数据还没有很好的处理方法。所以本次我们仅使用2代测序数据）：

数据仅仅是一个10x样品样本，包括5个肺腺癌细胞系混合测序的，数据链接是：https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126906

GSM3618014 5 human lung adenocarcinoma cell lines H2228, H1975, A549, H838 and HCC827

作者提供的GSE126906数据集中，包含GSM3618014这个样本：

下面我们开始复现：

在运行环境下自己手动创建一个GSM3618014文件夹，里面包含下面两个文件（自己在GEO数据库下载哦）：

step1：读取表达矩阵文件，然后构建Seurat对象

构建Seurat对象后保存为 sce_all.Rdata文件，供后续使用

rm(list = ls()) options(stringsAsFactors = F) library(scRNAseq) library(scater) library(ggplot2) library(data.table) barcode <- read.csv('./GSM3618014/GSE126906_barcodes.csv.gz') gene_count <- fread('./GSM3618014/GSM3618014_gene_count.csv.gz') head(barcode) gene_count[1:4,1:4]

可以看到count文件的列名并不是barcode需要使用barcode文件转换一下，并且基因名要从gene_id转换为symbol

值得注意的是，这个数据集因为不清楚其ensembl数据库的gene_id来源于哪个gtf版本，所以我们没有使用常规的id转换方法，而是使用了 biomaRt包的getBM函数，这个函数非常考验网络哦！

colnames(gene_count) <- barcode$barcode_sequence[match(colnames(gene_count),barcode$cell_name)] #需要转换基因名,gene_id转换为symbol #用了biomaRt包 library(biomaRt) library(tidyverse) mart <- useMart('ENSEMBL_MART_ENSEMBL','hsapiens_gene_ensembl') ensembl_id <-gene_count[,1] gene_symbol <- getBM(attributes = c('ensembl_gene_id','external_gene_name','hgnc_symbol'), filters = 'ensembl_gene_id', values = ensembl_id, mart = mart) ids <- gene_symbol[,c(1,3)] colnames(ids) <- c('gene_id','symbol') ids = ids[!duplicated(ids$symbol),] dat <- inner_join(gene_count,ids,by = 'gene_id') #开始形成表达矩阵啦 dat <- as.data.frame(dat) rownames(dat) <- dat$symbol raw.data <- dat[, -1] #有一些行名为空 raw.data = raw.data[rownames(raw.data) != '',]

现在我们就得到了 raw.data 变量里面存储的表达量矩阵啦！

接着使用raw.data构建Seurat对象

#接着使用raw.data构建Seurat对象 sce.all=CreateSeuratObject(counts = raw.data,) save(sce.all,file="sce_all.Rdata") #计算线粒体基因比例 sce.all=PercentageFeatureSet(sce.all, "^MT-", col.name = "percent_mito") fivenum(sce.all@meta.data$percent_mito) #计算核糖体基因比例 sce.all=PercentageFeatureSet(sce.all, "^RP[SL]", col.name = "percent_ribo") fivenum(sce.all@meta.data$percent_ribo) #计算红血细胞基因比例 sce.all=PercentageFeatureSet(sce.all, "^HB[^(P)]", col.name = "percent_hb") #过滤指标1:最少表达基因数的细胞&最少表达细胞数的基因 selected_c <- WhichCells(sce.all, expression = nFeature_RNA > 300) selected_f <- rownames(sce.all)[Matrix::rowSums(sce.all@assays$RNA@counts>0) > 3] sce.all.filt <- subset(sce.all, features = selected_f, cells = selected_c) #过滤指标2:线粒体/核糖体基因比例 selected_mito <- WhichCells(sce.all.filt, expression = percent_mito < 15) selected_ribo <- WhichCells(sce.all.filt, expression = percent_ribo > 3) selected_hb <- WhichCells(sce.all.filt, expression = percent_hb < 0.1) sce.all.filt <- subset(sce.all.filt, cells = selected_mito) sce.all.filt <- subset(sce.all.filt, cells = selected_ribo) sce.all.filt <- subset(sce.all.filt, cells = selected_hb)

step2：降维聚类分群

sce.all.filt = NormalizeData(sce.all.filt) sce.all.filt = FindVariableFeatures(sce.all.filt,nfeatures = 3000) sce.all.filt = ScaleData(sce.all.filt ) # 降维处理 sce.all.filt = RunPCA(sce.all.filt, npcs = 20) sce.all.filt = RunTSNE(sce.all.filt, npcs = 20) sce.all.filt = RunUMAP(sce.all.filt, dims = 1:10) dim(sce.all.filt) dim(sce.all) # 聚类分析 sce.all.filt <- FindNeighbors(sce.all.filt, dims = 1:10) sce.all.filt <- FindClusters(sce.all.filt, resolution = c(0.01,0.05,0.1,0.3,0.4,0.5,0.7)) p2_tree <- clustree(sce.all.filt@meta.data, prefix = "RNA_snn_res.") p2_tree

DimPlot(sce.all.filt, reduction = "umap", group.by = "RNA_snn_res.0.05",label = T)

文献里面提到了是5个不同细胞系，所以我们这里使用resolution = 0.05 进行后续分析即可，因为它已经是分成5个细胞亚群：

step3：细胞亚群的生物学注释

这个需要一定生物学背景知识，而且需要一定的检索能力，找到GSE160683_NSCLC_panel_RNAseq_FPKM_all.csv这个文件，其实去CCLE数据库也可以找到全部的肺癌相关细胞系表达量矩阵文件。

#####step3 细胞注释#### Ref=as.data.frame(data.table::fread("./GSM3618014/GSE160683_NSCLC_panel_RNAseq_FPKM_all.csv.gz"))

这里我们使用有60个人肺部细胞的数据集作为参考数据集，进行细胞生物学注释，数据链接在：https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE160683

60 human lung cell lines: 50 lung adenocarcinoma cell lines; 7 NSCLC (non small cell lung cancer, non-adenocarcinoma) cell lines; 3 non-transformed lung cell lines.

提取与文中相对应的细胞系；

target_CL=c("H2228|NCI-H1975|HCC827|H838|A549") grep(target_CL,colnames(Ref),value = T) Ref=Ref[!duplicated(Ref$geneName),] rownames(Ref)=Ref[,3] Ref=Ref[,grep(target_CL,colnames(Ref),value = T)] dim(Ref);head(Ref) # 首先使用 SingleCellExperiment 对参考细胞系表达量矩阵构建对象 # 然后使用 SingleR 依据参考细胞系进行映射 ref_sce=SingleCellExperiment::SingleCellExperiment(assays=list(counts=Ref)) ref_sce=scater::logNormCounts(ref_sce) logcounts(ref_sce)[1:4,1:4] colData(ref_sce)$Type=stringr::str_split(colnames(Ref),"_",simplify = T)[,1] ref_sce library(SingleR) testdata <- GetAssayData(sce.all.filt, slot="data") pred <- SingleR(test=testdata, ref=ref_sce, labels=ref_sce$Type, clusters = sce.all.filt$RNA_snn_res.0.05) table(pred$labels) head(pred) celltype <- data.frame(ClusterID = 0:4, celltype = c("A549","H838","H2228","HCC827","H1975")) sce.all.filt@meta.data$celltype = "NA" for(i in 1:nrow(celltype)){ sce.all.filt@meta.data[which(sce.all.filt@meta.data$RNA_snn_res.0.05 == celltype$ClusterID[i]) ,'celltype'] <- celltype$celltype[i]} DimPlot(sce.all.filt, reduction = "umap", group.by = "celltype",label = T) & NoLegend() ggsave('celltype_umap.png',width = 6,height = 6)

可以看到，效果非常好，很容易就把不同细胞系从前面的10x样品里面区分出来啦！

如果你也对10x单细胞转录组感兴趣，参考我们的《明码标价》专栏里面的单细胞内容

单细胞转录组数据分析的标准降维聚类分群，并且进行生物学注释后的结果。可以参考前面的例子：人人都能学会的单细胞聚类分群注释，我们演示了第一层次的分群。

如果你对单细胞数据分析还没有基础认知，可以看基础10讲：

写在文末

咱们现在这个专栏《100个单细胞转录组数据降维聚类分群图表复现》分享的代码是到此为止，但是一般来说单细胞文章数据分析还有很多进阶图表制作，比如inferCNV看肿瘤拷贝数变异，monocle看拟时序等等。如果你也需要，可以加入我们这个专栏《100个单细胞转录组数据降维聚类分群图表复现》创作团队，获取进阶指引哦！见：急！计划招募100个单细胞爱好者，免费学全套单细胞降维聚类分群和生物学亚群注释

5个肺癌细胞系单细胞混合测序也可以轻松拆分

下面我们开始复现：

step1：读取表达矩阵文件，然后构建Seurat对象

step2：降维聚类分群

step3：细胞亚群的生物学注释

写在文末

单细胞服务列表

相关推荐