在单细胞转录组表达矩阵里面去除细胞周期影响

背景介绍

早在2015年发表在Nat. Biotechnol文章就提出了 scLVM (single-cell latent variable model)来在单细胞转录组数据里面去除细胞周期影响 但是 scLVM 仅仅是考虑 细胞周期直接相关基因,而且没有考虑细胞类型,其实不同类型的细胞哪怕是在同一个时间点的细胞周期状态,它们的细胞周期相关基因表达也是不同的。更重要的是,还有很多非直接细胞周期相关基因也需要考虑。

所以作者开发了ccRemover来去除单细胞转录组数据里面去除细胞周期影响,发表于2016年的SR

直接读作者的说明书即可学会使用它:https://cran.r-project.org/web/packages/ccRemover/vignettes/ccRemover_tutorial.html

使用R包

核心代码就一句话

可以看到主要就是使用ccRemover函数处理我们的单细胞表达矩阵哦

xhat <- ccRemover(dat, bar=FALSE)
xhat <- ccRemover(dat, cutoff = 3, max_it = 4, nboot = 200, ntop = 10, bar=FALSE)

可以简单处理,也可以根据理解,加上一系列的参数。 表达矩阵需要是归一化的。 下面我们就具体讲解。

表达矩阵的前期处理

首先安装并且加载包和测试数据

#BiocInstaller::biocLite('ccRemover')
library(ccRemover)
data(t.cell_data)
## 表达矩阵如下;
head(t.cell_data[,1:5])
##        Cell 1  Cell 2  Cell 3  Cell 4 Cell 5
## Gnai3 3.12560 0.86096 2.62610 3.25490 3.7152
## Cdc45 3.09290 0.11331 3.51750 0.47994 2.5712
## Narf  0.25414 0.00000 1.79130 0.00000 1.4729
## Klf6  1.66590 3.00130 1.92170 1.93360 4.1109
## Scmh1 0.25414 0.11331 0.00000 0.00000 1.7960
## Wnt3  0.52967 0.27746 0.61211 0.60523 3.5357
## 基因表达量在样本的平均值的汇总
summary(apply(t.cell_data,1, mean))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.0392  0.8990  1.5738  1.6202  2.2817  5.1372
## 样本的所有的基因的表达量的平均值的汇总
summary(apply(t.cell_data,2, mean))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.3997  1.3119  1.6681  1.6202  1.9348  2.7493
## 保留每个基因的所有样本表达量平均值以备后用
mean_gene_exp <- rowMeans(t.cell_data)
# 每个基因减去其平均值后的表达量矩阵
t_cell_data_cen <- t.cell_data - mean_gene_exp
## 接近于 0 了
summary(apply(t_cell_data_cen,1,mean))
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.
## -4.441e-16 -5.211e-17 -6.924e-19 -1.453e-18  4.661e-17  3.838e-16
gene_names <- rownames(t_cell_data_cen)
head(gene_names)
## [1] "Gnai3" "Cdc45" "Narf"  "Klf6"  "Scmh1" "Wnt3"
## gene_indexer函数会根据包内置的细胞周期相关基因来判断我们的表达矩阵的基因是否属于
cell_cycle_gene_indices <- gene_indexer(gene_names, species = "mouse",
                                       name_type = "symbols" )
## Invalid name type input. Switching to NULLNo name format input.
## Checking to see if match can be found:
## Best guess is  symbol  IDs
## 751  matches out of a possible  7073
if_cc <- rep(FALSE,nrow(t_cell_data_cen))
if_cc[cell_cycle_gene_indices] <- TRUE
summary(if_cc)
##    Mode   FALSE    TRUE
## logical    6322     751
## 构造表达矩阵以及其对应的基因是否属于细胞周期相关基因集
dat <- list(x=t_cell_data_cen, if_cc=if_cc)
xhat <- ccRemover(dat, bar=FALSE)
## 0.1061784  of genes are cell-cycle genes
## Iteration  1 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  4.764066 5.2458626  0.48179702   6.7167008
## PC2  1.899494 1.9400189  0.04052475   0.6834059
## PC3  1.307673 1.4149245  0.10725191   1.6660350
## PC4  1.134286 1.3889111  0.25462511   1.9876191
## PC5  1.108502 1.1296735  0.02117137   0.1670011
## PC6  1.198600 1.1571330 -0.04146651  -0.4241297
## PC7  1.076888 0.9899750 -0.08691289  -1.0550547
## PC8  1.045047 0.9508303 -0.09421625  -1.1941491
## PC9  1.209063 1.2256193  0.01655664   0.2071162
## PC10 1.167671 1.2846210  0.11695048   1.4982856
## The follow components are removed: 1
##
## Iteration  2 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  1.899494 1.9400189  0.04052475   0.8808403
## PC2  1.307673 1.4149245  0.10725191   1.5588938
## PC3  1.134286 1.3889111  0.25462511   1.8225974
## PC4  1.108502 1.1296735  0.02117137   0.1959952
## PC5  1.198600 1.1571330 -0.04146651  -0.4472453
## PC6  1.076888 0.9899750 -0.08691289  -1.2271217
## PC7  1.045047 0.9508303 -0.09421625  -1.3564926
## PC8  1.209063 1.2256193  0.01655664   0.2235150
## PC9  1.167671 1.2846210  0.11695048   1.4904740
## PC10 1.009986 0.9815659 -0.02841966  -0.4561793
## No more cell-cycle effect is detected.
xhat <- xhat + mean_gene_exp
# 最后可以把基因的平均值加回去

高级参数

xhat <- ccRemover(dat, cutoff = 3, max_it = 4, nboot = 200, ntop = 10, bar=FALSE)
## 0.1061784  of genes are cell-cycle genes
## Iteration  1 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  4.764066 5.2458626  0.48179702   7.0666399
## PC2  1.899494 1.9400189  0.04052475   0.7109023
## PC3  1.307673 1.4149245  0.10725191   1.4957446
## PC4  1.134286 1.3889111  0.25462511   1.9186261
## PC5  1.108502 1.1296735  0.02117137   0.1781067
## PC6  1.198600 1.1571330 -0.04146651  -0.4179298
## PC7  1.076888 0.9899750 -0.08691289  -1.1317599
## PC8  1.045047 0.9508303 -0.09421625  -1.3176625
## PC9  1.209063 1.2256193  0.01655664   0.2108930
## PC10 1.167671 1.2846210  0.11695048   1.4402412
## The follow components are removed: 1
##
## Iteration  2 ...
## Bootstrapping...The bootstrap results on the top 10 components are:      xn_load   xy_load   diff_load t_load_boot
## PC1  1.899494 1.9400189  0.04052475   0.8727071
## PC2  1.307673 1.4149245  0.10725191   1.6748962
## PC3  1.134286 1.3889111  0.25462511   1.9396499
## PC4  1.108502 1.1296735  0.02117137   0.1671651
## PC5  1.198600 1.1571330 -0.04146651  -0.4822679
## PC6  1.076888 0.9899750 -0.08691289  -1.2233668
## PC7  1.045047 0.9508303 -0.09421625  -1.4931038
## PC8  1.209063 1.2256193  0.01655664   0.2173142
## PC9  1.167671 1.2846210  0.11695048   1.5028994
## PC10 1.009986 0.9815659 -0.02841966  -0.3995873
## No more cell-cycle effect is detected.

建议直接看英文:

The 'cutoff’ is used to determine which of the effects are cell-cycle effects. The default and recommended value is 3, which roughly corresponds to a p-value of 0.01. For data sets which have very low levels of cell-cycle activity this value can be lowered to increase the detection of cell-cycle effects. Example 3 in the original manuscript was a case where a lower value of the cutoff was necessary.

The 'max.it’ value is the maximum number of iterations of the method. ccRemover will stop whenever it detects no more significant effects present in the data or it reaches its maximum number of iterations. The default value is 4 but we have found that for many data sets the cell-cycle effect will be effectively removed after 1 or 2 iterations.

The 'nboot’ value corresponds to the number of bootstrap repetitions carried out to test the significance of the components. Please refer to the methods section original manuscript for a detailed description of the estimation process. The default value is 200 and we have found this work effectively for most data sets.

The 'ntop’ parameter determines the number of principal components which are to be tested as cell-cycle effects upon each iteration. We have found that the default value of 10 works effectively for most data sets. However, for extremely diverse data sets within which there are expected to be many sources of variation this value could be raised so that all elements of the cell-cycle effect can be identified and removed.

作者在真实数据测试了,其中  human glioblastomas data 自己都可以重复一下:

文章链接:https://www.nature.com/articles/srep33892

既然可以去除细胞周期相关基因影响,那么同样是可以去除其它因素,同样的原理哦,试试看。

(0)

相关推荐