前言

前面我们的教程讲到了，自己取全部的上皮细胞，以及部分Fibroblasts和Endothelial_cells细胞来一起运行inferCNV流程，但是得到的结果很诡异，明明是作为二倍体正常细胞参考集的Fibroblasts和Endothelial_cells细胞居然也是在某些染色体上面有明显的CNV情况。为了解决这个问题，让我们一起看看文献自己的inferCNV流程是如何使用的，以及对应的数据集。

首先运行作者自己的代码和数据

那，我们就看看作者自己的代码和数据吧，运行他们的inferCNV流程，看看我们的差异究竟是在哪了？

我注意到文章的脚本里面有这样的一句话：

Save all inferCNV files and run inferCNV in previous version of R

看了看作者准备的3个文件，如下：

183K Aug 30 11:33 NI03_CNV_cell_metadata_shuffle_largefile.txt 544M Aug 30 12:02 NI03_CNV_data_out_all_cells_raw_counts_largefile.txt 671K Aug 30 11:31 NI03_CNV_hg19_genes_ordered_correct_noXY.txt

上面的3个文件作者的制作方式，跟我的大同小异，就不过多介绍啦，然后运行作者的inferCNV代码，如下；

library(infercnv) infercnv_obj = CreateInfercnvObject(raw_counts_matrix = "NI03_CNV_data_out_all_cells_raw_counts_largefile.txt", annotations_file = "NI03_CNV_cell_metadata_shuffle_largefile.txt", gene_order_file = "NI03_CNV_hg19_genes_ordered_correct_noXY.txt", ref_group_names = c("endothelial_normal", "fibroblast_normal"), delim = "\t") # Make sure that chrmosomes are ordered correctly slot(infercnv_obj, "gene_order")[,"chr"] <- factor(slot(infercnv_obj, "gene_order")[,"chr"], levels = c("chr1", "chr2","chr3","chr4", "chr5", "chr6","chr7", "chr8", "chr9","chr10", "chr11", "chr12","chr13", "chr14", "chr15","chr16", "chr17", "chr18","chr19", "chr20", "chr21","chr22")) # Run infer CNV infercnv_all = infercnv::run(infercnv_obj, cutoff=1, # use 1 for smart-seq, 0.1 for 10x-genomics out_dir= "myresults", # dir is auto-created for storing outputs cluster_by_groups=F, # cluster hclust_method="ward.D2", plot_steps=F)

我仔细看了看作者运行inferCNV的代码，差异真的很小，其中cluster_by_groups这个参数仅仅是可视化的选项，不会影响重要的结论。而hclust_method通常呢，影响细胞之间的距离，按照道理并不影响CNV，那么应该是我前面的那些其它参数导致的。

让我们看看这个函数的默认参数：

run(infercnv_obj, cutoff = 1, min_cells_per_gene = 3, out_dir = NULL, window_length = 101, smooth_method = c("pyramidinal", "runmeans", "coordinates"), num_ref_groups = NULL, ref_subtract_use_mean_bounds = TRUE, cluster_by_groups = FALSE, cluster_references = TRUE, k_obs_groups = 1, hclust_method = "ward.D2", max_centered_threshold = 3, scale_data = FALSE, HMM = FALSE, HMM_transition_prob = 1e-06, HMM_report_by = c("subcluster", "consensus", "cell"), HMM_type = c("i6", "i3"), HMM_i3_pval = 0.05, HMM_i3_use_KS = TRUE, BayesMaxPNormal = 0.5, sim_method = "meanvar", sim_foreground = FALSE, reassignCNVs = TRUE, analysis_mode = c("samples", "subclusters", "cells"), tumor_subcluster_partition_method = c("random_trees", "qnorm", "pheight", "qgamma", "shc"), tumor_subcluster_pval = 0.1, denoise = FALSE, noise_filter = NA, sd_amplifier = 1.5, noise_logistic = FALSE, outlier_method_bound = "average_bound", outlier_lower_bound = NA, outlier_upper_bound = NA, final_scale_limits = NULL, final_center_val = NULL, debug = FALSE, num_threads = 4, plot_steps = FALSE, resume_mode = TRUE, png_res = 300, plot_probabilities = TRUE, save_rds = TRUE, save_final_rds = TRUE, diagnostics = FALSE, remove_genes_at_chr_ends = FALSE, prune_outliers = FALSE, mask_nonDE_genes = FALSE, mask_nonDE_pval = 0.05, test.use = "wilcoxon", require_DE_all_normals = "any", hspike_aggregate_normals = FALSE, no_plot = FALSE, no_prelim_plot = FALSE, output_format = "png", useRaster = TRUE, up_to_step = 100)

多到让人头皮发麻！

其中文献运行infercnv::run的时候，下面两个参数，都是默认值：

HMM参数 when set to True, runs HMM to predict CNV level (default: FALSE) denoise If True, turns on denoising according to options below (default: FALSE)

而我运行的时候，把这两个参数都设置为了T，运行该文献他自己的数据集和文献代码后，运行的日志文件如下所示：

INFO [2020-10-19 11:17:44] ::process_data:Start INFO [2020-10-19 11:17:44] Creating output path myresults INFO [2020-10-19 11:17:44] Checking for saved results. INFO [2020-10-19 11:17:44]


STEP 1: incoming data
INFO [2020-10-19 11:18:19]
STEP 02: Removing lowly expressed genes
INFO [2020-10-19 11:18:19] ::above_min_mean_expr_cutoff:Start

INFO [2020-10-19 11:18:19] Removing 5929 genes from matrix as below mean expr threshold: 1

INFO [2020-10-19 11:18:20] validating infercnv_obj

INFO [2020-10-19 11:18:20] There are 14467 genes and 7181 cells remaining in the expr matrix.

INFO [2020-10-19 11:18:24] no genes removed due to min cells/gene filter

INFO [2020-10-19 12:19:51] plot_cnv_references:Start INFO [2020-10-19 12:19:51] Reference data size: Cells= 1000 Genes= 14467 INFO [2020-10-19 12:20:07] plot_cnv_references:Number reference groups= 2 INFO [2020-10-19 12:20:07] plot_cnv_references:Plotting heatmap. INFO [2020-10-19 12:20:10] Colors for breaks: #00008B,#24249B,#4848AB,#6D6DBC,#9191CC,#B6B6DD,#DADAEE,#FFFFFF,#EEDADA,#DDB6B6,#CC9191,#BC6D6D,#AB4848,#9B2424,#8B0000 INFO [2020-10-19 12:20:10] Quantiles of plotted data range: 0.544327010335531,0.938892738431902,1,1.06175861606737,1.53099452887365 INFO [2020-10-19 12:20:10] plot_cnv_references:Writing reference data to myresults/infercnv.references.txt

耗时约1个小时（主要的时间花在了第15步），关键问题是，他得到的CNV非常漂亮。也就是说如果不考虑数据集的差异，这个时候得到的结论是HMM参数和 denoise参数都应该是默认值才行啊。

然后运行我的代码在作者的数据

跟上一讲我们的代码大同小异，如下：

rm(list = ls())


dat=read.table('NI03_CNV_data_out_all_cells_raw_counts_largefile.txt',

               header = T,sep = '\t')

dim(dat)

library(AnnoProbe)

geneInfor=annoGene(rownames(dat),"SYMBOL",'human')

colnames(geneInfor)

geneInfor=geneInfor[with(geneInfor, order(chr, start)),c(1,4:6)]

geneInfor=geneInfor[!duplicated(geneInfor[,1]),]

length(unique(geneInfor[,1]))

head(geneInfor)

## 这里可以去除性染色体

# 也可以把染色体排序方式改变

dat=dat[rownames(dat) %in% geneInfor[,1],]

dat=dat[match( geneInfor[,1], rownames(dat) ),] 

dim(dat)
groupFiles='groupFiles.txt'

groupinfo=read.table('NI03_CNV_cell_metadata_shuffle_largefile.txt',header = F,sep = '\t')

table(groupinfo$V2)

dim(groupinfo)

head(groupinfo)

table(groupinfo$V1 %in% colnames(dat))

write.table(groupinfo,file = groupFiles,sep = '\t',quote = F,col.names = F,row.names = F)

dat=dat[, colnames(dat) %in% groupinfo$V1]
expFile='expFile.txt'

write.table(dat,file = expFile,sep = '\t',quote = F)
head(geneInfor)

geneFile='geneFile.txt'

write.table(geneInfor,file = geneFile,sep = '\t',quote = F,col.names = F,row.names = F)

infercnv_obj = CreateInfercnvObject(raw_counts_matrix=expFile, annotations_file=groupFiles, delim="\t", gene_order_file= geneFile, ref_group_names = c("endothelial_normal", "fibroblast_normal") ) infercnv_obj = infercnv::run(infercnv_obj, cutoff=1, # cutoff=1 works well for Smart-seq2, and cutoff=0.1 works well for 10x Genomics out_dir='jimmy_results', cluster_by_groups=TRUE, denoise=TRUE, HMM=TRUE)

这个时候，我的时间主要是花费在了第STEP 18: Run Bayesian Network Model on HMM predicted CNV's

INFO [2020-10-20 09:25:35] Creating the following Directory: jimmy_results/BayesNetOutput.HMMi6.hmm_mode-samples INFO [2020-10-20 09:25:35] Initializing new MCM InferCNV Object. INFO [2020-10-20 09:25:35] validating infercnv_obj INFO [2020-10-20 09:25:36] Total CNV's: 1230 INFO [2020-10-20 09:25:36] Loading BUGS Model. INFO [2020-10-20 09:25:38] Running Sampling Using Parallel with 4 Cores

中间调用了我MAC电脑的4个核心去计算，值得一提的是，因为等待时间过长，经常出现错误！！！，如下所示：

INFO [2020-10-19 15:46:13] Initializing new MCM InferCNV Object. INFO [2020-10-19 15:46:13] validating infercnv_obj INFO [2020-10-19 15:46:14] Total CNV's: 1239 INFO [2020-10-19 15:46:14] Loading BUGS Model. INFO [2020-10-19 15:46:16] Running Sampling Using Parallel with 4 Cores INFO [2020-10-19 18:21:03] Obtaining probabilities post-sampling Error in do.call(rbind, mcmc[[j]]) : second argument must be a list In addition: Warning message: In parallel::mclapply(seq_along(obj@cell_gene), FUN = par_func, : scheduled cores 1, 2, 3, 4 did not deliver results, all values of the jobs will be affected

运行了6次，都失败，让我很恼火，差不多的数据和代码，为什么我自己运行十多分钟即可，文章的这个需要十几个小时。

最后

后来我仔细比较了，发现自己的数据里面，是因为 366 genes and 7044 cells ，得到是CNV数量太少了（第18步写的是：Total CNV's: 31 ）计算量比较小，所以十几分钟就结束了。

但是文章的这个数据集呢， Total CNV's: 1229 太多了，耗费计算时间和资源有点过分了。这个数据量：14869 genes and 7181 cells 其实不能选择 denoise=TRUE以及HMM=TRUE，都应该是用默认的FALSE即可。

所以我真正需要比较的是，为什么我自己运行inferCNV的时候的输入数据跟作者的差异这么大！！！

咱们明明都是取全部的上皮细胞，以及部分Fibroblasts和Endothelial_cells细胞来一起运行inferCNV流程啊！！！

CNS图表复现14—检查文献的inferCNV流程

前言

首先运行作者自己的代码和数据

然后运行我的代码在作者的数据

最后

相关推荐