表达矩阵处理—表达QC(reads)

书籍翻译

好的书籍是人类进步的阶梯,但有些人却找不到优秀的阶梯,为此我们开设了书籍翻译这个栏目,作为你学习之路的指路明灯;分享国内外优秀书籍,弘扬分享精神,做一个知识的传播者。

希望大家能有所收获!

书籍翻译历史目录

引言—关于课程

scRNA-seq简介

⊙  scRNA-seq原始数据的质控

⊙  scRNA-seq数据处理—文件格式小结

⊙  scRNA-seq数据处理—demultiplexing

⊙  scRNA-seq数据的处理—STAR

⊙  scRNA-seq数据处理—Kallisto

⊙  scRNA-seq表达矩阵的构建

⊙  数据处理必备—R安装

⊙  数据处理基础—数据类型了解一下

⊙  数据处理基础—什么是整齐数据和Rich Data

⊙  数据处理基础—ggplot2了解一下

⊙  实例数据练习—Tabula Murlis

表达矩阵处理—表达质量的控制

7. 清理表达矩阵

7.2

表达QC(reads)

library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
reads <- read.table("tung/reads.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
head(reads[ , 1:3])
## NA19098.r1.A01 NA19098.r1.A02 NA19098.r1.A03
## ENSG00000237683 0 0 0
## ENSG00000187634 0 0 0
## ENSG00000188976 57 140 1
## ENSG00000187961 0 0 0
## ENSG00000187583 0 0 0
## ENSG00000187642 0 0 0
head(anno)
## individual replicate well batch sample_id
## 1 NA19098 r1 A01 NA19098.r1 NA19098.r1.A01
## 2 NA19098 r1 A02 NA19098.r1 NA19098.r1.A02
## 3 NA19098 r1 A03 NA19098.r1 NA19098.r1.A03
## 4 NA19098 r1 A04 NA19098.r1 NA19098.r1.A04
## 5 NA19098 r1 A05 NA19098.r1 NA19098.r1.A05
## 6 NA19098 r1 A06 NA19098.r1 NA19098.r1.A06
reads <- SingleCellExperiment(
assays = list(counts = as.matrix(reads)),
colData = anno
)
keep_feature <- rowSums(counts(reads) > 0) > 0
reads <- reads[keep_feature, ]
isSpike(reads, "ERCC") <- grepl("^ERCC-", rownames(reads))
isSpike(reads, "MT") <- rownames(reads) %in%
c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
"ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
"ENSG00000198695", "ENSG00000198712", "ENSG00000198804",
"ENSG00000198763", "ENSG00000228253", "ENSG00000198938",
"ENSG00000198840")
reads <- calculateQCMetrics(
reads,
feature_controls = list(
ERCC = isSpike(reads, "ERCC"),
MT = isSpike(reads, "MT")
)
)
hist(
reads$total_counts,
breaks = 100
)
abline(v = 1.3e6, col = "red")

filter_by_total_counts <- (reads$total_counts > 1.3e6)
table(filter_by_total_counts)
## filter_by_total_counts
## FALSE TRUE
## 180 684
hist(
reads$total_features,
breaks = 100
)
abline(v = 7000, col = "red")

filter_by_expr_features <- (reads$total_features > 7000)
table(filter_by_expr_features)
## filter_by_expr_features
## FALSE TRUE
## 116 748
plotPhenoData(
  reads,
  aes_string(
      x = "total_features",
      y = "pct_counts_MT",
      colour = "batch"
  )
)

plotPhenoData(
  reads,
  aes_string(
      x = "total_features",
      y = "pct_counts_ERCC",
      colour = "batch"
  )
)

filter_by_ERCC <-
reads$batch != "NA19098.r2" & reads$pct_counts_ERCC < 25
table(filter_by_ERCC)
## filter_by_ERCC
## FALSE TRUE
## 103 761
filter_by_MT <- reads$pct_counts_MT < 30
table(filter_by_MT)## filter_by_MT
## FALSE TRUE
## 18 846
reads$use <- (
# sufficient features (genes)
filter_by_expr_features &
# sufficient molecules counted
filter_by_total_counts &
# sufficient endogenous RNA
filter_by_ERCC &
# remove cells with unusual number of reads in MT genes
filter_by_MT
)
table(reads$use)
##
## FALSE TRUE
## 258 606
reads <- plotPCA(
  reads,
  size_by = "total_features",
  shape_by = "use",
  pca_data_input = "pdata",
  detect_outliers = TRUE,
  return_SCE = TRUE
)

table(reads$outlier)
##
## FALSE TRUE
## 756 108
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:scater':
##
## plotMDS
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
auto <- colnames(reads)[reads$outlier]
man <- colnames(reads)[!reads$use]
venn.diag <- vennCounts(
cbind(colnames(reads) %in% auto,
colnames(reads) %in% man)
)
vennDiagram(
venn.diag,
names = c("Automatic", "Manual"),
circle.col = c("blue", "green")
)

plotQC(reads, type = "highest-expression")

filter_genes <- apply(
counts(reads[, colData(reads)$use]),
1,
function(x) length(x[x > 1]) >= 2
)
rowData(reads)$use <- filter_genes
table(filter_genes)
## filter_genes
## FALSE TRUE
## 2664 16062
dim(reads[rowData(reads)$use, colData(reads)$use])
## [1] 16062 606
assay(reads, "logcounts_raw") <- log2(counts(reads) + 1)
reducedDim(reads) <- NULL
saveRDS(reads, file = "tung/reads.rds")

通过比较图7.6和图7.13,很明显基于read的过滤比基于UMI的分析去除了更多的细胞。如果您返回并比较结果,您应该能够得出结论,ERCC和MT过滤器对于基于read的分析更严格。

sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.19.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 methods stats graphics grDevices utils
## [8] datasets base
##
## other attached packages:
## [1] limma_3.34.9 scater_1.6.3
## [3] ggplot2_2.2.1 SingleCellExperiment_1.0.0
## [5] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
## [7] matrixStats_0.53.1 Biobase_2.38.0
## [9] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0
## [11] IRanges_2.12.0 S4Vectors_0.16.0
## [13] BiocGenerics_0.24.0 knitr_1.20
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.2 plyr_1.8.4 lazyeval_0.2.1
## [4] sp_1.2-7 shinydashboard_0.6.1 splines_3.4.3
## [7] digest_0.6.15 htmltools_0.3.6 viridis_0.5.0
## [10] magrittr_1.5 memoise_1.1.0 cluster_2.0.6
## [13] prettyunits_1.0.2 colorspace_1.3-2 blob_1.1.0
## [16] rrcov_1.4-3 xfun_0.1 dplyr_0.7.4
## [19] RCurl_1.95-4.10 tximport_1.6.0 lme4_1.1-15
## [22] bindr_0.1 zoo_1.8-1 glue_1.2.0
## [25] gtable_0.2.0 zlibbioc_1.24.0 XVector_0.18.0
## [28] MatrixModels_0.4-1 car_2.1-6 kernlab_0.9-25
## [31] prabclus_2.2-6 DEoptimR_1.0-8 SparseM_1.77
## [34] VIM_4.7.0 scales_0.5.0 sgeostat_1.0-27
## [37] mvtnorm_1.0-7 DBI_0.7 GGally_1.3.2
## [40] edgeR_3.20.9 Rcpp_0.12.15 sROC_0.1-2
## [43] viridisLite_0.3.0 xtable_1.8-2 progress_1.1.2
## [46] laeken_0.4.6 bit_1.1-12 mclust_5.4
## [49] vcd_1.4-4 httr_1.3.1 RColorBrewer_1.1-2
## [52] fpc_2.1-11 modeltools_0.2-21 pkgconfig_2.0.1
## [55] reshape_0.8.7 XML_3.98-1.10 flexmix_2.3-14
## [58] nnet_7.3-12 locfit_1.5-9.1 labeling_0.3
## [61] rlang_0.2.0 reshape2_1.4.3 AnnotationDbi_1.40.0
## [64] munsell_0.4.3 tools_3.4.3 RSQLite_2.0
## [67] pls_2.6-0 evaluate_0.10.1 stringr_1.3.0
## [70] cvTools_0.3.2 yaml_2.1.17 bit64_0.9-7
## [73] robustbase_0.92-8 bindrcpp_0.2 nlme_3.1-129
## [76] mime_0.5 quantreg_5.35 biomaRt_2.34.2
## [79] compiler_3.4.3 pbkrtest_0.4-7 beeswarm_0.2.3
## [82] e1071_1.6-8 tibble_1.4.2 robCompositions_2.0.6
## [85] pcaPP_1.9-73 stringi_1.1.6 highr_0.6
## [88] lattice_0.20-34 trimcluster_0.1-2 Matrix_1.2-7.1
## [91] nloptr_1.0.4 pillar_1.2.1 lmtest_0.9-35
## [94] data.table_1.10.4-3 cowplot_0.9.2 bitops_1.0-6
## [97] httpuv_1.3.6.1 R6_2.2.2 bookdown_0.7
## [100] gridExtra_2.3 vipor_0.4.5 boot_1.3-18
## [103] MASS_7.3-45 assertthat_0.2.0 rhdf5_2.22.0
## [106] rprojroot_1.3-2 rjson_0.2.15 GenomeInfoDbData_1.0.0
## [109] diptest_0.75-7 mgcv_1.8-23 grid_3.4.3
## [112] class_7.3-14 minqa_1.2.4 rmarkdown_1.8
## [115] mvoutlier_2.0.9 shiny_1.0.5 ggbeeswarm_0.6.0

(0)

相关推荐

  • GDCRNATools|ceRNA套路终结者Part1

    欢迎来到医科研,这里是白介素2的读书笔记,跟我一起聊临床与科研的故事, 生物医学数据挖掘,R语言,TCGA.GEO数据挖掘. 1 GDCRNATools-TCGA数据挖掘 1.1 基本介绍 TCGA ...

  • 使用R语言做极大似然估计实例

    原文链接:http://tecdat.cn/?p=18970 在普遍的理解中,最大似然估计是使用已知的样本结果信息来反向推断最有可能导致这些样本结果的模型参数值! 换句话说,最大似然估计提供了一种在给 ...

  • 表达矩阵处理—表达质量的控制

    书籍翻译 好的书籍是人类进步的阶梯,但有些人却找不到优秀的阶梯,为此我们开设了书籍翻译这个栏目,作为你学习之路的指路明灯:分享国内外优秀书籍,弘扬分享精神,做一个知识的传播者. 希望大家能有所收获! ...

  • 表达矩阵可视化大全

    貌代码被折叠了,大家需要阅读原文才能复制粘贴我代码在Rstudio里面直接运行,几分钟就可以学会15个图的制作! basic visualization for expression matrix j ...

  • lncRNA实战项目-第四步-得到表达矩阵的流程

    响应生信技能树的号召:lncRNA数据分析传送门 , 一起来一个lncRNA数据分析实战! SRA->FASTQ->BAM->COUNTS 这几个步骤而已,中间穿插一些质控的手段,每 ...

  • 从GEO数据库下载得到表达矩阵 一文就够

    在第一讲我们详细介绍了GEO数据库的基础知识及规律,也了解了如何利用官方R包GEOquery来探索GEO数据库,当然,我的生信菜鸟团博客里面也从很多其它角度解析过它,欢迎大家自行搜索学习.总得来说,从 ...

  • 使用CGP数据库的表达矩阵进行药物反应预测

    发表这个算法的文章是:Clinical drug response can be predicted using baseline gene expression levels and in vitr ...

  • 要读源代码才能解决的报错-GEOquery下载表达矩阵缺样本名

    最近生信技能树的很多朋友反馈一个GEOquery的bug,而且这个错误对初学者来说,是不可能解决的问题,值得分享一下!(2018-11-27 计) 就是昨天推文末尾的小测试: GEOquery包的ge ...

  • 如果表达矩阵被归一化会发生什么

    事情起源于,某个吃了很多汉堡王一起学习的日子,技能树一众学徒一起学习从GEO数据挖掘到limma差异分析等等等. 选的数据集倒是很有.趣.依据原始表达矩阵随便画的热图是这样的

  • 表达矩阵的归一化和标准化,去除极端值,异常值

    我们阅读量破万的综述:RNA-seq这十年(3万字长文综述)给粉丝朋友们带来了很多理解上的挑战,所以我们开辟专栏慢慢介绍其中的一些概念性的问题,上一期: RNA-seq的counts值,RPM, RP ...

  • 转录组表达矩阵为什么需要主成分分析以及怎么做

    我们阅读量破万的综述:RNA-seq这十年(3万字长文综述)给粉丝朋友们带来了很多理解上的挑战: 所以我们开辟专栏慢慢介绍其中的一些概念性的问题,上一期:表达矩阵的归一化和标准化,去除极端值,异常值 ...