DESeq2差异表达分析(二) / 开普饭

质量控制——样品水平

DESeq2工作流程的下一步是QC，它包括样本级和基因级的步骤，对计数数据执行QC检查，以帮助我们确保样本/重复看起来很好。

RNA-SEQ分析的一个有用的初始步骤是评估样本之间的总体相似性：

哪些样本彼此相似，哪些不同？
这是否符合实验设计的预期？
数据集中的主要变异来源是什么？

为了探索样本的相似性，我们将使用主成分分析(PCA)和层次聚类方法进行样本级质量控制。样本级的质量控制使我们能够看到我们的重复聚在一起有多好，以及观察我们的实验条件是否代表了数据中的主要变异源。执行样本级质量控制还可以识别任何样本异常值，这些异常值可能需要进一步研究，以确定是否需要在进行DE分析之前将其移除。

当使用这些无监督聚类方法时，计数的归一化和log2变换提高了可视化的距离/聚类。DESeq2使用中位数比率法进行计数归一化，并对样本级QC的归一化计数进行regularized log transform(rlog)，因为它缓和了平均值之间的方差，从而改善聚集性。

注意 : DESeq2 vignette 建议大型数据集(100个样本)使用variance-stabilizing transformation (VST)而不是rlog来转换计数，因为rlog函数运行时间可能太长，而且 vst() 函数运行速度更快，其属性与rlog相似。

PCA(Principal component analysis)

主成分分析(PCA)是一种用于强调数据集中的变异和产生强模式(降维)的技术。有关PCA的详细信息，请参阅我们的附加材料。(https://hbctraining.github.io/DGE_workshop_salmon/schedule/)

我们可以从DESeq2运行 rlog() 函数来归一化和rlog转换原始计数。然后，我们可以使用 plotPCA() 函数绘制前两个主成分。

# Transform counts for data visualization rld <- rlog(dds, blind=TRUE)


# Plot PCA

DESeq2::plotPCA(rld, intgroup = "group_id")

我们看到PC1上的样本与我们感兴趣的条件之间有很好的分离，这很好；这表明我们感兴趣的条件是数据集中最大的变异源。我们还看到PC2对样本进行了一些分隔；但是，由于缺乏额外的元数据可供探索，目前还不确定这可能是由于什么原因造成的。

Hierarchical clustering

与PCA类似，层次聚类是另一种互补的方法，用于识别数据集中的强模式和潜在的离群值。热图显示了数据集中所有样本成对组合的基因表达相关性。由于大多数基因没有差异表达，样本之间通常有很高的相关性(值高于0.80)。低于0.80的样品可能表示您的数据和/或样品污染中存在异常值。

层次树可以基于归一化的基因表达值来指示哪些样本彼此更相似。颜色块表示数据中的子结构，您可能会看到重复群集作为一个样本组的块。此外，我们预计会看到类似于PCA图中观察到的分组的样本群集。

# Extract the rlog matrix from the object and compute pairwise correlation values rld_mat <- assay(rld) rld_cor <- cor(rld_mat)

# Plot heatmap pheatmap(rld_cor, annotation = cluster_metadata[, c("group_id"), drop=F])

现在，我们确定是否有任何需要删除的异常值，或者我们可能想要在设计公式中回归的额外的变异源。由于我们没有通过PCA或层次聚类检测到异常值，也没有任何额外的变异源需要回归，所以我们可以继续运行差异表达分析。

Running DESeq2

使用DESeq2进行差异表达分析涉及多个步骤，如下面的蓝色流程图所示。简而言之，DESeq2将对原始计数进行建模，使用归一化因子(大小因子)来考虑库深度的差异。然后，它将估算基因离散度，并缩小这些估计值，以生成更准确的离散度估计值，从而对计数进行建模。最后，DESeq2将拟合负二项模型，并使用Wald检验或似然比检验进行假设检验。所有这些步骤都在我们的附加材料中进行了详细说明（https://hbctraining.github.io/DGE_workshop_salmon/schedule/）。

所有这些步骤都是通过在前面创建的DESeq2对象上运行单个DESeq()函数来执行的。

# Run DESeq2 differential expression analysis dds <- DESeq(dds)

我们可以通过查看离散度估计的曲线图来检查模型与我们的数据的匹配性。

# Plot dispersion estimates plotDispEsts(dds)

这个图结果很棒，因为我们预计我们的离散度将随着均值的增加而减小，并遵循最佳拟合线。

Results

既然我们已经执行了差异表达式分析，我们就可以查看特定比较的结果了。为了对感兴趣的比较，我们需要指定对比度并执行log2 fold changes。

让我们将实验组与对照组进行比较：

# Output results of Wald test for contrast for stim vs ctrl levels(cluster_metadata$group_id)[2] levels(cluster_metadata$group_id)[1]


contrast <- c("group_id", levels(cluster_metadata$group_id)[2], levels(cluster_metadata$group_id)[1])
# resultsNames(dds)

res <- results(dds, 

               contrast = contrast,

               alpha = 0.05)

res <- lfcShrink(dds, contrast = contrast, res=res)

我们将输出对我们重要的基因，并执行几种不同的可视化技术来探索我们的结果：

所有基因的结果表
显著基因结果表(padj<0.05)
top20最显著的基因归一化表达散点图
所有显著基因的热图
结果的火山图

所有基因的结果表

首先，让我们为所有结果生成结果表：

# Turn the results object into a tibble for use with tidyverse functions res_tbl <- res %>% data.frame() %>% rownames_to_column(var="gene") %>% as_tibble()


# Check results output

res_tbl

# Write all results to file write.csv(res_tbl, paste0("results/", clusters[1], "_", levels(cluster_metadata$sample)[2], "_vs_", levels(cluster_metadata$sample)[1], "_all_genes.csv"), quote = FALSE, row.names = FALSE)

显著基因结果表

接下来，我们可以使用p调整阈值0.05来筛选表中的重要基因

# Set thresholds padj_cutoff <- 0.05


# Subset the significant results

sig_res <- dplyr::filter(res_tbl, padj < padj_cutoff) %>%

        dplyr::arrange(padj)
# Check significant genes output

sig_res

# Write significant results to file write.csv(sig_res, paste0("results", clusters[1], "_", levels(cluster_metadata$sample)[2], "_vs_", levels(cluster_metadata$sample)[1], "_sig_genes.csv"), quote = FALSE, row.names = FALSE)

top20最显著的基因归一化表达散点图

既然我们已经确定了显著的基因，我们就可以画出前20个显著基因的散点图了。此图是一个很好的检查，以确保我们也正确地解释了fold change values。

## ggplot of top genes normalized_counts <- counts(dds, normalized = TRUE)


## Order results by padj values

top20_sig_genes <- sig_res %>%

        dplyr::arrange(padj) %>%

        dplyr::pull(gene) %>%

        head(n=20)
top20_sig_norm <- data.frame(normalized_counts) %>%

        rownames_to_column(var = "gene") %>%

        dplyr::filter(gene %in% top20_sig_genes)
gathered_top20_sig <- top20_sig_norm %>%

        gather(colnames(top20_sig_norm)[2:length(colnames(top20_sig_norm))], key = "samplename", value = "normalized_counts")

        

gathered_top20_sig <- inner_join(ei[, c("sample_id", "group_id" )], gathered_top20_sig, by = c("sample_id" = "samplename"))

## plot using ggplot2 ggplot(gathered_top20_sig) + geom_point(aes(x = gene, y = normalized_counts, color = group_id), position=position_jitter(w=0.1,h=0)) + scale_y_log10() + xlab("Genes") + ylab("log10 Normalized Counts") + ggtitle("Top 20 Significant DE Genes") + theme_bw() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + theme(plot.title = element_text(hjust = 0.5))

所有显著基因的热图

我们还可以利用热图探索显著基因的聚集性。

# Extract normalized counts for only the significant genes sig_norm <- data.frame(normalized_counts) %>% rownames_to_column(var = "gene") %>% dplyr::filter(gene %in% sig_res$gene) # Set a color palette heat_colors <- brewer.pal(6, "YlOrRd")

# Run pheatmap using the metadata data frame for the annotation pheatmap(sig_norm[ , 2:length(colnames(sig_norm))], color = heat_colors, cluster_rows = T, show_rownames = F, annotation = cluster_metadata[, c("group_id", "cluster_id")], border_color = NA, fontsize = 10, scale = "row", fontsize_row = 10, height = 20)

结果的火山图

## Obtain logical vector where TRUE values denote padj values < 0.05 and fold change > 1.5 in either direction res_table_thres <- res_tbl %>% mutate(threshold = padj < 0.05 & abs(log2FoldChange) >= 0.58) ## Volcano plot ggplot(res_table_thres) + geom_point(aes(x = log2FoldChange, y = -log10(padj), colour = threshold)) + ggtitle("Volcano plot of stimulated B cells relative to control") + xlab("log2 fold change") + ylab("-log10 adjusted p-value") + scale_y_continuous(limits = c(0,50)) + theme(legend.position = "none", plot.title = element_text(size = rel(1.5), hjust = 0.5), axis.title = element_text(size = rel(1.25)))

采用有效的脚本对多个不同细胞类型群集进行分析，可使用用于成对比较的Wald检验或用于多组比较的似然比检验 。

在所有细胞类型群集上运行DESeq2-Wald测试的脚本

下面的脚本将在所有细胞类型集群上运行DESeq2，同时使用Wald测试将感兴趣的条件的每个级别与所有其他级别进行对比。该脚本可以很容易地在集群上运行，以便快速高效地执行和存储结果。

dir.create("DESeq2") dir.create("DESeq2/pairwise")


# Function to run DESeq2 and get results for all clusters

## x is index of cluster in clusters vector on which to run function

## A is the sample group to compare

## B is the sample group to compare against (base level)
get_dds_resultsAvsB <- function(x, A, B){

        cluster_metadata <- metadata[which(metadata$cluster_id == clusters[x]), ]

        rownames(cluster_metadata) <- cluster_metadata$sample_id

        counts <- pb[[clusters[x]]]

        cluster_counts <- data.frame(counts[, which(colnames(counts) %in% rownames(cluster_metadata))])

        

        #all(rownames(cluster_metadata) == colnames(cluster_counts))        

        

        dds <- DESeqDataSetFromMatrix(cluster_counts, 

                                      colData = cluster_metadata, 

                                      design = ~ group_id)

        

        # Transform counts for data visualization

        rld <- rlog(dds, blind=TRUE)

        

        # Plot PCA

        

        DESeq2::plotPCA(rld, intgroup = "group_id")

        ggsave(paste0("results/", clusters[x], "_specific_PCAplot.png"))

        

        

        # Extract the rlog matrix from the object and compute pairwise correlation values

        rld_mat <- assay(rld)

        rld_cor <- cor(rld_mat)

        

        # Plot heatmap

        png(paste0("results/", clusters[x], "_specific_heatmap.png"))

        pheatmap(rld_cor, annotation = cluster_metadata[, c("group_id"), drop=F])

        dev.off()

        

        # Run DESeq2 differential expression analysis

        dds <- DESeq(dds)

        

        # Plot dispersion estimates

        png(paste0("results/", clusters[x], "_dispersion_plot.png"))

        plotDispEsts(dds)

        dev.off()
# Output results of Wald test for contrast for A vs B

        contrast <- c("group_id", levels(cluster_metadata$group_id)[A], levels(cluster_metadata$group_id)[B])

        

        # resultsNames(dds)

        res <- results(dds, 

                       contrast = contrast,

                       alpha = 0.05)

        

        res <- lfcShrink(dds, 

                         contrast =  contrast,

                         res=res)

        # Set thresholds

        padj_cutoff <- 0.05

        

        # Turn the results object into a tibble for use with tidyverse functions

        res_tbl <- res %>%

                data.frame() %>%

                rownames_to_column(var="gene") %>%

                as_tibble()

        

        write.csv(res_tbl,

                  paste0("DESeq2/pairwise/", clusters[x], "_", levels(cluster_metadata$group_id)[A], "_vs_", levels(cluster_metadata$group_id)[B], "_all_genes.csv"),

                  quote = FALSE, 

                  row.names = FALSE)

        

        # Subset the significant results

        sig_res <- dplyr::filter(res_tbl, padj < padj_cutoff) %>%

                dplyr::arrange(padj)

        

        write.csv(sig_res,

                  paste0("DESeq2/pairwise/", clusters[x], "_", levels(cluster_metadata$group_id)[A], "_vs_", levels(cluster_metadata$group_id)[B], "_sig_genes.csv"),

                  quote = FALSE, 

                  row.names = FALSE)

        

        ## ggplot of top genes

        normalized_counts <- counts(dds, 

                                    normalized = TRUE)

        

        ## Order results by padj values

        top20_sig_genes <- sig_res %>%

                dplyr::arrange(padj) %>%

                dplyr::pull(gene) %>%

                head(n=20)

        

        

        top20_sig_norm <- data.frame(normalized_counts) %>%

                rownames_to_column(var = "gene") %>%

                dplyr::filter(gene %in% top20_sig_genes)

        

        gathered_top20_sig <- top20_sig_norm %>%

                gather(colnames(top20_sig_norm)[2:length(colnames(top20_sig_norm))], key = "samplename", value = "normalized_counts")

        

        gathered_top20_sig <- inner_join(ei[, c("sample_id", "group_id" )], gathered_top20_sig, by = c("sample_id" = "samplename"))

        

        ## plot using ggplot2

        ggplot(gathered_top20_sig) +

                geom_point(aes(x = gene, 

                               y = normalized_counts, 

                               color = group_id), 

                           position=position_jitter(w=0.1,h=0)) +

                scale_y_log10() +

                xlab("Genes") +

                ylab("log10 Normalized Counts") +

                ggtitle("Top 20 Significant DE Genes") +

                theme_bw() +

                theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 

                theme(plot.title = element_text(hjust = 0.5))

        ggsave(paste0("DESeq2/pairwise/", clusters[x], "_", levels(cluster_metadata$group_id)[A], "_vs_", levels(cluster_metadata$group_id)[B], "_top20_DE_genes.png"))

        

}
# Run the script on all clusters comparing stim condition relative to control condition

map(1:length(clusters), get_dds_resultsAvsB, A = 2, B = 1)

Script to run DESeq2 on all cell type clusters - Likelihood Ratio Test

The following script will run the DESeq2 Likelihood Ratio Test (LRT) on all cell type clusters. This script can easily be run on the cluster for fast and efficient execution and storage of results.
# Likelihood ratio test

dir.create("DESeq2/lrt")
# Create DESeq2Dataset object

clusters <- levels(metadata$cluster_id)
metadata <- gg_df %>%

        select(cluster_id, sample_id, group_id)
metadata$group <- paste0(metadata$cluster_id, "_", metadata$group_id) %>%

        factor()
# DESeq2

library(DEGreport)

get_dds_LRTresults <- function(x){

        

        cluster_metadata <- metadata[which(metadata$cluster_id == clusters[x]), ]

        rownames(cluster_metadata) <- cluster_metadata$sample_id

        counts <- pb[[clusters[x]]]

        cluster_counts <- data.frame(counts[, which(colnames(counts) %in% rownames(cluster_metadata))])

        

        #all(rownames(cluster_metadata) == colnames(cluster_counts))        

        

        dds <- DESeqDataSetFromMatrix(cluster_counts, 

                                      colData = cluster_metadata, 

                                      design = ~ group_id)

        

        dds_lrt <- DESeq(dds, test="LRT", reduced = ~ 1)

        

        # Extract results

        res_LRT <- results(dds_lrt)

        

        # Create a tibble for LRT results

        res_LRT_tb <- res_LRT %>%

                data.frame() %>%

                rownames_to_column(var="gene") %>% 

                as_tibble()

        

        # Save all results

        write.csv(res_LRT_tb,

                  paste0("DESeq2/lrt/", clusters[x], "_LRT_all_genes.csv"),

                  quote = FALSE, 

                  row.names = FALSE)

        

        # Subset to return genes with padj < 0.05

        sigLRT_genes <- res_LRT_tb %>% 

                filter(padj < 0.05)

        

        # Save sig results

        write.csv(sigLRT_genes,

                  paste0("DESeq2/lrt/", clusters[x], "_LRT_sig_genes.csv"),

                  quote = FALSE, 

                  row.names = FALSE)

        

        # Transform counts for data visualization

        rld <- rlog(dds_lrt, blind=TRUE)

        

        # Extract the rlog matrix from the object and compute pairwise correlation values

        rld_mat <- assay(rld)

        rld_cor <- cor(rld_mat)

        

        

        # Obtain rlog values for those significant genes

        cluster_rlog <- rld_mat[sigLRT_genes$gene, ]

        

        cluster_meta_sig <- cluster_metadata[which(rownames(cluster_metadata) %in% colnames(cluster_rlog)), ]

        

        # # Remove samples without replicates

        # cluster_rlog <- cluster_rlog[, -1]

        # cluster_metadata <- cluster_metadata[which(rownames(cluster_metadata) %in% colnames(cluster_rlog)), ]

        

        

        # Use the `degPatterns` function from the 'DEGreport' package to show gene clusters across sample groups

        cluster_groups <- degPatterns(cluster_rlog, metadata = cluster_meta_sig, time = "group_id", col=NULL)

        ggsave(paste0("DESeq2/lrt/", clusters[x], "_LRT_DEgene_groups.png"))

        

        # Let's see what is stored in the `df` component

        write.csv(cluster_groups$df,

                  paste0("DESeq2/lrt/", clusters[x], "_LRT_DEgene_groups.csv"),

                  quote = FALSE, 

                  row.names = FALSE)

        

        saveRDS(cluster_groups, paste0("DESeq2/lrt/", clusters[x], "_LRT_DEgene_groups.rds"))

        save(dds_lrt, cluster_groups, res_LRT, sigLRT_genes, file = paste0("DESeq2/lrt/", clusters[x], "_all_LRTresults.Rdata"))

        

}

map(1:length(clusters), get_dds_LRTresults)

完!!!

注：以上内容来自哈佛大学生物信息中心(HBC)_的教学团队的生物信息学培训课程。原文链接：https://hbctraining.github.io/scRNA-seq/schedule/ 点击 “阅读原文” 可直达

看完记得顺手点个“在看”哦！

生物 | 单细胞 | 转录组丨资料

每天都精彩

DESeq2差异表达分析(二)