甲基化分析实战,将你的数据用在刀刃上!

2020年4分+甲基化生信文章复现

领悟生信文章逻辑之美,小伙伴们大家好呀,我是风间琉璃。上一周我们品读了2020年10月发表在《Frontiers in oncology》杂志上的文章“A Novel Promoter CpG-Based Signature for Long-Term Survival Prediction of Breast Cancer Patients”。作者使用CHAMP包进行筛选差异的甲基化位点。还没看上一期推文的同学们请看一遍,看过的同学可以复习一遍。“2020年了!!这样的文章也能发近5分!”。因为接下来我们会使用CHAMP包来适当复现文章中的流程。很精彩,不要错过喔~

期刊信息

下载数据

首先我们需要下载相关的甲基化数据以及表型(phenotype)数据。进入xena官网(https://xena.ucsc.edu/),选择数据集进行下载。如果不想自己找的同学没关系,我提供了下载地址。乳腺癌甲基化数据集下载链接(https://xenabrowser.net/datapages/?dataset=TCGA-BRCA.methylation450.tsv&host=https%3A%2F%2Fgdc.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443)。表型数据下载链接(https://xenabrowser.net/datapages/?dataset=TCGA-BRCA.GDC_phenotype.tsv&host=https%3A%2F%2Fgdc.xenahubs.net&removeHub=https%3A%2F%2Fxena.treehouse.gi.ucsc.edu%3A443)。

数据读取及数据清洗

在读取之前注意,我们R语言的工作路径一定要和我们文件的目录一致,才能进行以下步骤。

#加载数据library(tidyverse)data_beta=read_tsv(file = 'TCGA-BRCA.methylation450.tsv')data_samp=read_tsv(file = 'phenotype.tsv')

读取数据之后,我们需要对数据进行初步清洗。

###查看甲基化数据和表型数据的维度和数据格式dim(data_beta)## [1] 485577    891dim(data_samp)## [1] 1284  140class(data_beta)## [1] 'spec_tbl_df' 'tbl_df'      'tbl'         'data.frame'class(data_samp)## [1] 'spec_tbl_df' 'tbl_df'      'tbl'         'data.frame'######提取肿瘤和正常样本的信息table(data_samp$sample_type.samples)## ##          Metastatic       Primary Tumor Solid Tissue Normal ##                   7                1114                 162##我们只要'Primary Tumor'和'Solid Tissue Normal'tissue=c('Primary Tumor','Solid Tissue Normal')data_samp=data_samp[data_samp$sample_type.samples%in%tissue,]###提取病人编号和样本信息pdata=data_samp[,c('submitter_id.samples' ,'sample_type.samples')]###接下来具有甲基化数据和表型数据之间需要对病人信息取交集ID=intersect(pdata$submitter_id.samples, colnames(data_beta))##一共885个样本,文章提示一共为890个样本,疾病组794个,正常组96个。看一下我们##自己的,正常样本96没问题,肿瘤组织789个,少了5个,关系不大。#提取具有共同交集的sample信息pdata=pdata[pdata$submitter_id.samples%in%ID,] #查看整理好的表型数据dim(pdata)## [1] 885   2#更改一下表型数据的列名,并把结果输出为data_samp数据names(pdata) <- c('sample_name','sample_group')data_samp <- pdata#提取具有共同交集的beta矩阵names(data_beta)[1]='CpG'data_beta=column_to_rownames(data_beta,'CpG') data_beta=data_beta[,ID]####清洁数据不易,记得保存save(data_beta,data_samp,file = 'methylation.rds')

接下来,我们需要开始将清洁数据导入CHAMP包中。

数据导入

#加载数据和包library(ChAMP)library(tidyverse)## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4## ✓ tibble 3.0.4 ✓ dplyr 1.0.2## ✓ tidyr 1.1.2 ✓ stringr 1.4.0## ✓ readr 1.4.0 ✓ forcats 0.5.0## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──## x purrr::accumulate() masks foreach::accumulate()## x dplyr::collapse() masks Biostrings::collapse(), IRanges::collapse()## x dplyr::combine() masks minfi::combine(), Biobase::combine(), BiocGenerics::combine()## x purrr::compact() masks XVector::compact()## x dplyr::count() masks matrixStats::count()## x dplyr::desc() masks IRanges::desc()## x tidyr::expand() masks S4Vectors::expand()## x dplyr::filter() masks stats::filter()## x dplyr::first() masks S4Vectors::first()## x dplyr::lag() masks stats::lag()## x purrr::none() masks locfit::none()## x ggplot2::Position() masks BiocGenerics::Position(), base::Position()## x purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()## x dplyr::rename() masks S4Vectors::rename()## x purrr::simplify() masks DelayedArray::simplify()## x dplyr::slice() masks XVector::slice(), IRanges::slice()## x purrr::when() masks foreach::when()load('methylation.rds')##增加一列Sample_Group,肿瘤组织为T,正常组织为C。方便后续差异分析data_samp$Sample_Group <- if_else(data_samp$sample_group=='Primary Tumor', data_samp$Sample_Group <- 'T', data_samp$Sample_Group <- 'C')data_samp <- as.data.frame(data_samp)###beta矩阵的处理##有以下几点,1.保持beta矩阵的列名和表型数据的样本名一致,这很重要!##2.beta矩阵里不能有NA##3.beta矩阵中最小值如果为0,可以加上0.00001防止报错##4.beta矩阵必须为矩阵数据格式data_order=data_beta[,data_samp$sample_name]data_order=as.matrix(data_order)sum(is.na(data_order))## [1] 0data_order=data_order+0.00001###CHAMP包导入myLoad=champ.filter(beta = data_order,pd = data_samp)## [===========================]## [<<<< ChAMP.FILTER START >>>>>]## -----------------------------## ## In New version ChAMP, champ.filter() function has been set to do filtering on the result of champ.import(). You can use champ.import() + champ.filter() to do Data Loading, or set 'method' parameter in champ.load() as 'ChAMP' to get the same effect.## ## This function is provided for user need to do filtering on some beta (or M) matrix, which contained most filtering system in champ.load except beadcount. User need to input beta matrix, pd file themselves. If you want to do filterintg on detP matrix and Bead Count, you also need to input a detected P matrix and Bead Count information.## ## Note that if you want to filter more data matrix, say beta, M, intensity... please make sure they have exactly the same rownames and colnames.## ## ## [ Section 1: Check Input Start ]## You have inputed beta for Analysis.## ## pd file provided, checking if it's in accord with Data Matrix...## !!! Your pd file does not have Sample_Name column, we can not check your Sample_Name, please make sure the pd file is correct.## ## Parameter filterDetP is TRUE, checking if detP in accord with Data Matrix...## !!! Parameter detP is not found, filterDetP is reset FALSE now.## ## Parameter filterBeads is TRUE, checking if beadcount in accord with Data Matrix...## !!! Parameter beadcount is not found, filterBeads is reset FALSE now.## ## parameter autoimpute is TRUE. Checking if the conditions are fulfilled...## !!! ProbeCutoff is 0, which means you have no needs to do imputation. autoimpute has been reset FALSE.## ## Checking Finished :filterMultiHit,filterSNPs,filterNoCG,filterXY would be done on beta.## [ Section 1: Check Input Done ]## ## ## [ Section 2: Filtering Start >>## ## Filtering NoCG Start## Only Keep CpGs, removing 3156 probes from the analysis.## ## Filtering SNPs Start## Using general 450K SNP list for filtering.## Filtering probes with SNPs as identified in Zhou's Nucleic Acids Research Paper 2016.## Removing 59901 probes from the analysis.## ## Filtering MultiHit Start## Filtering probes that align to multiple locations as identified in Nordlund et al## Removing 11 probes from the analysis.## ## Filtering XY Start## Filtering probes located on X,Y chromosome, removing 10028 probes from the analysis.## ## Updating PD file## filterDetP parameter is FALSE, so no Sample Would be removed.## ## Fixing Outliers Start## Replacing all value smaller/equal to 0 with smallest positive value.## Replacing all value greater/equal to 1 with largest value below 1..## [ Section 2: Filtering Done ]## ## All filterings are Done, now you have 412481 probes and 885 samples.## [<<<<< ChAMP.FILTER END >>>>>>]## [===========================]## [You may want to process champ.QC() next.]##成功导入后及时保存#save(myLoad,file='meth_load.rds')###另外800多个病人,速度会比较慢。。。。# ##标准化,我们使用文章中使用的BMIQ法进行标准化。core值根据自己分析环境设置# myNorm <- champ.norm(beta=myLoad$beta,arraytype='450K',cores=100)# ##同样及时保存,如果是自己的电脑就不要跑这一步了,会崩的,所以我就不演示了# save(myNorm,file='meth_norm.rds')# #标准化后得到新的数据##标准化后数据质控#champ.QC()####甲基化差异位点分析myDMP <- champ.DMP(beta = myLoad$beta ,pheno=myLoad$pd$Sample_Group)## [===========================]## [<<<<< ChAMP.DMP START >>>>>]## -----------------------------## !!! Important !!! New Modification has been made on champ.DMP():## (1): In this version champ.DMP() if your pheno parameter contains more than two groups of phenotypes, champ.DMP() would do pairewise differential methylated analysis between each pair of them. But you can also specify compare.group to only do comparasion between any two of them.## (2): champ.DMP() now support numeric as pheno, and will do linear regression on them. So covariates like age could be inputted in this function. You need to make sure your inputted 'pheno' parameter is 'numeric' type.## --------------------------------## ## [ Section 1: Check Input Pheno Start ]## You pheno is character type.## Your pheno information contains following groups. >>## <T>:789 samples.## <C>:96 samples.## [The power of statistics analysis on groups contain very few samples may not strong.]## pheno contains only 2 phenotypes## compare.group parameter is NULL, two pheno types will be added into Compare List.## T_to_C compare group : T, C## ## [ Section 1: Check Input Pheno Done ]## ## [ Section 2: Find Differential Methylated CpGs Start ]## -----------------------------## Start to Compare : T, C## Contrast Matrix## Contrasts## Levels pT-pC## pC -1## pT 1## You have found 241364 significant MVPs with a BH adjusted P-value below 0.05.## Calculate DMP for T and C done.## ## [ Section 2: Find Numeric Vector Related CpGs Done ]## ## [ Section 3: Match Annotation Start ]## ## [ Section 3: Match Annotation Done ]## [<<<<<< ChAMP.DMP END >>>>>>]## [===========================]## [You may want to process DMP.GUI() or champ.GSEA() next.]#查看我们的分析结果head(myDMP[[1]])## logFC AveExpr t P.Value adj.P.Val B## cg19533977 -0.3539830 0.1261480 -39.40782 8.157138e-197 3.364664e-191 439.1822## cg18269493 -0.3400004 0.1738861 -38.73995 1.171007e-192 2.415092e-187 429.6075## cg18265162 -0.2540523 0.1249706 -37.97257 7.378373e-188 1.014480e-182 418.5535## cg17046586 0.3694429 0.8027515 37.92441 1.478829e-187 1.524972e-182 417.8581## cg14789818 0.4325909 0.7772445 37.46889 1.073826e-184 8.858657e-180 411.2687## cg01393234 -0.3421082 0.1641259 -36.85815 7.573636e-181 5.206635e-176 402.4054## T_AVG C_AVG deltaBeta CHR MAPINFO Strand Type gene## cg19533977 0.08774981 0.4417328 0.3539830 17 57719682 F II CLTC## cg18269493 0.13700474 0.4770051 0.3400004 1 8772984 F II RERE## cg18265162 0.09741234 0.3514647 0.2540523 1 210407588 R II C1orf133## cg17046586 0.84282668 0.4733838 -0.3694429 12 50665774 R II LIMA1## cg14789818 0.82416958 0.3915786 -0.4325909 1 227748712 R I ## cg01393234 0.12701586 0.4691241 0.3421082 1 226085079 F II ## feature cgi feat.cgi UCSC_CpG_Islands_Name DHS Enhancer## cg19533977 Body opensea Body-opensea NA TRUE## cg18269493 5'UTR opensea 5'UTR-opensea NA NA## cg18265162 TSS200 island TSS200-island chr1:210407401-210407637 TRUE NA## cg17046586 5'UTR opensea 5'UTR-opensea NA TRUE## cg14789818 IGR island IGR-island chr1:227748423-227748860 NA NA## cg01393234 IGR opensea IGR-opensea TRUE TRUE## Phantom Probe_SNPs Probe_SNPs_10## cg19533977 ## cg18269493 ## cg18265162 ## cg17046586 ## cg14789818 ## cg01393234####差异甲基化区域,DMR#myDMR <- champ.DMR(beta=myLoad$beta,pheno=myLoad$pd$Sample_Group,method='Bumphunter')#查看差异甲基化区域#head(myDMR$BumphunterDMR)

接下来,根据文章中的条件进行筛选。

差异甲基化位点

文章中的条件筛选包括:

(1)adj.P<0.05,delta |β|>0.2

(2)位于启动子区域(5’-UTR, TSS200, TSS1500 and 1stExon),那我们开始把。

#提取我们得到的差异甲基化位点矩阵feature_pro=c('1stExon','5'UTR','TSS1500','TSS200')data_dif <- myDMP[[1]] %>% filter(adj.P.Val<0.05&abs(deltaBeta)>0.2)data_dif <- data_dif[data_dif$feature%in%feature_pro,]

最后我们一共得到8778个位点,和作者的10088个差异甲基化位点有1000多个甲基化位点的差异,这可能是因为我们并没有做标准化这一步(数据量太大了,跑不了。。。。) 接下来我们绘制Figure 1B的热图。

复现热图

#加载包library(pheatmap)#设置注释列名的文件ann_colann_col <- data_sampann_col$Sample_Group <- if_else(data_samp$sample_group=='Primary Tumor', data_samp$Sample_Group <- 'Tumor', data_samp$Sample_Group <- 'Normal')ann_col <- ann_col %>% column_to_rownames(var = 'sample_name') %>% dplyr::select(-sample_group) %>%dplyr::select(-case_submitter_id) %>% arrange(Sample_Group)##根据差异甲基化位点提取beta矩阵子集data_heat=data_beta[rownames(data_dif),rownames(ann_col)]pheatmap(data_heat,color =colorRampPalette(c('navy', 'white',
'firebrick3'))(10),cluster_rows = F,cluster_cols = F, legend = F,show_rownames = F,show_colnames = F,annotation_col = ann_col)

好啦,热图的复现差不多就结束了。

—END—
(0)

相关推荐

  • dplyr学习(1)

    先写点无关紧要的: 本来想用Rmarkdown,毕竟可以输出每步结果.但是Rmarkdown生成的html格式比word更方便阅读.后来想想算了,就用最简单的吧. 笔记开始: 先记录一下:select ...

  • 学一学DNA甲基化芯片分析流程

    今天是生信星球陪你的第778天 大神一句话,菜鸟跑半年.我不是大神,但我可以缩短你走弯路的半年~ 就像歌儿唱的那样,如果你不知道该往哪儿走,就留在这学点生信好不好~ 这里有豆豆和花花的学习历程,从新手 ...

  • 甲基化信号值矩阵差异分析主要图表

    差异分析永远是最平易近人的策略,我们前面整理的各种系列教程,miRNA的,lncRNA的,mRNA芯片或者测序,circRNA系列的,都会得到表达矩阵,然后走差异分析.只不过是不同统计学分布的表达矩阵 ...

  • Python-生成模拟原始脑电数据

    更多技术干货第一时间送达 在实验中有时需要原始脑电数据来进行模拟实验,但又限于实验条件的不足,需要构造模拟的原始脑电数据. 本示例通过多次重复所需的源激活来生成原始数据. 案例介绍 # 导入工具包im ...

  • 甲基化芯片数据的一些质控指标

    前面我们介绍了一些背景知识,主要是理解什么是DNA甲基化,为什么要检测它,以及芯片和测序两个方向的DNA甲基化检测技术.具体介绍在:甲基化的一些基础知识,也了解了甲基化芯片的一般分析流程 . 然后下载 ...

  • 齐岳-Beta-雌二醇 3-(Beta-D-葡萄糖苷酸)钠盐,cas14982-12-8 雌二醇衍生物

    基本参数 [中文名]1,3,5(10)-雌三烯-3,17β-二醇 3-葡萄糖醛酸苷 [英文名]β-Estradiol 3-(β-D-glucuronide) sodium salt [中文别名]Bet ...

  • Enders SAMP/RAMP腙烷基化反应

    1976年D. Enders报道了通过(S)-1-氨基-2-甲氧基甲基吡咯烷 (SAMP)腙的衍生物对酮的α位进行不对称烷基化的反应.根据其反应步骤,SAMP腙在四氢呋喃中用LDA在α位脱质子,生成的 ...

  • 甲基化信号值矩阵差异分析哪家强

    前面我们介绍了一些背景知识,主要是理解什么是DNA甲基化,为什么要检测它,以及芯片和测序两个方向的DNA甲基化检测技术.具体介绍在:甲基化的一些基础知识,也了解了甲基化芯片的一般分析流程 . 但是我留 ...

  • 使用bowtie2和samblaster一步到位的干净比对

    bowtie2 以前都是和samtools组合,如下: bowtie2 -x $index -U $id | samtools sort -@ 4  -o $sample.bam - 运行速度很慢,现 ...

  • ChAMP 包分析450K甲基化芯片数据(一站式)

    早在我们举办甲基化芯片专题学习的时候,见: 450K甲基化芯片数据处理传送门 就有非常棒的一站式教程投稿,也因此我结识了优秀的六六,以及其教程大力推荐的R包作者,见: 850K甲基化芯片数据的分析 但 ...

  • 人名反应及常见反应(D-E)

    Dakin-West反应 将α-氨基酸和醋酸酐在碱性条件(吡啶)下反应通过噁唑啉(氨基酸内酯)中间体生成α-乙酰基氨基甲基酮并放出二氧化碳的反应. Dakin氧化反应 碱性条件下芳基醛酮通过过氧化氢氧 ...