下载TCGA所有癌症的maf文件做signature分析
才sanger研究所已经做好了这个分析,但是值得我们重复一下,效果如下:
首先TCGA所有癌症的maf文件
maf格式的mutation记录文件在TCGA里面已经是level4的数据啦,所以是完全open的,可以随意下载,只需要去其GDC官网简单点击,选择即可。
主要步骤就是在https://portal.gdc.cancer.gov/repository里面点击过滤文件类型,选择maf格式,再过滤access权限,选择open即可,最后得到的132个文件就是我们需要的。
总共是2.19GB的文件,每个癌症种类都有4种maf文件,分别是用mutect,muse,vanscan,somaticsniper这4款软件call 到的somatic mutation文件。
下载方式这里我选择下载它们132个文件的manifest文件,然后用GDC提供的官方工具来下载!关于这个工具,我 在生信技能树论坛写过教程,就不多说了,自己去看哈,现在下载TCGA数据也是非常方便,首先是GDC网站及客户端 就是安装成功后,运行 ./gdc-client download -m manifest_xxx.txt
j即可。这个manifest文件就是自己刚才创造并且下载的。
cd ~/institute/TCGA/GDC_NCBI/all
~/biosoft/GDC/gdc-client download -m gdc_manifest.2017-08-25T02-57-11.281090.txt
但是这个工具,提供的电脑操作系统版本有限哦If you are a user of CentOS 6 or RedHat Enterprise Release 6 and wish to use the Data Transfer Tool, contact the GDC Help Desk for assistance.
所以我是在MAC里面下载好了,再上传到我的服务器去的!
然后根据MAF文件制作signature
我是根据这篇文章Mutational Profile of Metastatic Breast Cancers: A Retrospective Analysis里面的方法来做的,他们的方法描述如下:
De novo mutational signature analysis was done using the Matlab Welcome Trust Sanger Institute’s signature framework. We used the deconstructSigs R package to determine the contribution of the known signatures that explain each sample mutational profile with more than 50 somatic mutations. We considered the 13 signatures (Signatures 1, 2, 3, 5, 6, 8, 10, 13, 17, 18, 20, 26, and 30) operative in breast cancer as defined in COSMIC (http://cancer.sanger.ac.uk/signatures/matrix.png). A signature was defined as operative or predominant if its contribution to the mutational pattern was respectively >25% (or >100 mutations) or >50%.
虽然我以前写过一个类似的教程,用SomaticSignatures包来解析maf突变数据获得mutation signature 这里还是再学习学习这个新的工具deconstructSigs R package吧。
这是个R包,所以直接在Rstudio里面安装即可,这里选取BRCA的somatic mutation的MAF文件做一下分析,看看四个软件找出的变异,是否在signature上面有差异。
install.packages('deconstructSigs')
# dependencies 'BSgenome', 'BSgenome.Hsapiens.UCSC.hg19'
BiocInstaller::biocLite('BSgenome')
BiocInstaller::biocLite('BSgenome.Hsapiens.UCSC.hg19')
## https://github.com/raerose01/deconstructSigs
file1='TCGA.STAD.muse..somatic.maf.gz'
TCGA.STAD.muse=read.table(file1,sep = '\t',quote="",header = T)
TCGA.STAD.muse[1:5,1:15]
## data frame including 5 columns: sample.ID,chr,pos,ref,alt
sample.mut.ref <- data.fram(Sample='TCGA.STAD.muse',
chr = TCGA.STAD.muse[,5],
pos = TCGA.STAD.muse[,6],
ref = TCGA.STAD.muse[,11],
alt = TCGA.STAD.muse[,13])
sigs.input <- mut.to.sigs.input(mut.ref = sample.mut.ref,
sample.id = "Sample",
chr = "chr",
pos = "pos",
ref = "ref",
alt = "alt")
class(sigs.input)
sample_1 = whichSignatures(tumor.ref = sigs.input,
signatures.ref = signatures.nature2013,
sample.id = 'TCGA.STAD.muse',
contexts.needed = TRUE,
tri.counts.method = 'exome')
# Plot example
plot_example <- whichSignatures(tumor.ref = sigs.input ,
signatures.ref = signatures.nature2013,
sample.id = 'TCGA.STAD.muse' )
# Plot output
plotSignatures(plot_example, sub = 'example')
这个里面有一个问题,就是deconstructSigs R package似乎只支持hg19版本的基因组,而我下载的TCGA的MAF是hg38版本的,所以代码虽然是对的,但实际上做出的结果是不对的,需要把下载的TCGA的maf文件进行坐标转换。
而所谓批量,无非就是在上面的R脚本里面加入一个循环咯。
(点击阅读原文有这个包的详细说明书哈!)
注意事项,下载的MAF文件可能有两种格式 ,可能是47列,或者120列,第一行一般都是 头文件,注释着每一列的信息,的确,信息量有点略大。如下:
1 Hugo_Symbol
2 Entrez_Gene_Id
3 Center
4 NCBI_Build
5 Chromosome
6 Start_Position
7 End_Position
8 Strand
9 Consequence
10 Variant_Classification
11 Variant_Type
12 Reference_Allele
13 Tumor_Seq_Allele1
14 Tumor_Seq_Allele2
15 dbSNP_RS
16 dbSNP_Val_Status
17 Tumor_Sample_Barcode
18 Matched_Norm_Sample_Barcode
19 Match_Norm_Seq_Allele1
20 Match_Norm_Seq_Allele2
21 Tumor_Validation_Allele1
22 Tumor_Validation_Allele2
23 Match_Norm_Validation_Allele1
24 Match_Norm_Validation_Allele2
25 Verification_Status
26 Validation_Status
27 Mutation_Status
28 Sequencing_Phase
29 Sequence_Source
30 Validation_Method
31 Score
32 BAM_File
33 Sequencer
34 t_ref_count
35 t_alt_count
36 n_ref_count
37 n_alt_count
38 HGVSc
39 HGVSp
40 HGVSp_Short
41 Transcript_ID
42 RefSeq
43 Protein_position
44 Codons
45 Hotspot
46 cDNA_change
47 Amino_Acid_Change
1 Hugo_Symbol
2 Entrez_Gene_Id
3 Center
4 NCBI_Build
5 Chromosome
6 Start_Position
7 End_Position
8 Strand
9 Variant_Classification
10 Variant_Type
11 Reference_Allele
12 Tumor_Seq_Allele1
13 Tumor_Seq_Allele2
14 dbSNP_RS
15 dbSNP_Val_Status
16 Tumor_Sample_Barcode
17 Matched_Norm_Sample_Barcode
18 Match_Norm_Seq_Allele1
19 Match_Norm_Seq_Allele2
20 Tumor_Validation_Allele1
21 Tumor_Validation_Allele2
22 Match_Norm_Validation_Allele1
23 Match_Norm_Validation_Allele2
24 Verification_Status
25 Validation_Status
26 Mutation_Status
27 Sequencing_Phase
28 Sequence_Source
29 Validation_Method
30 Score
31 BAM_File
32 Sequencer
33 Tumor_Sample_UUID
34 Matched_Norm_Sample_UUID
35 HGVSc
36 HGVSp
37 HGVSp_Short
38 Transcript_ID
39 Exon_Number
40 t_depth
41 t_ref_count
42 t_alt_count
43 n_depth
44 n_ref_count
45 n_alt_count
46 all_effects
47 Allele
48 Gene
49 Feature
50 Feature_type
51 One_Consequence
52 Consequence
53 cDNA_position
54 CDS_position
55 Protein_position
56 Amino_acids
57 Codons
58 Existing_variation
59 ALLELE_NUM
60 DISTANCE
61 TRANSCRIPT_STRAND
62 SYMBOL
63 SYMBOL_SOURCE
64 HGNC_ID
65 BIOTYPE
66 CANONICAL
67 CCDS
68 ENSP
69 SWISSPROT
70 TREMBL
71 UNIPARC
72 RefSeq
73 SIFT
74 PolyPhen
75 EXON
76 INTRON
77 DOMAINS
78 GMAF
79 AFR_MAF
80 AMR_MAF
81 ASN_MAF
82 EAS_MAF
83 EUR_MAF
84 SAS_MAF
85 AA_MAF
86 EA_MAF
87 CLIN_SIG
88 SOMATIC
89 PUBMED
90 MOTIF_NAME
91 MOTIF_POS
92 HIGH_INF_POS
93 MOTIF_SCORE_CHANGE
94 IMPACT
95 PICK
96 VARIANT_CLASS
97 TSL
98 HGVS_OFFSET
99 PHENO
100 MINIMISED
101 ExAC_AF
102 ExAC_AF_Adj
103 ExAC_AF_AFR
104 ExAC_AF_AMR
105 ExAC_AF_EAS
106 ExAC_AF_FIN
107 ExAC_AF_NFE
108 ExAC_AF_OTH
109 ExAC_AF_SAS
110 GENE_PHENO
111 FILTER
112 CONTEXT
113 src_vcf_id
114 tumor_bam_uuid
115 normal_bam_uuid
116 case_id
117 GDC_FILTER
118 COSMIC
119 MC3_Overlap
120 GDC_Validation_Status