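Uniform reprocessing of more than 20,000 RNA-seq samples (TCGA + GTEx + TARGET)
Large sequencing projects have already produced a wealth of RNA-seq data, but anyone who wants to analyze several databases together has to face batch effects, including those introduced by each project's own processing pipeline. The UCSC team therefore reprocessed all of these data with a single uniform pipeline on Amazon's cloud, at a cost of about $1.30 per sample.
Published in Nature Biotechnology: https://doi.org/10.1038/nbt.3772
The three databases are: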
The Cancer Genome Atlas (TCGA)
Genotype-Tissue Expression (GTEx)
Therapeutically Applicable Research To Generate Effective Treatments (TARGET)
A web tool is also provided for querying the data, for example:
Differential gene and isoform expression of FOXM1 transcription factor in TCGA vs. GTEx
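Data processing pipeline
As shown in the pipeline figure: CutAdapt was used for adapter trimming, STAR was used for alignment, and RSEM and Kallisto were used as quantifiers.
Pipeline overview
If you have questions about the RNA-seq data processing pipeline, go straight to my complete 74-hour introductory bioinformatics video series: 生信技能树 video course learning path. Videos this good, and they are free!
Choice of reference genome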
STAR, RSEM, and Kallisto indexes were all built with the same reference genome. HG38 (no alt analysis) with overlapping genes from the PAR locus removed (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415).
ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines
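To see what the PAR exclusion means in practice, one can test whether a gene's chrY coordinates fall inside either of the two ranges quoted above. The following is only a minimal sketch: `overlaps_par` is a hypothetical helper name, it assumes 1-based inclusive coordinates, and it is not part of the Toil pipeline itself.

```shell
# Hypothetical helper (not part of the Toil pipeline):
# overlaps_par CHROM START END -> prints "PAR" or "non-PAR"
overlaps_par() {
  awk -v chrom="$1" -v start="$2" -v end="$3" 'BEGIN {
    # The two chrY PAR ranges quoted in the text
    n = split("10000-2781479 56887902-57217415", par, " ")
    hit = 0
    if (chrom == "chrY") {
      for (i = 1; i <= n; i++) {
        split(par[i], r, "-")
        # Two closed intervals overlap when each starts before the other ends
        if (start + 0 <= r[2] + 0 && end + 0 >= r[1] + 0) hit = 1
      }
    }
    print (hit ? "PAR" : "non-PAR")
  }'
}
```

For example, `overlaps_par chrY 2000000 2100000` prints `PAR`, while the same interval on chrX prints `non-PAR`.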
Choice of annotation files
RSEM: Gencode V23 comprehensive annotation (CHR)
http://www.gencodegenes.org/releases/23.html first row
Kallisto: Gencode V23 comprehensive annotation (ALL)
http://www.gencodegenes.org/releases/23.html second row
Choice of software parameters
STAR
sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/star --runThreadN 32 --runMode genomeGenerate --genomeDir /data/genomeDir --genomeFastaFiles hg38.fa --sjdbGTFfile gencode.v23.annotation.gtf
Kallisto
sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta
Kallisto index that was used during the recompute is available here.
RSEM
sudo docker run -v $(pwd):/data --entrypoint=rsem-prepare-reference jvivian/rsem -p 4 --gtf gencode.v23.annotation.gtf hg38.fa hg38
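As you can see, the three elements above (reference genome, annotation files, and software parameters) follow the same basic pattern I used when writing tutorials on the 生信菜鸟团 blog five years ago.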
Raw data
Nature Publication Supplementary Note 7 – Data Availability
Submitter sample ID to Xena sample ID mapping
TCGA mapping
GTEx mapping
TARGET mapping
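Once the three projects sit in one matrix, it is often handy to tell which project a sample came from. The sketch below assumes that sample IDs in the combined cohort start with `TCGA-`, `GTEX-`, or `TARGET-`; that prefix convention is my assumption, not something stated in the mapping files, so check it against the actual IDs. `cohort_of` is a hypothetical helper name.

```shell
# Hypothetical helper: guess the source project from a sample ID prefix.
# Assumes IDs begin with TCGA-, GTEX-, or TARGET-; anything else is "unknown".
cohort_of() {
  case "$1" in
    TCGA-*)   echo "TCGA" ;;
    GTEX-*)   echo "GTEx" ;;
    TARGET-*) echo "TARGET" ;;
    *)        echo "unknown" ;;
  esac
}
```

Given a tab-separated matrix with sample IDs in the header row (the file name `matrix.tsv` is made up here), the samples per project could then be counted with `head -n 1 matrix.tsv | tr '\t' '\n' | tail -n +2 | while read -r id; do cohort_of "$id"; done | sort | uniq -c`.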
Final datasets released for download
GTEX (11 datasets)
TARGET Pan-Cancer (PANCAN) (12 datasets)
TCGA and TARGET Pan-Cancer (PANCAN) (4 datasets)
TCGA Pan-Cancer (PANCAN) (10 datasets)
TCGA TARGET GTEx (13 datasets)
Of these, the TCGA TARGET GTEx cohort (all three databases combined) has 13 datasets:
cohort: TCGA TARGET GTEx
The expression matrices have impressive sample sizes
gene expression RNAseq (UCSC Toil RNAseq Recompute; TCGA TARGET GTEx gene expression by UCSC TOIL RNA-seq recompute)
RSEM expected_count (n=19,109)
RSEM expected_count (DESeq2 standardized) (n=19,039): RSEM expected_count output normalized using DESeq2
RSEM fpkm (n=19,131)
RSEM norm_count (n=19,120)
RSEM tpm (n=19,131)
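One practical point when using these matrices: as far as I know, the Xena dataset pages describe the stored TPM values as log2(tpm + 0.001) rather than raw TPM. Treat that as an assumption and confirm the unit on the dataset page before relying on it. Under that assumption, a value can be converted back with the small sketch below, where `unlog_tpm` is a hypothetical helper name and the pseudocount can be overridden for datasets that use a different one.

```shell
# Hypothetical helper: undo a log2(x + pseudocount) transform.
# unlog_tpm VALUE [PSEUDOCOUNT] -> prints 2^VALUE - PSEUDOCOUNT
# The default pseudocount of 0.001 is an assumption about the TPM dataset.
unlog_tpm() {
  awk -v v="$1" -v p="${2:-0.001}" 'BEGIN { printf "%.3f\n", 2 ^ v - p }'
}
```

For instance, `unlog_tpm 3` prints `7.999`, and `unlog_tpm 10 1` (a pseudocount of 1) prints `1023.000`.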
phenotype (UCSC Toil RNAseq Recompute)
TCGA GTEX main categories (n=17,221)
TCGA survival data (n=10,496)
TCGA TARGET GTEX selected phenotypes (n=19,131)
somatic mutation (SNP and INDEL) (UCSC Toil RNAseq Recompute)
TCGA somatic mutations (Pan-cancer Atlas MC3 public version) (n=8,463)
transcript expression RNAseq (UCSC Toil RNAseq Recompute; TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute)
RSEM expected_count (n=19,109)
RSEM fpkm (n=19,129)
RSEM isoform percentage (n=19,131)
RSEM tpm (n=19,131)