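Reprocessing single-cell transcriptome data from about 3,500 TNBC cells
Paper: A Targetable EGFR-Dependent Tumor-Initiating Program in Breast Cancer. Because bulk sequencing could not answer the question, the authors turned to single-cell RNA-seq: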
To understand functional properties associated with heterogeneous EGFR expression in an unbiased manner, single cell RNA-seq was performed on freshly dissociated cells from the PDX (3,483 cells, with an average of 40,564 unique molecular identifiers (UMIs) and 5,146 genes detected per cell)
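The data are all in the SRA database: https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP110989
However, the authors did not provide an expression matrix, so the only option is to download the raw data and run the full single-cell RNA-seq pipeline ourselves.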
mkdir -p ~/data/public/TNBC/
cd ~/data/public/TNBC/
nohup wget -c ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR579/SRR5799774/SRR5799774.sra &
nohup wget -c ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR579/SRR5799776/SRR5799776.sra &
nohup wget -c ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR579/SRR5799775/SRR5799775.sra &
# Run the conversion only after the downloads above have finished
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 SRR5799774.sra &
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 SRR5799775.sra &
nohup ~/biosoft/sratoolkit/sratoolkit.2.6.3-centos_linux64/bin/fastq-dump --gzip --split-3 SRR5799776.sra &
After downloading and converting to FASTQ, the files are:
1.7G Jan 22 23:34 SRR5799774_1.fastq.gz
13G Jan 22 23:34 SRR5799774_2.fastq.gz
9.4G Jan 22 17:40 SRR5799774.sra
1.7G Jan 22 23:33 SRR5799775_1.fastq.gz
13G Jan 22 23:33 SRR5799775_2.fastq.gz
9.4G Jan 22 17:31 SRR5799775.sra
2.9G Jan 23 00:55 SRR5799776_1.fastq.gz
24G Jan 23 00:55 SRR5799776_2.fastq.gz
18G Jan 22 18:25 SRR5799776.sra
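The two read files differ greatly in size, because this is not ordinary paired-end sequencing.
The sequencing layout has to be looked up in the paper; its supplementary material describes it: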
26 bp Read1, 8 bp I7 Index, 0 bp I5 Index and 98 bp Read2.
The sequencing depth is: a total of 717,982,475 reads, and 179,137 reads per single-cell
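A quick way to compare the downloaded files against this layout is to tabulate the read lengths in each FASTQ file. The following is a minimal sketch, not part of the original workflow: the function name `fastq_read_lengths` is hypothetical, and it assumes standard four-line FASTQ records such as those written by fastq-dump.

```shell
# Tabulate read lengths among the first N reads of a gzipped FASTQ file.
# Prints one "length count" pair per line, sorted by length.
# Assumes four-line FASTQ records (header, sequence, '+', qualities).
fastq_read_lengths() {
    # $1: path to a .fastq.gz file
    # $2: number of reads to sample (optional, default 10000)
    gzip -dc "$1" \
        | head -n "$(( ${2:-10000} * 4 ))" \
        | awk 'NR % 4 == 2 { n[length($0)]++ } END { for (l in n) print l, n[l] }' \
        | sort -n
}

# Example calls, using file names from the listing above:
# fastq_read_lengths SRR5799774_1.fastq.gz
# fastq_read_lengths SRR5799774_2.fastq.gz
```

For an intact upload with the layout quoted above, one would expect to find 26 bp and 98 bp reads among the files; sampling only the first reads keeps the check fast on multi-gigabyte files.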
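Because the data were generated with the 10x Genomics platform, they need to be processed with the company's own tool, Cell Ranger, which can be downloaded and installed after a simple registration. I downloaded a test dataset and found: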
├── [237M] neurons_900_S1_L001_I1_001.fastq.gz
├── [642M] neurons_900_S1_L001_R1_001.fastq.gz
├── [1.8G] neurons_900_S1_L001_R2_001.fastq.gz
├── [238M] neurons_900_S1_L002_I1_001.fastq.gz
├── [646M] neurons_900_S1_L002_R1_001.fastq.gz
└── [1.8G] neurons_900_S1_L002_R2_001.fastq.gz
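Again the read files differ in size, and each run comes with three files: 26 bp Read 1 (16 bp Chromium barcode and 10 bp UMI), 98 bp Read 2 (transcript), and an 8 bp I7 sample barcode. Only the Read 2 FASTQ contains actual transcript sequence; the other two files hold barcodes. Data in this form can be analyzed directly with Cell Ranger, using the following code: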
/home/jianmingzeng/biosoft/10xgenomic/cellranger-2.1.0/cellranger count --id=neurons \
--localcores 5 \
--transcriptome=/home/jianmingzeng/biosoft/10xgenomic/db/refdata-cellranger-mm10-1.2.0 \
--fastqs=/home/jianmingzeng/data/public/10x/neurons_900_fastqs \
--sample=neurons \
--expect-cells=900
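To make the Read 1 layout described above concrete (16 bp Chromium barcode followed by 10 bp UMI), here is a small sketch that splits one 26 bp sequence into its two parts. The function name `split_10x_r1` is hypothetical and the example sequence is made up; Cell Ranger does this parsing internally, so this is only an illustration of the layout.

```shell
# Split a 26 bp 10x Read 1 sequence into its cell barcode and UMI.
# Layout assumed from the text: bases 1-16 = barcode, bases 17-26 = UMI.
split_10x_r1() {
    # $1: Read 1 sequence; must be exactly 26 bases for this layout
    if [ "${#1}" -ne 26 ]; then
        echo "expected 26 bp, got ${#1} bp" >&2
        return 1
    fi
    barcode=$(printf '%s' "$1" | cut -c1-16)
    umi=$(printf '%s' "$1" | cut -c17-26)
    printf '%s %s\n' "$barcode" "$umi"
}

# Example with a made-up sequence:
# split_10x_r1 AAAACCCCGGGGTTTTACGTACGTAC
```

The length check matters here: a read of any other length cannot carry both the barcode and the UMI, so the function refuses it rather than returning a truncated result.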
However, the data uploaded by the authors are missing key information, so I wrote to 10x Genomics to ask about it:
I just read a paper: A Targetable EGFR-Dependent Tumor-Initiating Program in Breast Cancer
and they chose 10x Genomics for scRNA-seq and uploaded the raw data to the SRA database. I have downloaded them; there should be 26 bp Read1, 8 bp I7 Index, 0 bp I5 Index and 98 bp Read2.
But I only found 8 bp reads in fq1 and 98 bp reads in fq2. The key information is lost, which means I can't use Cell Ranger to process them.
Any help ?
The company replied that the data cannot be processed if the barcode information is missing.
Michael Campbell (10x Genomics) Jan 26, 07:03 PST
Hi Jianming,
That's right if you don't have the 26bp read with the 10x barcode and UMI in it you can't use Cell Ranger, or any other tool for that matter because there is no way to related the second read to the cell it came from. I would contact the corresponding author to see what happened to the R1 read. If you want, you can send me the SRR number and I can have a look to see if the R1 read is buried somewhere.
Best,
Mike
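I then sent the paper and the SRA accession numbers. The 10x Genomics representative checked again, and it was indeed a mistake on the authors' side.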
Hi Jainming,
It looks like they uploaded the index read as read 1 instead of the read with the barcode. It's not analyzable in this format.
Best,
Mike
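The diagnosis in this reply can be expressed as a simple lookup from read length to read role. This is only a sketch: the function name `guess_10x_read_role` is hypothetical, and the lengths are the ones quoted from the paper's supplementary material, so the mapping applies to this library layout only.

```shell
# Map an observed read length to its likely role in this 10x library.
# Lengths are those quoted from the supplement: 26 bp Read1, 8 bp I7, 98 bp Read2.
guess_10x_read_role() {
    # $1: read length in bases
    case "$1" in
        26) echo "R1 (cell barcode + UMI)" ;;
        8)  echo "I1 (I7 sample index)" ;;
        98) echo "R2 (transcript)" ;;
        *)  echo "unknown" ;;
    esac
}

# Example: the uploaded fq1 contained 8 bp reads
# guess_10x_read_role 8
```

Applied to the uploaded files, the 8 bp reads in fq1 map to the sample index and the 98 bp reads in fq2 to the transcript read, which leaves no file carrying the 26 bp barcode and UMI read.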
If this dataset were intact, it could be processed following our earlier tutorials: