Batch-downloading all files of a GEO dataset with wget
Our single-cell transcriptome tutorials are nearly finished, so it is time to move on to single-cell ATAC and spatial single-cell data. I found this classic single-cell ATAC dataset: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE129785 . The corresponding paper is: Massively parallel single-cell chromatin landscapes of human immune cell development and intratumoral T cell exhaustion. Nat Biotechnol 2019 Aug;37(8):925-936. PMID: 31375813
The GEO page lists the samples in this series.
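There are quite a few files, and one of them alone is 67G, so working on a personal computer is hardly feasible; I went straight to a server instead. Here is how I batch-download all files of a GEO dataset with wget: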
First, look at the file list (pay attention to the URL of the page):
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/
The URL follows a pattern, and so do the file names, which are listed below:
GSE129785_RAW.tar 2019-04-18 13:03 67G
GSE129785_scATAC-Hematopoiesis-All.cell_barcodes.txt.gz 2019-04-14 19:30 2.9M
GSE129785_scATAC-Hematopoiesis-All.mtx.gz 2019-04-14 19:20 6.0G
GSE129785_scATAC-Hematopoiesis-All.peaks.txt.gz 2019-04-14 19:30 4.3M
GSE129785_scATAC-Hematopoiesis-CD34.cell_barcodes.txt.gz 2019-04-14 19:30 849K
GSE129785_scATAC-Hematopoiesis-CD34.mtx.gz 2019-04-14 19:21 2.1G
GSE129785_scATAC-Hematopoiesis-CD34.peaks.txt.gz 2019-04-14 19:30 4.3M
GSE129785_scATAC-PBMCs-Fresh.cell_barcodes.txt.gz 2019-04-14 19:30 192K
GSE129785_scATAC-PBMCs-Fresh.mtx.gz 2019-04-14 19:21 415M
GSE129785_scATAC-PBMCs-Fresh.peaks.txt.gz 2019-04-14 19:30 1.8M
GSE129785_scATAC-PBMCs-Frozen.cell_barcodes.txt.gz 2019-04-14 19:30 228K
GSE129785_scATAC-PBMCs-Frozen.mtx.gz 2019-04-14 19:21 209M
GSE129785_scATAC-PBMCs-Frozen.peaks.txt.gz 2019-04-14 19:30 1.2M
GSE129785_scATAC-PBMCs-FrozenSort.cell_barcodes.txt.gz 2019-04-14 19:30 207K
GSE129785_scATAC-PBMCs-FrozenSort.mtx.gz 2019-04-14 19:21 319M
GSE129785_scATAC-PBMCs-FrozenSort.peaks.txt.gz 2019-04-14 19:30 1.3M
GSE129785_scATAC-TME-All.cell_barcodes.txt.gz 2019-04-14 19:30 1.7M
GSE129785_scATAC-TME-All.mtx.gz 2019-04-14 19:21 4.5G
GSE129785_scATAC-TME-All.peaks.txt.gz 2019-04-14 19:30 4.4M
GSE129785_scATAC-TME-TCells.cell_barcodes.txt.gz 2019-04-14 19:30 1.3M
GSE129785_scATAC-TME-TCells.mtx.gz 2019-04-14 19:21 2.8G
GSE129785_scATAC-TME-TCells.peaks.txt.gz 2019-04-14 19:30 4.4M
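The URL pattern noted above can be sketched as a small shell function. This is only an illustration: the function name `geo_suppl_url` is made up here, and the rule it encodes (replace the last three digits of the accession with `nnn`, or use `GSEnnn` for accessions with three digits or fewer) is my reading of the GEO FTP layout rather than something spelled out on this page.

```shell
# Hypothetical helper: build the supplementary-file directory URL
# for a GEO series accession such as GSE129785.
# The body runs in a subshell so its variables do not leak out.
geo_suppl_url() (
  acc="$1"
  digits="${acc#GSE}"            # strip the GSE prefix -> 129785
  if [ "${#digits}" -le 3 ]; then
    stem="GSEnnn"                # assumed rule for short accessions
  else
    stem="GSE${digits%???}nnn"   # drop the last three digits -> GSE129nnn
  fi
  printf 'https://ftp.ncbi.nlm.nih.gov/geo/series/%s/%s/suppl/\n' "$stem" "$acc"
)

geo_suppl_url GSE129785
```

For GSE129785 this prints the same directory URL shown earlier, so the base URL does not have to be typed by hand for each new series.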
Simply save the listing above into a text file, list.txt, and the following commands will download everything in one go:
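mkdir -p ~/scRNA/atac
cd ~/scRNA/atac
# list.txt (saved in this directory) holds the listing above; column 1 is the file name
awk '{print $1}' list.txt | while read -r id; do
  ( nohup wget -c "https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl/$id" & )
done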
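The loop starts one background wget per file, so all 22 downloads run at the same time. If that turns out to be too heavy for the server or the network, a possible variation is to cap the number of simultaneous downloads. The sketch below is one way to do that, not the method used in this post: the function names `make_urls` and `fetch_urls` and the file urls.txt are invented for illustration, and it assumes an xargs that supports `-P` (GNU and BSD versions do).

```shell
base=https://ftp.ncbi.nlm.nih.gov/geo/series/GSE129nnn/GSE129785/suppl

# Turn the listing (file name in column 1) into one full URL per line,
# skipping blank lines.
make_urls() {
  awk -v base="$base" 'NF { print base "/" $1 }' "$1"
}

# Download the URLs in file $1, at most $2 at a time (default 4).
# wget -c resumes partially downloaded files; -nv keeps the log short.
fetch_urls() {
  xargs -n 1 -P "${2:-4}" wget -c -nv < "$1"
}

# Usage, from the download directory on the server:
#   make_urls list.txt > urls.txt
#   fetch_urls urls.txt 4
```

Because `fetch_urls` stays in the foreground, it would need to run inside something like screen or tmux, or from a script started with nohup, to survive a dropped SSH session.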
With this very simple shell loop, all the files arrive, as shown below:
67G Apr 19 2019 GSE129785_RAW.tar
2.9M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.cell_barcodes.txt.gz
6.0G Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.mtx.gz
4.3M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-All.peaks.txt.gz
849K Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.cell_barcodes.txt.gz
2.2G Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.mtx.gz
4.3M Apr 15 2019 GSE129785_scATAC-Hematopoiesis-CD34.peaks.txt.gz
193K Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.cell_barcodes.txt.gz
415M Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.mtx.gz
1.9M Apr 15 2019 GSE129785_scATAC-PBMCs-Fresh.peaks.txt.gz
229K Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.cell_barcodes.txt.gz
210M Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.mtx.gz
1.2M Apr 15 2019 GSE129785_scATAC-PBMCs-Frozen.peaks.txt.gz
208K Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.cell_barcodes.txt.gz
320M Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.mtx.gz
1.3M Apr 15 2019 GSE129785_scATAC-PBMCs-FrozenSort.peaks.txt.gz
1.8M Apr 15 2019 GSE129785_scATAC-TME-All.cell_barcodes.txt.gz
4.6G Apr 15 2019 GSE129785_scATAC-TME-All.mtx.gz
4.4M Apr 15 2019 GSE129785_scATAC-TME-All.peaks.txt.gz
1.3M Apr 15 2019 GSE129785_scATAC-TME-TCells.cell_barcodes.txt.gz
2.9G Apr 15 2019 GSE129785_scATAC-TME-TCells.mtx.gz
4.4M Apr 15 2019 GSE129785_scATAC-TME-TCells.peaks.txt.gz
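With files this large, matching sizes do not prove that every transfer completed cleanly, so it may be worth testing the compressed files before starting any analysis. The function below is a minimal sketch of such a check; the name `check_gz` is made up here, and it relies only on `gzip -t`, which tests the integrity of a compressed file without unpacking it.

```shell
# Hypothetical helper: test every .gz file in a directory (default: the
# current one) and report which ones pass gzip's integrity check.
# Returns 1 if at least one file fails, 0 otherwise.
check_gz() (
  dir="${1:-.}"
  bad=0
  for f in "$dir"/*.gz; do
    [ -e "$f" ] || continue      # the glob matched nothing
    if gzip -t "$f" 2>/dev/null; then
      echo "OK   $f"
    else
      echo "BAD  $f"
      bad=1
    fi
  done
  return "$bad"
)
```

Running `check_gz ~/scRNA/atac` prints one line per file; any file marked BAD can be fetched again with `wget -c`. The 67G GSE129785_RAW.tar is not gzip-compressed, so listing its contents with `tar -tf` would be the analogous check for it.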
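Super simple, isn't it?
Of course, the downstream analysis is where the real suffering begins. Experience with regular ATAC-seq gives us a foundation, so picking up the new technique should go quickly, but the usual share of frustration will still be there!
Another super classic single-cell ATAC dataset is: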
GSE96772. Integrated Single-Cell Analysis Maps the Continuous Regulatory Landscape of Human Hematopoietic Differentiation. Cell 2018 May 31;173(6):1535-1548.e16. PMID: 29706549
If you are interested, you can batch-download its files the same way I did and then start the downstream analysis!
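For a second dataset such as GSE96772, copying the file names from the web page by hand can be replaced by pulling them out of the directory index itself. The sketch below assumes the index is a plain HTML listing whose links are relative file names starting with the accession; the function name `list_suppl_files` is invented for illustration, and the GSE96772 directory URL in the usage comment is inferred from the same URL pattern rather than checked against the site.

```shell
# Hypothetical helper: read a GEO directory index (HTML) on stdin and
# print the linked file names that start with the given accession.
list_suppl_files() {
  grep -o 'href="[^"]*"' | sed 's/^href="//; s/"$//' | grep "^$1"
}

# Usage (URL assumed from the GSEnnn pattern, not verified here):
#   curl -s https://ftp.ncbi.nlm.nih.gov/geo/series/GSE96nnn/GSE96772/suppl/ \
#     | list_suppl_files GSE96772 > list.txt
```

The resulting list.txt has one file name per line, so the same `awk '{print $1}'` download loop works on it once the URL in the loop is changed to the GSE96772 directory.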