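NGS Data Analysis in Practice: 02. Downloading Reference Genomes and Annotation Databases
Contents
1. Reference genome data
2. Known SNP and indel variant data
3. Downloading the databases required for ANNOVAR annotation
4. Other databases
Articles in this series:
Next-Generation Sequencing Methods: Targeted Resequencing in DNA Sequencing
NGS Data Analysis in Practice: 00. The Basic Variant-Calling Workflow
NGS Data Analysis in Practice: 01. Conda Environment Setup and Software Installation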
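1. Reference genome data
Human genome assemblies are distributed mainly through three sources: NCBI, UCSC, and Ensembl.
NCBI: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt
UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/ [UCSC's hg series is the most widely used set of genome builds; the newest build in the series is hg38]
Ensembl: ftp://ftp.ensembl.org/pub/, then pick the release you need.
The versions in the three sources correspond roughly as follows: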
NCBI | UCSC | Ensembl |
---|---|---|
GRCh36 | hg18 | ENSEMBL release_52 |
GRCh37 | hg19 | ENSEMBL release_59/61/64/68/69/75 |
GRCh38 | hg38 | ENSEMBL release_76/77/78/80/81/82/…/104 |
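The rows above pair up assemblies by sequence content, but the naming conventions still differ: UCSC files name sequences with a `chr` prefix (`chr1`), while Ensembl and b37 files use bare names (`1`). Mixing a reference and a VCF from different conventions is a common source of "contig not found" errors. The helper below is a minimal sketch for checking which style a FASTA file uses; the function name `contig_style` is made up for this illustration, and it only inspects the first header line.

```shell
# Report whether a FASTA read on stdin uses UCSC-style ("chr1") or
# bare ("1") sequence names, judged from its first header line only.
contig_style() {
    awk '/^>/ {
        if ($0 ~ /^>chr/) print "ucsc"; else print "plain"
        found = 1
        exit
    }
    END { if (!found) print "unknown" }'
}
```

For example, `gzip -dc ~/reference/hg19/hg19.fa.gz | contig_style` inspects the UCSC download without unpacking it to disk.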
# (1) reference
# 1) hg19
mkdir -p ~/reference/hg19 && cd ~/reference/hg19
nohup wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz &
# 2) hg38
mkdir -p ~/reference/hg38 && cd ~/reference/hg38
nohup wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz &
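Because these downloads run in the background, a dropped connection can leave a truncated archive without any visible error. One way to catch that, sketched here under the assumption that `gzip` is installed, is to run gzip's built-in integrity test on each file once the transfer has finished. The function name `check_gz` is hypothetical.

```shell
# Run gzip's integrity test on each file given as an argument.
# Prints one status line per file; returns non-zero if any file fails.
check_gz() {
    local f status=0
    for f in "$@"; do
        if gzip -t "$f" 2>/dev/null; then
            echo "OK $f"
        else
            echo "BROKEN $f"
            status=1
        fi
    done
    return "$status"
}
```

A typical call would be `check_gz ~/reference/hg19/hg19.fa.gz ~/reference/hg38/hg38.fa.gz`; a file reported as BROKEN can be resumed by rerunning `wget -c` on the same URL.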
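2. Known SNP and indel variant data
The indel realignment and base quality score recalibration (BQSR) steps both need to take variants already known in the population into account. Local realignment re-examines regions where the BWA alignments suggest a possible insertion or deletion (indel), and regions containing known indels are usually added to the realignment targets as well. The underlying cause lies in the sequence characteristics of the reference genome and in the nature of aligners such as BWA: an algorithm that searches globally for the best match is often inaccurate in and around indels, especially in regions with long indels, repetitive sequence, or long homopolymer runs (for example TTTT or AAAAA).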
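To make the role of these files concrete, here is a sketch of how known-sites VCFs are handed to the BQSR step in GATK4 through the `--known-sites` option of `BaseRecalibrator`. Note that GATK4 no longer ships the separate indel realignment tools (they belong to GATK3), so only BQSR is shown. The function `bqsr_cmd` and all file names are placeholders for illustration; the function prints the command instead of running it, so it can be inspected first.

```shell
# Print (but do not run) a GATK4 BaseRecalibrator command line.
# Usage: bqsr_cmd REF.fa IN.bam OUT.table KNOWN1.vcf.gz [KNOWN2.vcf.gz ...]
bqsr_cmd() {
    local ref=$1 bam=$2 out=$3 vcf
    shift 3
    printf 'gatk BaseRecalibrator -R %s -I %s' "$ref" "$bam"
    # Each known-variant file gets its own --known-sites option.
    for vcf in "$@"; do
        printf ' --known-sites %s' "$vcf"
    done
    printf ' -O %s\n' "$out"
}
```

Calling `bqsr_cmd hg38.fa sample.bam recal.table dbsnp_138.hg38.vcf.gz Mills_and_1000G_gold_standard.indels.hg38.vcf.gz` prints a single command that can be reviewed and then piped to `sh`.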
# (2) variants
# 1) hg19
# 1> SNP
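# Absolute path, so the directory is not created under the hg38 folder entered earlier
mkdir -p ~/reference/hg19/DBSNP && cd ~/reference/hg19/DBSNP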
## https://www.ncbi.nlm.nih.gov/projects/SNP/
## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/
## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz &
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz.md5 &
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz.tbi &
wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/dbsnp_138.b37.vcf.gz &
wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/hapmap_3.3.b37.vcf.gz &
# 2> INDEL (the 1000G high-confidence SNP set is also stored here)
mkdir -p ~/reference/hg19/INDEL && cd ~/reference/hg19/INDEL
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.md5
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz.md5
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.idx.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
gunzip 1000G_phase1.indels.b37.vcf.idx.gz
gunzip 1000G_phase1.indels.b37.vcf.gz
gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.gz
gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
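# 2) hg38
# 1> SNP
# Assumed location, mirroring the hg19 directory layout; adjust as needed
mkdir -p ~/reference/hg38/DBSNP && cd ~/reference/hg38/DBSNP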
# http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/
nohup wget -c http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/ALL_20141222.dbSNP142_human_GRCh38.snps.vcf.gz &
nohup wget -c http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/ALL_20141222.dbSNP142_human_GRCh38.snps.vcf.gz.tbi &
# ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_138.hg38.vcf.gz &
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_138.hg38.vcf.gz.tbi &
# ftp://ftp.ncbi.nih.gov/snp/organisms/
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz &
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz.tbi &
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz.md5 &
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz &
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi &
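# https://ftp.ncbi.nih.gov/snp/archive/b155/VCF/
# latest_release tracks the newest dbSNP build, so these file names may have moved;
# if they are missing, look under the archive URL above.
mkdir -p hucy/hg38/dbsnp155 && cd hucy/hg38/dbsnp155
nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz &
nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.md5 &
nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi &
nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi.md5 &
# Wait for the background downloads to finish before checking the checksums
wait
md5sum -c GCF_000001405.39.gz.md5
md5sum -c GCF_000001405.39.gz.tbi.md5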
# 2> Indel
# ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz &
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi &
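Several of the commands above fetch `.md5` companions that are never checked afterwards. `md5sum -c` works when the checksum file records the same file name as the local copy; when it records a different path, comparing only the digest field is more forgiving. The sketch below takes that second approach. The function name `verify_md5` is hypothetical, and it assumes the digest is the first whitespace-separated field on the first line of the `.md5` file.

```shell
# Compare a file's MD5 digest with the first field of its .md5 companion.
# Usage: verify_md5 FILE [MD5FILE]   (MD5FILE defaults to FILE.md5)
verify_md5() {
    local file=$1 md5file=${2:-$1.md5}
    local expected actual
    expected=$(awk 'NR == 1 { print $1 }' "$md5file")
    actual=$(md5sum "$file" | awk '{ print $1 }')
    if [ "$expected" = "$actual" ]; then
        echo "OK $file"
    else
        echo "MISMATCH $file"
        return 1
    fi
}
```

Run it only after the background downloads have finished, for example `verify_md5 All_20180418.vcf.gz`.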
3. Downloading the databases required for ANNOVAR annotation
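The original post gives no commands for this step. As a hedged sketch only: ANNOVAR's own `annotate_variation.pl` script downloads databases with its `-downdb` option, and `-webfrom annovar` selects the ANNOVAR mirror. The helper below prints one such command per database so the list can be reviewed before anything is fetched. The function name `annovar_downdb_cmds`, the output directory `humandb/`, and the choice of databases are assumptions for illustration, not the author's list.

```shell
# Print one annotate_variation.pl download command per database name.
# Usage: annovar_downdb_cmds BUILD OUTDIR DB [DB ...]
annovar_downdb_cmds() {
    local build=$1 outdir=$2 db
    shift 2
    for db in "$@"; do
        printf 'annotate_variation.pl -buildver %s -downdb -webfrom annovar %s %s\n' \
            "$build" "$db" "$outdir"
    done
}
```

For instance, `annovar_downdb_cmds hg38 humandb/ refGene avsnp150` prints two commands; append `| sh` to execute them once ANNOVAR is on the `PATH`.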
4. Other databases
Download the annotation files from the GENCODE database.
mkdir -p ~/reference/gtf/gencode && cd ~/reference/gtf/gencode
## GRCh38 https://www.gencodegenes.org/releases/current.html
mkdir GRCh38_hg38 && cd GRCh38_hg38
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.2wayconspseudos.gtf.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.long_noncoding_RNAs.gtf.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.polyAs.gtf.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.metadata.HGNC.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.metadata.EntrezGene.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.metadata.RefSeq.gz &
## GRCh37
mkdir -p ~/reference/gtf/gencode/GRCh37_hg19 && cd ~/reference/gtf/gencode/GRCh37_hg19
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.annotation.gtf.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.metadata.HGNC.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.metadata.EntrezGene.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.metadata.RefSeq.gz
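Once a GENCODE GTF is on disk, a frequent first task is pulling gene coordinates out of it. The sketch below converts `gene` records from a GTF stream into six-column BED (0-based start), taking the name from the `gene_name` attribute. The function name `gtf_genes_to_bed` is made up for this example, and it assumes the standard nine tab-separated GTF columns.

```shell
# Convert "gene" records of a GTF read on stdin to 6-column BED on stdout.
gtf_genes_to_bed() {
    awk -F '\t' 'BEGIN { OFS = "\t" }
    $0 !~ /^#/ && $3 == "gene" {
        name = "."
        if (match($9, /gene_name "[^"]+"/)) {
            # Skip the 11-character prefix (gene_name, space, opening quote)
            # and drop the closing quote.
            name = substr($9, RSTART + 11, RLENGTH - 12)
        }
        # GTF starts are 1-based; BED starts are 0-based.
        print $1, $4 - 1, $5, name, ".", $7
    }'
}
```

For example, `gzip -dc gencode.v28.annotation.gtf.gz | gtf_genes_to_bed > genes.bed`.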
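Further reading:
hg19, GRCh37, b37, and hs37d5: an introduction and how they differ
Learning Whole-Genome Sequencing Data Analysis from Scratch: Part 4. Building the Main WGS Pipeline
All About Genomes (Part 2)
All About Genomes (Part 3): Preparation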