1. Reference genome data

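The human reference genome is distributed mainly through three databases: NCBI, UCSC and Ensembl.
NCBI: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/all_assembly_versions/
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt
UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/ [UCSC's hg series is currently the most frequently used set of assemblies; its latest version is hg38.]
Ensembl: ftp://ftp.ensembl.org/pub/, then pick the release you need.
The versions in the three databases correspond roughly as follows: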

NCBI	UCSC	Ensembl
NCBI36 (GRCh36)	hg18	Ensembl release_52
GRCh37	hg19	Ensembl release_59/61/64/68/69/75
GRCh38	hg38	Ensembl release_76/77/78/80/81/82/…/104
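The correspondence in the table can be captured in a small lookup helper, which is handy when a script receives an NCBI-style assembly name but needs the UCSC one. This is only a sketch: the function name grc_to_ucsc is made up for illustration, and it covers nothing beyond the three rows listed above.

```shell
# Hypothetical helper: translate an NCBI/GRC assembly name into the
# matching UCSC name, following the correspondence table above.
grc_to_ucsc() {
    case "$1" in
        GRCh36|NCBI36) echo "hg18" ;;
        GRCh37)        echo "hg19" ;;
        GRCh38)        echo "hg38" ;;
        *)
            # Anything not in the table is reported as an error
            echo "unknown assembly: $1" >&2
            return 1
            ;;
    esac
}

# Example: build=$(grc_to_ucsc GRCh38)
```

Keep in mind that the table only says the versions correspond roughly, so a name lookup like this does not guarantee that the files from two databases are identical.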

# (1) reference
# 1) hg19
mkdir -p ~/reference/hg19 && cd ~/reference/hg19
nohup wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz &

# 2) hg38
mkdir -p ~/reference/hg38 && cd ~/reference/hg38
nohup wget -c http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz &
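The two downloads above follow the same URL pattern, so they can be wrapped in one function that also checks the result. The sketch below is an illustration rather than part of the original workflow: the names ucsc_fasta_url, fetch_reference and the REFERENCE_ROOT variable are assumptions, and the download runs in the foreground so that the integrity check can follow it.

```shell
# Build the UCSC bigZips URL for a genome build such as hg19 or hg38
# (pattern copied from the wget commands above).
ucsc_fasta_url() {
    printf 'http://hgdownload.cse.ucsc.edu/goldenPath/%s/bigZips/%s.fa.gz\n' "$1" "$1"
}

# Download one build into $REFERENCE_ROOT/<build> (default ~/reference/<build>)
# and test the gzip stream afterwards.
fetch_reference() {
    build=$1
    dest="${REFERENCE_ROOT:-$HOME/reference}/$build"
    mkdir -p "$dest" || return 1
    # -c resumes a partial download; -P sets the target directory
    wget -c -P "$dest" "$(ucsc_fasta_url "$build")" || return 1
    # gzip -t reads the whole file and fails if it is truncated or corrupt
    gzip -t "$dest/$build.fa.gz"
}

# Example: fetch_reference hg38
```

gzip -t only confirms that the archive is complete and readable; it does not compare the file against a published checksum.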

2. Known SNP and Indel variant data

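The Indel local realignment and base quality score recalibration (BQSR) steps both need to take variants already known in the population into account. Local realignment re-examines the regions where the BWA alignment suggests a potential insertion or deletion (Indel for short), and regions containing known Indels are usually added to the realignment targets as well. The underlying reason lies in the sequence characteristics of the reference genome and in aligners such as BWA themselves: algorithms of this kind, which search globally for the best match, are often inaccurate in and around Indels, especially in regions with long Indels, repetitive sequence, or long runs of a single base (for example a long stretch of TTTT or AAAAA).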
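To show where the known-variant files end up, the sketch below prints a BQSR command line with one --known-sites argument per VCF. It assumes GATK4's BaseRecalibrator tool; the function name bqsr_command and the BAM and reference file names in the example are placeholders, and the function only prints the command instead of running it. The local realignment step is not shown.

```shell
# Print a GATK4 BaseRecalibrator command for one BAM file.
# Usage: bqsr_command <bam> <reference.fa> <known.vcf> [<known.vcf> ...]
# Paths containing spaces are not handled in this sketch.
bqsr_command() {
    bam=$1
    ref=$2
    shift 2
    cmd="gatk BaseRecalibrator -I $bam -R $ref"
    # Every remaining argument is passed as a known-sites resource
    for vcf in "$@"; do
        cmd="$cmd --known-sites $vcf"
    done
    # Name the recalibration table after the BAM file
    printf '%s -O %s\n' "$cmd" "${bam%.bam}.recal.table"
}

# Example:
# bqsr_command sample.bam hg38.fa dbsnp_138.hg38.vcf.gz \
#     Mills_and_1000G_gold_standard.indels.hg38.vcf.gz
```

Each VCF passed this way needs its index file (.tbi or .idx) next to it, which is why the downloads in this section fetch the index together with the VCF.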

# (2) variants
# 1) hg19
# 1> SNP
mkdir -p hg19/DBSNP && cd hg19/DBSNP
## https://www.ncbi.nlm.nih.gov/projects/SNP/
## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/
## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz &
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz.md5 &
nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/All_20180423.vcf.gz.tbi &


wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/dbsnp_138.b37.vcf.gz &

wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/hapmap_3.3.b37.vcf.gz &
# 2> INDEL

mkdir -p hg19/INDEL/ && cd hg19/INDEL/

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.gz.md5

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg19/1000G_phase1.snps.high_confidence.hg19.sites.vcf.idx.gz.md5
wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.idx.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz

wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
gunzip 1000G_phase1.indels.b37.vcf.idx.gz

gunzip 1000G_phase1.indels.b37.vcf.gz

gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.gz

gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz
# 2) hg38

# 1> SNP

# http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

nohup wget -c http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/ALL_20141222.dbSNP142_human_GRCh38.snps.vcf.gz &

nohup wget -c http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/other_mapping_resources/ALL_20141222.dbSNP142_human_GRCh38.snps.vcf.gz.tbi &
# ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/

nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_138.hg38.vcf.gz &

nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_138.hg38.vcf.gz.tbi &
# ftp://ftp.ncbi.nih.gov/snp/organisms/

nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz &

nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz.tbi &

nohup wget -c ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/All_20180418.vcf.gz.md5 &
nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz &

nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi &
# https://ftp.ncbi.nih.gov/snp/archive/b155/VCF/

cd hucy/hg38/dbsnp155

nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz &

nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.md5 &

nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi &

nohup wget -c https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.39.gz.tbi.md5 &
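# Wait for the background downloads to finish, then verify both files
wait
md5sum -c GCF_000001405.39.gz.md5
md5sum -c GCF_000001405.39.gz.tbi.md5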
# 2> Indel

# ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/

nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz &

nohup wget -c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi &
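Several of the downloads in this section come with an .md5 file. md5sum -c only works when the file name recorded inside the checksum file matches the local path, which may not hold for every source (this is an assumption, so inspect the .md5 file first). The sketch below compares the hashes directly instead; verify_md5 is a hypothetical helper and it assumes the md5sum command from GNU coreutils.

```shell
# verify_md5 FILE MD5FILE
# Compare the MD5 hash of FILE with the first field of MD5FILE,
# ignoring whatever file name or path MD5FILE records.
verify_md5() {
    expected=$(awk 'NR == 1 {print $1}' "$2") || return 1
    actual=$(md5sum "$1" | awk '{print $1}') || return 1
    if [ "$expected" = "$actual" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}

# Example (run after the background downloads have finished):
# verify_md5 All_20180418.vcf.gz All_20180418.vcf.gz.md5
```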

3. Downloading the databases required for ANNOVAR annotation

[Software introduction] How to use the ANNOVAR annotation software
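For orientation, ANNOVAR ships a script, annotate_variation.pl, whose -downdb mode fetches annotation databases. The sketch below only prints the download commands so they can be reviewed before running. The function name annovar_download_commands, the humandb/ target directory and the choice of refGene as an example database are assumptions, not taken from this post.

```shell
# Print one ANNOVAR download command per database name.
# Usage: annovar_download_commands <buildver> <database> [<database> ...]
annovar_download_commands() {
    buildver=$1
    shift
    for db in "$@"; do
        # -webfrom annovar fetches from the ANNOVAR mirror
        printf 'annotate_variation.pl -buildver %s -downdb -webfrom annovar %s humandb/\n' \
            "$buildver" "$db"
    done
}

# Example: annovar_download_commands hg38 refGene
```

Not every database is hosted on the ANNOVAR mirror, so check the ANNOVAR documentation for each database before relying on -webfrom annovar.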

4. Other databases

Download the annotation files from the GENCODE database.

mkdir -p ~/reference/gtf/gencode && cd ~/reference/gtf/gencode
## GRCh38 https://www.gencodegenes.org/releases/current.html
mkdir GRCh38_hg38 && cd GRCh38_hg38


nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz &

nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.2wayconspseudos.gtf.gz &

nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.long_noncoding_RNAs.gtf.gz &

nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.polyAs.gtf.gz &
nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gff3.gz &

nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.metadata.HGNC.gz &

nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.metadata.EntrezGene.gz &

nohup wget -c ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.metadata.RefSeq.gz &

## GRCh37
mkdir ~/reference/gtf/gencode/GRCh37_hg19 && cd ~/reference/gtf/gencode/GRCh37_hg19
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.annotation.gtf.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.metadata.HGNC.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.metadata.EntrezGene.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/GRCh37_mapping/gencode.v28lift37.metadata.RefSeq.gz
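Once an annotation file has been downloaded, a quick sanity check is to count the records of a given feature type. The sketch below does this for a gzipped GTF; count_gtf_features is a hypothetical helper name, and the check relies only on the standard GTF layout, in which the third tab-separated column holds the feature type.

```shell
# count_gtf_features FILE.gtf.gz TYPE
# Count the records whose third column equals TYPE (gene, transcript, exon, ...).
count_gtf_features() {
    # gzip -dc streams the decompressed file without writing it to disk
    gzip -dc "$1" | awk -F '\t' -v type="$2" '
        $0 !~ /^#/ && $3 == type { n++ }   # skip header comment lines
        END { print n + 0 }                # print 0 when nothing matched
    '
}

# Example: count_gtf_features gencode.v28.annotation.gtf.gz gene
```

A count of zero, or an error from gzip, suggests that the download is incomplete or that the file is not the one expected.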

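Further reading:
Introduction to hg19, GRCh37, b37 and hs37d5 and the differences between them
Learning whole-genome sequencing data analysis from scratch: Part 4, building the main WGS pipeline
Things about the genome (2)
Things about the genome (3): preparation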

NGS data analysis in practice: 02. Downloading reference genomes and annotation databases
