【软件介绍】ANNOVAR注释软件用法
变异检测得到的结果是检测样本的基因组序列与参考基因组序列之间的差异。本质上是一个将真实的变异从文库准备、样本富集、检测/测序和映射/比对产生的产物中分离出来的过程。想要进一步研究每一个变异的实际意义,需要将变异检测的结果和各种数据库进行关联,得到变异对应的基因、变异导致的氨基酸变化和变异的临床信息等生物学功能信息,这个过程叫做变异注释。本文详细介绍annovar注释工具的使用。
ANNOVAR是一个高效的开源注释工具,由perl编写,能够利用最新的数据来分析各种基因组中的SNP和Indel遗传变异,支持包括VCF在内的多种输入和输出文件格式。它能够从不同的基因组的变异位点对其进行功能注释(包括人类基因组hg18、hg19、hg38、以及鼠、蠕虫、酵母等),不过部分物种需要自己构建数据库。
主要包含三种不同的注释方法:gene-based, region-based 和filter-based。基于基因的注释(Gene-based Annotation)揭示variant与已知基因直接的关系以及对其产生的功能性影响,基于区域的注释(Region-based Annotation)揭示variant 与不同基因组特定区域的关系,例如:它是否落在转录因子结合区域等,基于筛选的注释(Filter-based Annotation)则分析变异位点是否位于指定的数据库中,比如dbSNP, 1000G,ESP 6500等数据库。
1. ANNOVAR的注册安装
注册(需使用机构邮箱) → 网站发送邮件 → 邮件返回下载地址 →下载/上传到服务器 → tar 解压安装
官网:https://annovar.openbioinformatics.org/en/latest/user-guide/download/
注册之后才能download,使用教育机构后缀的邮箱即可注册。如,.edu/.ac/.gov
ANNOVAR注册网址:https://www.openbioinformatics.org/annovar/annovar_download_form.php
该软件程序是用perl语言写的,所以可以作为独立程序运行于各个已经安装Perl的系统。解压直接用即可。
以下示例皆在linux系统中完成,且保证服务器有足够的存储空间。
cd ~/biosoft
mkdir annovar && cd annovar
wget http://www.openbioinformatics.org/annovar/download/***/annovar.latest.tar.gz #自行注册,获得下载地址;或迅雷等下载 + WinCP等上传
tar xvf annovar.latest.tar.gz
echo 'export PATH=~/biosoft/annovar:$PATH' >> ~/.bashrc
source ~/.bashrc
2. ANNOVAR的程序模块
ANNOVAR程序有以下几个模块:
# hucy 23:18:18 ~/biosoft/annovar
# ANNOVAR程序结构
tree -L 1
#.
#├── annotate_variation.pl #主程序,功能包括下载数据库,三种不同的注释
#├── coding_change.pl #可用来推断蛋白质序列
#├── convert2annovar.pl #将多种格式转为.avinput的程序
#├── example #存放示例文件
#├── humandb #人类注释数据库
#├── retrieve_seq_from_fasta.pl #用于自行建立其他物种的转录本
#├── table_annovar.pl #注释程序,可一次性完成三种类型的注释
#└── variants_reduction.pl #可用来更灵活地定制过滤注释流程
#2 directories, 6 files
# annovar 别名设置,简化命令行
vim ~/.bashrc
alias convert2annovar.pl="/home/hucy/biosoft/annovar/convert2annovar.pl"
alias annotate_variation.pl="/home/hucy/biosoft/annovar/annotate_variation.pl"
alias table_annovar.pl="/home/hucy/biosoft/annovar/table_annovar.pl"
alias coding_change.pl="/home/hucy/biosoft/annovar/coding_change.pl"
alias retrieve_seq_from_fasta.pl="/home/hucy/biosoft/annovar/retrieve_seq_from_fasta.pl"
alias variants_reduction.pl="/home/hucy/biosoft/annovar/variants_reduction.pl"
source ~/.bashrc
3. 下载人类变异注释数据库
ANNOVAR的安装包里自带了一些常用的数据库,在humandb/目录下。
在ANNOVAR的主页面有用于注释的各种数据库的描述,如,dbSNP、ExAC、ESP6500、cosmic、gnomad、1000genomes、clinvar、gwas,且提供多个不同参考基因组版本的下载,可根据需求自行下载(注意数据库可能时有更新)。我们默认他们做的工作都是准确无误的,毕竟自己去一个个下载数据库一个个格式化成自己需要的格式,也是不小的工作量。
# 常用数据库说明:
# refGene: RefSeq中所有带注释的转录本的FASTA序列
# KnownGene: UCSC已知基因中所有带注释的转录本的FASTA序列
# cytoBand:每个细胞间 band(cytogenetic band)的染色体坐标信息
# snp138:dbSNP with ANNOVAR index files
# avsnp138-avsnp150:重新格式化的dbSNP数据集
# cosmic: cosmic数据库中的癌症突变信息
# esp6500siv2:ESP是由NHLBI资助的外显子测序项目,是从6000多个个体中识别外显子区域的遗传变异
# 1000g2015aug:alternative allele frequency data in 1000 Genomes Project(version August 2015)
# exac03:ExAC 65000 exome allele frequency data for ALL, AFR (African), AMR (Admixed American), EAS (East Asian), FIN (Finnish), NFE (Non-finnish European), OTH (other), SAS (South Asian)). version 0.3. Left normalization done
# ljb26_all:whole-exome SIFT, PolyPhen2 HDIV, PolyPhen2 HVAR, LRT, MutationTaster, MutationAssessor, FATHMM, MetaSVM, MetaLR, VEST, CADD, GERP++, PhyloP and SiPhy scores from dbNSFP version 2.6
# clinvar_20200316:CLINVAR database with Variant Clinical Significance (unknown, untested, non-pathogenic, probable-non-pathogenic, probable-pathogenic, pathogenic, drug-response, histocompatibility, other) and Variant disease name
官网下载中心:https://annovar.openbioinformatics.org/en/latest/user-guide/download/
# help
annotate_variation.pl
#Example: #download annotation databases from ANNOVAR or UCSC and save to humandb/ directory
# annotate_variation.pl -downdb -webfrom annovar refGene humandb/
# annotate_variation.pl -downdb -buildver mm9 refGene mousedb/
# annotate_variation.pl -downdb -buildver hg19 -webfrom annovar esp6500siv2_all humandb/
# -downdb表明该命令的用途是下载数据库
# -webform annovar 从annovar提供的镜像下载(不加此参数将默认从ucsc下载)
# -buildver指定基因组版本(默认为hg18)
# refGene代表的是下载的数据库的名字
# humandb/表示数据库存储的路径
# database
# 数据库时有更新,注意官网最新公示。数据库名字,在 https://annovar.openbioinformatics.org/en/latest/user-guide/download/ 全部列出。
# 下载方式:①annotate_variation.pl -downdb命令下载;②根据命令提示的链接,利用wget下载;③迅雷等下载,WinCP上传。
# ①-downdb下载
annovar_dir=/home/hucy/biosoft/annovar
annotate_variation.pl -downdb -webfrom annovar --buildver hg38 refGene $annovar_dir/humandb/
annotate_variation.pl -downdb -webfrom annovar --buildver hg38 avsnp150 $annovar_dir/humandb/
annotate_variation.pl -downdb -webfrom annovar --buildver hg38 exac03 $annovar_dir/humandb/
annotate_variation.pl -downdb -webfrom annovar --buildver hg38 esp6500siv2_all $annovar_dir/humandb/
annotate_variation.pl -downdb -webfrom annovar --buildver hg38 clinvar_20160302 $annovar_dir/humandb/
annotate_variation.pl -downdb -webfrom annovar --buildver hg38 1000g2014oct $annovar_dir/humandb/
annotate_variation.pl -downdb -webfrom annovar --buildver hg38 1000g2015aug $annovar_dir/humandb/
# alternative allele frequency data in 1000 Genomes Project for autosomes (ALL, AFR (African), AMR (Admixed American), EAS (East Asian), EUR (European), SAS (South Asian)). Based on 201409 collection v5 (based on 201305 alignment) but including chrX and chrY data finally!
# 下载失败
# $annotate_variation.pl -downdb -webfrom annovar --buildver hg38 avsnp150 $annovar_dir/humandb/
# WARNING: cannot retrieve remote files automatically (by 'wget' command or by standard Net::FTP/LWP::UserAgent Perl module).
# Please manually download the following file, uncompress the files to /home/hucy/biosoft/annovar//humandb directory, then add a hg38_ prefix to the file names.
# http://www.openbioinformatics.org/annovar/download/hg38_avsnp150.txt.gz
# http://www.openbioinformatics.org/annovar/download/hg38_avsnp150.txt.idx.gz
# ②wget下载
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_refGene.txt.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_refGeneMrna.fa.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_refGeneVersion.txt.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_avsnp150.txt.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_avsnp150.txt.idx.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_exac03.txt.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_exac03.txt.idx.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_esp6500siv2_all.txt.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_esp6500siv2_all.txt.idx.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20160302.txt.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_clinvar_20160302.txt.idx.gz &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_1000g2014oct.zip &
nohup wget -c http://www.openbioinformatics.org/annovar/download/hg38_1000g2015aug.zip &
# wget -t 1 -T 30 -q -O hg19_gnomad_genome.txt.gz http://www.openbioinformatics.org/annovar/download/hg19_gnomad_genome.txt.gz
# -t 设置重试次数,-T 设置超时为SECONDS秒,-q 静默下载(无信息输出),-O 将文档写入FILE
# 均下载失败
# ③迅雷等下载,WinCP上传
cd ~/biosoft/annovar/humandb
gunzip -k hg38*
# mkdir humandbgz
# mv *.gz ~/biosoft/annovar/humandb/humandbgz/
展示一下我自己下载的:
再展示一下我去隔壁用户那里顺的(符号/软链接):
4. ANNOVAR输入格式转换
(1) 输入文件格式
ANNOVAR使用.avinput格式,用空格或者制表符分隔,最少需要5列,分别代表①染色体(Chromosome),②起始位置(Start),③终止位置(End),④参考等位基因(Reference Allele),⑤替代等位基因(Alternative Allele),其他的列作为额外补充信息(可选)。插入或者删除以-表示, “0” 代表只指定position,而不指定实际核苷酸。
文件示例:
# 示例1:
cat example/ex1.avinput
# 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
# 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
# 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
# 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
# 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
# 1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
# 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
# 1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
# 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
# 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
# 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
# 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
# 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
# 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
# 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
# 示例2:
cat test.avinput
# chr1 69897 69897 T C hom 38.56 2
# chr1 873251 873251 G A hom 3291.77 92
# chr1 873542 873542 G A hom 2996.77 68
# chr1 873548 873548 C T hom 3030.77 69
# chr1 917495 917495 C T hom 8591.77 245
# chr1 917584 917584 T G hom 9016.77 244
# chr1 930939 930939 G A hom 522.77 15
# chr1 935954 935954 G T hom 1655.77 49
# chr1 941119 941119 A G hom 1080.77 31
# chr1 942335 942335 C G hom 755.77 22
(2) 输入文件格式转换
ANNOVAR主要使用convert2annovar.pl
程序进行转换,转换后文件是精简过的,主要包含前面提到的5列内容,如果要将原格式文件的所有内容都包含在转换后的.avinput
文件中,可以使用-includeinfo
参数;如果需要分开每个sample输出单一的.avinput
文件,可以使用-allsample
参数,等等。对于含有多个样本的vcf文件,格式转换时只会取其第一个样本进行注释,也就是说即使别的样本在这个位点有变异,只要第一个样本在某个位点没有变异转换时就会将这个位点去掉不会出现在注释文件中。如果想要得到所有样本的变异位点的注释的话,可以先将其拆分为几个样本的注释输入文件。
ANNOVAR主要支持以下等格式转换:VCF、SAMtools pileup format、Complete Genomics format、GFF3-SOLiD calling format、SOAPsnp calling format、MAQ calling format、CASAVA calling format……
# help
convert2annovar.pl
# Example: convert2annovar.pl -format pileup -outfile variant.query variant.pileup
# convert2annovar.pl -format cg -outfile variant.query variant.cg
# convert2annovar.pl -format cgmastervar variant.masterVar.txt
# convert2annovar.pl -format gff3-solid -outfile variant.query variant.snp.gff
# convert2annovar.pl -format soap variant.snp > variant.avinput
# convert2annovar.pl -format maq variant.snp > variant.avinput
# convert2annovar.pl -format casava -chr 1 variant.snp > variant.avinput
# convert2annovar.pl -format vcf4 variantfile > variant.avinput
# convert2annovar.pl -format vcf4 -filter pass variantfile -allsample -outfile variant
# convert2annovar.pl -format vcf4old input.vcf > output.avinput
# convert2annovar.pl -format rsid snplist.txt -dbsnpfile snp138.txt > output.avinput
# convert2annovar.pl -format region -seqdir humandb/hg19_seq/ chr1:2000001-2000003 -inssize 1 -delsize 2
# convert2annovar.pl -format transcript NM_022162 -gene humandb/hg19_refGene.txt -seqdir humandb/hg19_seq/
# 示例1:VCF格式
convert2annovar.pl -format vcf4 example/ex2.vcf -out ex2
convert2annovar.pl -format vcf4 -allsample -withfreq final.vcf.gz > SNP.final.avinput #多样本
# convert2annovar.pl -format vcf4 example/ex2.vcf > ex2.avinput
# -format vcf4 指定格式为vcf
# -out ex2 指定输出文件前缀
# -includeinfo 会保留vcf文件中的所有信息(可选)
# -comment 会保留vcf文件的头部注释信息(以#开头的行,可选)
# -allsample 转换格式时vcf中的每一个样本会单独生成一个待注释的vcf文件(可选)
# 查看ex2.vcf文件
cat example/ex2.vcf
# ##fileformat=VCFv4.0
# ##fileDate=20090805
# ##source=myImputationProgramV3.1
# ##reference=1000GenomesPilot-NCBI36
# ##phasing=partial
# ##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
# ##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
# ##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
# ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
# ##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
# ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
# ##FILTER=<ID=q10,Description="Quality below 10">
# ##FILTER=<ID=s50,Description="Less than 50% of samples have data">
# ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
# ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
# ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
# ##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
# #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
# 16 50745926 rs2066844 C T 80 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
# 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
# 20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
# 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
# 20 1230237 . T G 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
# 20 1230288 . T . 50 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
# 20 1234567 microsat1 GTCT G,GTACT 50 PASS NS=3;DP=9;AA=G GT:GQ:DP 0/1:35:4 0/2:17:2 1/1:40:3
# 查看ex2.avinput文件
cat ex2.avinput
# 20 1110696 1110696 A G het 67 6
# 20 1110696 1110696 A T het 67 6
# 20 1234568 1234570 TCT - het 50 4
# 示例2:选择一些snp位点进行注释,可用-format rsid选项
convert2annovar.pl -format rsid example/snplist.txt -dbsnpfile humandb/hg19_snp138.txt > snplist.avinput
#查看snplist.txt文件
cat example/snplist.txt
### snplist.txt内容如下
# rs74487784
# rs41534544
# rs4308095
# rs12345678
#查看snplist.avinput文件
cat snplist.avinput
### snplist.avinput内容如下
# chr2 186229004 186229004 C T rs4308095
# chr7 6026775 6026775 T C rs41534544
# chr7 6777183 6777183 G A rs41534544
# chr9 3901666 3901666 T C rs12345678
# chr22 24325095 24325095 A G rs74487784
# 若需注释的为plink生成的文件(假设有666个SNP):genotype.bim
cat genotype.bim | head
# 1 rs36096196 0 2252205 T C
# 1 rs17035646 0 10796547 G A
# 1 rs880315 0 10796866 T C
# 1 rs4846049 0 11850365 T G
# 1 rs12027135 0 25775733 T A
# 1 rs10890238 0 38445654 T A
# 1 rs61776719 0 38461319 A C
# 1 rs2296172 0 39835817 G A
# 1 rs11205760 0 51174330 C T
# 1 rs11206510 0 55496039 C T
cat genotype.bim | awk '{print $2}' > snp666.txt
cat snp666.txt | head
# rs36096196
# rs17035646
# rs880315
# rs4846049
# rs12027135
# rs10890238
# rs61776719
# rs2296172
# rs11205760
# rs11206510
convert2annovar.pl -format rsid snp666.txt -dbsnpfile humandb/hg19_snp138.txt > SNP666.avinput
注:
1).vcf文件在格式转换时,若突变位点有两个不同的等位基因则在结果文件中会分两行放。
2).在注释时,遇到格式不符合的行会跳过继续注释而不是终止注释,最后那些格式不符合的行会生成另一个文件(*.invalid_input)。
5. ANNOVAR注释功能
annovar提供了两个脚本以供注释使用:annotate_variation.pl一次注释一个数据库,table_annovar.pl一次注释多个数据库。
(1) table_annovar.pl(可一次完成基于基因、区域和filter三种类型的注释)
使用ANNOVAR最简单的方法就是使用table_annovar.pl进行注释,它的输入文件可以是多种格式包括VCF,输出文件已Tab分隔,每一列代表着一种注释。
注释命令示例:
# hg19
table_annovar.pl hg19.variant.avinput ~/biosoft/annovar/humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2015aug_all,1000g2015aug_eur,exac03,avsnp147,dbnsfp30a -operation g,r,r,f,f,f,f,f,f -nastring . -csvout
# -buildver hg19 表示使用hg19版本
# -out myanno 表示输出文件的前缀为myanno
# -remove 表示删除注释过程中的临时文件
# -protocol 表示注释使用的数据库,用逗号隔开,且要注意顺序
# -operation 表示对应顺序的数据库的类型(g代表gene-based、r代表region-based、f代表filter-based),用逗号隔开,注意顺序
# -nastring . 表示用点号替代缺省的值
# -csvout 表示最后输出.csv文件
# hg38
table_annovar.pl SNP.final.avinput $annovar_dir/humandb/ -buildver hg38 -out SNP.final -remove -protocol refGene,avsnp150,exac03,esp6500siv2_all,EAS.sites.2014_10,EAS.sites.2015_08,ALL.sites.2014_10,ALL.sites.2015_08,clinvar_20160302 -operation g,f,f,f,f,f,f,f,f -nastring . -csvout &
输出的csv文件将包含输入的5列主要信息以及各个数据库里的注释,此外,table_annoval.pl可以直接对vcf文件进行注释(不需要转换格式),注释的内容将会放在vcf文件的“INFO”那一栏。
注释结果示例:
(2) annotate_variation.pl
Annotate_variation.pl的注释方式分为三种:1) Gene-based annotation;2) Region-based annotation;3) Filter-based annotation
基于基因的注释, exonic、splicing、ncRNA、UTR5、UTR3、intronic、upstream、downstream、intergenic,使用 geneanno 子命令。基于基因的注释是想搞清楚自己的vcf文件记录的突变位点,是否在基因组的哪些功能元件(主要是外显子)上面。
基于区域的注释, cytoBand、TFBS、SV、bed、GWAS、ENCODE、enhancers、repressors、promoters ,使用regionanno 子命令。只考虑位点坐标
基于数据库的过滤, dbSNP、ExAC、ESP6500、cosmic、gnomad、1000genomes、clinvar 使用 filter 子命令。考虑位点坐标同时关心突变碱基情况。
annotate_variation.pl -geneanno -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 example/ex1.avinput humandb/
annotate_variation.pl -filter -dbtype 1000g2014oct_all -buildver hg19 example/ex1.avinput humandb/
#三种命令示例,使用package自带数据进行注释,分别对应三种注释方式
# -geneanno 表示使用基于基因的注释
# -region表示基于区域的注释
# -filter 表示使用基于筛选的注释
1) 基于基因的注释(Gene-based Annotation)
Gene-based annotation是根据SNPs以及CNVs的位置信息来确定是否会造成编码序列以及开放阅读框的改变从而影响氨基酸的改变,使用者可以自主选择RefSeq genes, 包括UCSC genes, ENSEMBL genes, GENCODE genes, AceView genes等来进行注释。注释后会生成两个文件:ex1.variant_function 和 ex1.exonic_variant_function。揭示variant与已知基因直接的关系以及对其产生的功能性影响。
命令示例:
# hg19
annotate_variation.pl -geneanno -dbtype refGene -out ex1 -build hg19 example/ex1.avinput humandb/
# -geneanno 表示使用基于基因的注释
# -dbtype refGene 表示使用"refGene"数据库
# -out ex1 表示输出文件以ex1为前缀
# annotate_variation.pl默认使用gene-based注释类型以及refGene数据库,命令可以缺省-geneanno -dbtype
# hg38
annotate_variation.pl -geneanno -dbtype refGene --outfile SNP.final.gene -buildver hg38 SNP.final.avinput humandb/
ex1.variant_function 注释所有变异所在基因及位置。第1列为变异所在的类型,如外显子等,第2列是对应的基因名(若有多个基因名用,隔开),第3-7列为输入的那5列主要信息,剩余为注释信息。如果变异找到多种注释,ANNOVAR将根据优先权重进行比较(见下表),取最优的表示,可以使用-seperate参数列出该变异所有注释。
#ex1.variant_function
cat ex1.variant_function
# UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
# UTR3 ATAD3C(NM_001039211:c.*91G>T) 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
# splicing NPHP4(NM_001291593:exon19:c.1279-2T>A,NM_001291594:exon18:c.1282-2T>A,NM_015102:exon22:c.2818-2T>A) 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
# intronic DDR2 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
# intronic DNASE2B 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
# intergenic LOC645354(dist=11566),LOC391003(dist=116902) 1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
# intergenic UBIAD1(dist=55105),PTCHD2(dist=135699) 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
# intergenic LOC100129138(dist=872538),NONE(dist=NONE) 1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
# exonic IL23R 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
# exonic ATG16L1 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
# exonic NOD2 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
# exonic NOD2 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
# exonic NOD2 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
# exonic GJB2 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
# exonic CRYL1,GJB6 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
# ex1.variant_function
# 为方便了解每一列的含义,只输出第一行,然后把制表符改成换行符
cat ex1.variant_function | head -n 1 | tr '\t' '\n'
### 第一行内容如下
# UTR5
# ISG15(NM_005101:c.-33T>C)
# 1
# 948921
# 948921
# T
# C
# comments: rs15842, a SNP in 5' UTR of ISG15
# 第一个文件包括对于所有突变的注释,通过在文件最前面加入两列,以tab分割
# 第一列为变异所在基因位置的类型,如外显子,内含子,UTR5,UTR3,基因间等
# 第二列为对第一列的描述信息,详情见下
来源: https://annovar.openbioinformatics.org/en/latest/user-guide/gene/
ex1.exonic_variant_function 详细注释外显子区域的变异功能、类型、氨基酸改变等。第1列为.variant_function文件中该变异所在行号,第2列为变异功能性后果,如外显子改变导致的氨基酸变化,阅读框移码,无义突变,终止突变等,第3列包括基因名称、转录识别标志和相应的转录本的序列变化,第4-9列为输入文件内容。
#ex1.exonic_variant_function
cat ex1.exonic_variant_function
# line9 nonsynonymous SNV IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
# line10 nonsynonymous SNV ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A, 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
# line11 nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W, 16 50745926 50745926 C comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
# line12 nonsynonymous SNV NOD2:NM_022162:exon8:c.G2722C:p.G908R,NOD2:NM_001293557:exon7:c.G2641C:p.G881R, 16 50756540 50756540 G comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
# line13 frameshift insertion NOD2:NM_022162:exon11:c.3017dupC:p.A1006fs,NOD2:NM_001293557:exon10:c.2936dupC:p.A979fs, 16 50763778 5076377comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
# line14 frameshift deletion GJB2:NM_004004:exon2:c.35delG:p.G12fs, 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss line15 frameshift deletion GJB6:NM_001110221:wholegene,GJB6:NM_001110220:wholegene,GJB6:NM_001110219:wholegene,CRYL1:NM_015974:wholegene,GJB6:NM_006783:wholegene, 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
# ex1.exonic_variant_function
# 为方便了解每一列的含义,只输出第一行,然后把制表符改成换行符
cat ex1.exonic_variant_function | head -n 1 | tr '\t' '\n'
### 第一行内容如下
# line9
# nonsynonymous SNV
# IL23R:NM_144701:exon9:c.G1142A:p.R381Q,
# 1
# 67705958
# 67705958
# G
# A
# comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
# 查看有多少种gene variant
cat ex1.exonic_variant_function |cut -f2|cut -d" " -f1|sort |uniq -c |sort -nr
# 结果
# 4 nonsynonymous
# 4 frameshift
# 1 stoploss
# 1 stopgain
# 1 nonframeshift
# 第二个输出文件以.exonic_variant_function结尾,只列出外显子(氨基酸会改变)的变异
# 第一列为第一个文件中该变异所在的行号;
# 第二列为该变异的功能性后果,如外现在改变导致的氨基酸变化,阅读框移码等,详情见下
# 第三列为基因名称,转录识别标志和相应的转录本的序列变化
# 第4-9列为原输入文件内容
2) 基于区域的注释(Region-based Annotation)
其与Gene-based annotation作用相反,它是用来确认在特定区域的突变造成的影响。揭示variant与不同基因组特定区域的关系,例如:它是否落在已知的保守基因组区域(conserved genomic region),预测的转录因子结合区域(transcription factor binding site),基因重复区域(segmental dupliaction),GWAS分析区域等,还可以注释染色体坐标(cytoBand)。基于区域的注释的数据库一般由UCSC提供。此处以Conserved genomic elements annotation为例介绍region-based annotation的使用:
命令示例:
#数据库下载
annotate_variation.pl -build hg19 -downdb phastConsElements46way humandb/
#NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
#NOTICE: Downloading annotation database http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/phastConsElements46way.txt.gz ... OK
#NOTICE: Uncompressing downloaded files
#NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory
#使用下载数据库进行注释
annotate_variation.pl -regionanno -build hg19 -out ex1 -dbtype phastConsElements46way example/ex1.avinput humandb/
#NOTICE: Reading annotation database humandb/hg19_phastConsElements46way.txt ... Done with 5163775 regions
#NOTICE: Finished region-based annotation on 12 genetic variants in ex1.hg19.avinput
#NOTICE: Output files were written to ex1.hg19_phastConsElements46way
# -regionanno 表示使用基于区域的注释
# -dbtype phastConsElements46way 表示使用"phastConsElements46way"数据库,注意需要使用Region-based的数据库
#输出文件
cat ex1.hg19_phastConsElements46way
#phastConsElements46way Score=387;Name=lod=50 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
#phastConsElements46way Score=420;Name=lod=68 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
#phastConsElements46way Score=385;Name=lod=49 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
#phastConsElements46way Score=395;Name=lod=54 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
#phastConsElements46way Score=545;Name=lod=218 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss
#输出文件:输出的注释文件第1列为“phastConsElements46way”,对应注释的类型,这里的phastCons 46-way alignments属于保守的基因组区域的注释;
#第二列包含评分和名称,评分来自UCSC,可以使用--score_threshold和--normscore_threshold来过滤评分低的变异,“Name=lod=x”名称表示该区域的名称;
#剩余的部分为输入文件的内容。
输出的注释文件第1列为注释文件库名,这里的phastCons 46-way alignments属于保守的基因组区域的注释,第二列包含评分和名称,可以使用-score_threshold和-normscore_threshold来过滤评分低的变异,剩余部分为输入文件的内容。
3) 基于筛选的注释(Filter-based Annotation)
Filter-based annotation是用以确认已记录在特定数据库里的突变。例如想要知道突变是否为novel variation就需要知道该突变是否存在于dbSNP库里,它在1000 genome project里面等位基因频率怎样,以及计算一系列突变项目得分并加以过滤。它区别于region-based annotation就在于它针对突变碱基进行工作,而region-based annotation 针对染色体位置。举例来说就是region-based比对chr1:1000-1000而filter-based比对chr1:1000-1000上的A->G。
它拥有多种数据库,包括针对全基因组测序的突变频率,针对全外显子数据测序的突变频率,在孤立或者小类群人群中的突变频率,全基因组数据突变的功能预测,全外显子组突变的功能预测,剪切变异体的功能预测,疾病相关突变,突变确认等。
下面给介绍常用的两种过滤注释:
运行命令后,已存在于数据库中的变异写入*.droped文件,在数据库中不存在的变异信息将会被写入到*filtered文件。
1> 1000 Genomes Project annotations
命令示例:
annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 -out ex1 example/ex1.avinput humandb/
#NOTICE: Variants matching filtering criteria are written to ex1.hg19_EUR.sites.2012_04_dropped, other variants are written to ex1.hg19_EUR.sites.2012_04_filtered
#NOTICE: Processing next batch with 15 unique variants in 15 input lines
#NOTICE: Database index loaded. Total number of bins is 2766067 and the number of bins to be scanned is 12
#NOTICE: Scanning filter database humandb/hg19_EUR.sites.2012_04.txt...Done
#查看数据格式
cat ex1.hg19_EUR.sites.2012_04_dropped
#1000g2012apr_eur 0.04 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
#1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
#1000g2012apr_eur 0.81 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
#1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
#1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
#1000g2012apr_eur 0.96 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
#1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
#1000g2012apr_eur 0.01 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
#1000g2012apr_eur 0.01 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
#1000g2012apr_eur 0.53 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
# -filter 使用基于过滤的注释
# -dbtype 1000g2012apr_eur 使用"1000g2012apr_eur"数据库
输出的注释文件第1列为注释文件库名,第二列为等位基因频率,可以使用-maf 0.05 参数来过滤掉低于0.05的变异,也可以使用-maf 0.05 -reverse 参数来过滤掉高于0.05的变异,推荐使用-score_threshold 参数来过滤ALT等位基因的频率,剩余部分为输入文件的内容。
2> dbSNP annotations
通过dbsnp annotation, annovar可以确认已经出现在dbSNP数据库里面的突变并且注释SNP identifiers命令如下:
#下载dbsnpp138数据库
annotate_variation.pl -downdb -buildver hg19 -webfrom annovar snp138 humandb
#NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
#NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_snp138.txt.gz ... OK
#NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_snp138.txt.idx.gz ... OK
#NOTICE: Uncompressing downloaded files NOTICE: Finished downloading annotation files for hg18 build version, with files saved at the 'humandb' directory
#使用dbsnp138注释
annotate_variation.pl -filter -out ex1 -build hg19 -dbtype snp138 example/ex1.avinput humandb/
#NOTICE: Variants matching filtering criteria are written to ex1.hg19_snp138_dropped, other variants are written to ex1.hg19_snp138_filtered
#NOTICE: Processing next batch with 15 unique variants in 15 input lines
#NOTICE: Database index loaded. Total number of bins is 2858459 and the number of bins to be scanned is 12
#NOTICE: Scanning filter database humandb/hg19_snp138.txt...Done
#输入dropped file
cat ex1.hg19_snp138_dropped
#snp138 rs35561142 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
#snp138 rs149123833 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
#snp138 rs1000050 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
#snp138 rs1287637 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
#snp138 rs11209026 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
#snp138 rs6576700 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
#snp138 rs15842 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
#snp138 rs80338939 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
#snp138 rs2066844 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
#snp138 rs2066845 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
#snp138 rs2066847 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
#snp138 rs2241880 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
#*dropped文件
#第一列如region-based注释的结果一样以数据库命名;
#第二列为已经在数据库的突变的indentifier号;
#第三列开始同样是输入文件的内容。
参考阅读:
1. annovar官网:https://annovar.openbioinformatics.org/en/latest/user-guide/download/
2. annovar使用笔记 https://www.jianshu.com/p/84c818207240
3. ANNOVAR-注释软件用法详解 https://zhuanlan.zhihu.com/p/133832384
4. 生信技能树 - ANNOVAR 软件用法还可以更复杂
5. ANNOVAR的使用 https://www.jianshu.com/p/95331e7a98cd
6. 下载cosmic数据库以及转换为annovar可识别的格式 https://blog.csdn.net/Cassiel60/article/details/107004201/