Reference genomes for laboratory mice
The project was completed in 2013, and all of its data are openly available for download:

SNP and indel calls for Version 3 can be found here:
ftp://ftp-mouse.sanger.ac.uk/REL-1303-SNPs_Indels-GRCm38/

SNP and indel calls for 18 mouse genomes are provided as a single compressed VCF file (bgzip), along with an index file generated by 'tabix' (*.tbi).

Sequencing data volume

Short name  Full strain name  Average sequencing depth
129P2       129P2/OlaHsd      42
129S1       129S1/SvImJ       55
129S5       129S5SvEvBrd      18
AJ          A/J               38
AKR         AKR/J             40
BALBcJ      BALB/cJ           52
C3HHeJ      C3H/HeJ           49
C57BL6NJ    C57BL/6NJ         48
CASTEiJ     CAST/EiJ          39
CBAJ        CBA/J             43
DBA2J       DBA/2J            42
FVBNJ       FVB/NJ            61
LPJ         LP/J              41
NODShiLtJ   NOD/ShiLtJ        48
NZO         NZO/HlLtJ         58
PWKPhJ      PWK/PhJ           39
Spretus     SPRET/EiJ         53
WSBEiJ      WSB/EiJ           38

Reference genome

All SNP and indel calls are relative to the reference mouse genome C57BL/6J (GRCm38). A version of the reference genome can be found here:
ftp://ftp-mouse.sanger.ac.uk/ref/

dbSNP annotation

SNPs and indels are annotated with rs IDs from dbSNP Build 137. The dbSNP data was downloaded from:
ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/mouse_10090/VCF/
and the 'vcf-annotate' Perl utility from the VCFtools package (Danecek et al, 2011) was used to add the rsIDs to calls in this release. (See below for VCFtools information.)

For SNPs, the position, reference allele and alternative alleles were all compared:
eg: vcf-annotate -c CHROM,POS,ID,REF,ALT

For indels, only positions were matched:
eg: vcf-annotate -c CHROM,POS,ID

Variant-calling pipeline

# Sequence Data

Sequencing was performed using the Illumina HiSeq platform. All reads are 100bp paired-end reads, except for strains 129P2 and 129S4. All mice were female and therefore SNPs and indels were called on chromosomes 1-19 and X only. The BAM files used to call SNPs and indels are located in this directory:
ftp://ftp-mouse.sanger.ac.uk/REL-1302-BAM-GRCm38/

# Methods in brief

Reads were aligned to the reference genome (GRCm38) using BWA version 0.5.9-r16 (Li and Durbin, 2009). SNP and indel discovery was performed with the SAMtools mpileup function and calling was performed with the BCFtools view function (Li H, 2011).
The vcf-annotate function in the VCFtools package (Danecek et al, 2011) was used to soft-filter the SNP and indel calls.

The Variant Effect Predictor software from Ensembl (McLaren et al., 2010) was used to predict the functional consequences of SNPs and indels, queried against Ensembl release 70 mouse gene models. Definitions of consequence types can be found here:
http://www.ensembl.org/info/docs/variation/predicted_data.html#consequences

Indel calling was performed on each strain independently. The calls from all 18 strains were then merged into a single VCF file. SNP calls were also made independently for each strain initially. Then, a single list of all high confidence polymorphic sites across the genome was produced from all 18 strains. This list was then used to call SNPs again, this time across all 18 strains simultaneously, using the 'samtools mpileup -l' option. This process generates both reference-only genotype calls as well as calls with non-reference bases across the 18 strains.

Information regarding the filtering of SNP and indel calls is located in the VCF file headers in the '##FILTER' and '##source_xxxxxx=vcf-annotate' lines.

The resulting standard VCF variant files

Because the reference genome is C57BL/6J, the closely related substrain C57BL/6NJ is expected to show very few variant sites.

Per-strain statistics

Strain        SNPs      ts/tv  Private SNPs  %Private  ts/tv (Private SNPs)  INDELs   Private INDELs  %Private
129P2/OlaHsd  5333940   2.03   24247         0.45%     1.95                  869453   35585           4.09%
129S1/SvImJ   5197051   2.03   1696          0.03%     1.67                  1018654  30217           2.97%
129S5SvEvBrd  4929566   2.07   4134          0.08%     1.45                  678932   9094            1.34%
A/J           4893229   2.02   42833         0.88%     2.07                  922256   28695           3.11%
AKR/J         4896783   2.06   84307         1.72%     2.12                  931552   39740           4.27%
BALB/cJ       4578862   2.01   29733         0.65%     2.04                  924897   34178           3.70%
C3H/HeJ       5093947   2.02   15371         0.30%     1.89                  1014687  31161           3.07%
C57BL/6NJ     15946     0.98   1522          9.54%     1.7                   20852    1646            7.89%
CAST/EiJ      20626644  2.04   5785024       28.05%    2.1                   3062289  1006241         32.86%
CBA/J         5223690   2.02   34464         0.66%     2.02                  1014449  34911           3.44%
DBA/2J        5169730   2.02   73319         1.42%     2.13                  981471   40955           4.17%
FVB/NJ        4836968   2.03   133983        2.77%     2.12                  968398   54942           5.67%
LP/J          5440597   2.03   53756         0.99%     2.09                  1024149  36083           3.52%
NOD/ShiLtJ    5101268   2.04   124970        2.45%     2.1                   970497   52166           5.38%
NZO/HlLtJ     5335807   2.03   214884        4.03%     2.13                  1046653  82356           7.87%
PWK/PhJ       20268163  2.03   5016466       24.75%    2.1                   3044259  909692          29.88%
SPRET/EiJ     41742349  1.94   25792444      61.79%    1.94                  5077387  3279813         64.60%
WSB/EiJ       7079907   2.03   915416        12.93%    2.12                  1414664  233808          16.53%

Understanding the VCF files

# VCF specification and VCFtools

The VCF file format specification can be found here:
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

http://vcftools.sourceforge.net/
The VCFtools software package (Danecek et al, 2011) can be used to query, compare, and annotate VCF files.

# Notes regarding the mgp.v3 VCF files

- Information regarding the filters applied to the calls is located in the VCF file header lines at the beginning of the file, marked with a hash '#' at the beginning of the line.

- Genotypes (GT):
  - '.' = no genotype call was made
  - '0/0' = genotype is the same as the reference genome
  - '1/1' = homozygous alternative allele; can also be '2/2', '3/3', etc. if more than one alternative allele is present.
  - '0/1' = heterozygous genotype; can also be '1/2', '0/2', etc.

- FILTER column and high and low confidence calls:

  High and low confidence genotype calls are distinguished by the 'FI' tag in the FORMAT column for each sample.

  eg: in the sample columns NODShiLtJ and NZO:
      1/1:99:31:0:255,74,0:1    0/0:.:1:0:0,.,.:0
  which corresponds to the tags in the FORMAT column
      GT:GQ:DP:SP:PL:FI

  In the NODShiLtJ column the genotype is '1/1' and the 'FI' tag is '1', indicating the genotype call passed all filters and is high-confidence. In NZO, the genotype is the same as the reference genome, however 'FI' is '0', meaning the call failed one or more filters and the call is low-confidence.

  NOTES:
  All heterozygous calls have been marked as low confidence with the 'FI' tag set to '0'. 'Het' has also been added to the FILTER column.

  A site is annotated with PASS in the FILTER column only if ALL strains with a genotype call (including 0/0 genotype calls) at that site pass all filter criteria. If one or more calls does NOT pass filtering, filters which the calls have failed are listed in the FILTER column, and the 'FI' tags are set to '0' for the failed sample calls. No-call sites, marked as '.', are not included.

  eg: FORMAT is GT:GQ:DP:SP:PL:FI
  (a) MinDP  1/1:7:3:0:22,0,4:0  .  1/1:99:45:255:74,0:1  .
  (b) PASS   1/1:99:31:0:255,74,0:1  .  .  1/1:99:45:0:255,50,0:1

  In example (a), there are 2 no-calls ('.'), the first sample failed the MinDP filter, and the third sample passed all filters. The FILTER column is set to 'MinDP'. In example (b), there are also 2 no-calls, and the first and fourth samples passed all filters. The FILTER column is set to PASS.

- Functional consequences

  Ensembl now uses consequence terms defined by the Sequence Ontology (SO) by default.
  All definitions of the predicted functional consequences can be found here:
  http://www.ensembl.org/info/docs/variation/predicted_data.html#consequences

  In our release VCF files, predicted functional consequences are indicated by the 'CSQ' field in the INFO tag. Where no 'CSQ' tag is present, the SNP or indel is classified under the SO term 'intergenic_variant'.

- Multiple alternative alleles and consequences

  In cases where different strains have different alternative alleles which have different consequences, they can be distinguished by checking the 'Allele' in the 'CSQ' line.

  eg: Alternative alleles = G,T and
      CSQ=ENSMUST00000047577:ENSMUSG00000042414:missense_variant:601:201:A>P:Grantham,27:Allele,G:Gene,Prdm14+ENSMUST00000047577:ENSMUSG00000042414:missense_variant:601:201:A>T:Grantham,58:Allele,T:Gene,Prdm14

  The strain with GT='1/1' is G/G and has an A>P amino acid substitution, and the strain with GT='2/2' is T/T and has an A>T amino acid substitution.

Beginners should expect to spend 13 hours studying this database carefully.
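To make the README's conventions concrete, here is a minimal Python sketch of how the per-sample 'FI' tag, the site-level PASS rule and the per-allele 'CSQ' entries could be read. It is an illustration only, not part of the release or of VCFtools. The function names are hypothetical, and the code assumes that sample columns use the FORMAT shown above (GT:GQ:DP:SP:PL:FI), that CSQ entries are joined by '+' with ':' between sub-fields as in the Prdm14 example, and that genotypes are separated by '/'.

```python
def parse_sample(format_str, sample_str):
    """Map FORMAT tags to values for one sample column; None for a no-call ('.')."""
    if sample_str == ".":
        return None
    return dict(zip(format_str.split(":"), sample_str.split(":")))


def is_high_confidence(call):
    """A genotype call is high-confidence when its FI tag is '1'."""
    return call is not None and call.get("FI") == "1"


def site_passes(format_str, sample_columns):
    """Mirror the PASS rule: every sample with a call needs FI=1; no-calls are ignored."""
    calls = [parse_sample(format_str, s) for s in sample_columns]
    called = [c for c in calls if c is not None]
    return bool(called) and all(is_high_confidence(c) for c in called)


def genotype_alleles(gt, ref, alts):
    """Translate GT indices such as '2/2' into allele strings (0 = REF, 1.. = ALT)."""
    alleles = [ref] + list(alts)
    return [alleles[int(i)] if i != "." else None for i in gt.split("/")]


def csq_by_allele(csq_value):
    """Group CSQ entries by their 'Allele,X' sub-field (assumed layout, see lead-in)."""
    grouped = {}
    for entry in csq_value.split("+"):
        fields = entry.split(":")
        for field in fields:
            if field.startswith("Allele,"):
                grouped.setdefault(field.split(",", 1)[1], []).append(fields)
    return grouped


if __name__ == "__main__":
    fmt = "GT:GQ:DP:SP:PL:FI"
    nod = parse_sample(fmt, "1/1:99:31:0:255,74,0:1")
    nzo = parse_sample(fmt, "0/0:.:1:0:0,.,.:0")
    print("NODShiLtJ high-confidence:", is_high_confidence(nod))
    print("NZO high-confidence:", is_high_confidence(nzo))

    example_b = ["1/1:99:31:0:255,74,0:1", ".", ".", "1/1:99:45:0:255,50,0:1"]
    print("example (b) passes:", site_passes(fmt, example_b))

    csq = (
        "ENSMUST00000047577:ENSMUSG00000042414:missense_variant:601:201:A>P:"
        "Grantham,27:Allele,G:Gene,Prdm14+"
        "ENSMUST00000047577:ENSMUSG00000042414:missense_variant:601:201:A>T:"
        "Grantham,58:Allele,T:Gene,Prdm14"
    )
    # 'N' is a placeholder REF allele; the README example does not state it.
    for gt in ("1/1", "2/2"):
        allele = genotype_alleles(gt, "N", ["G", "T"])[0]
        change = csq_by_allele(csq)[allele][0][5]  # sixth sub-field holds e.g. 'A>P'
        print(gt, allele, change)
```

Reading the 'FI' tag per sample, rather than relying on the FILTER column alone, keeps high-confidence calls for individual strains at sites where another strain failed a filter.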