你以为的可能不是你以为的
最近生信技能树管理员小朋友XZG跟我炫耀他植物的简化基因组的gvcf模式,两百个测序数据,我一直没用过这个gvcf功能,因为的确没有需求。癌症研究,关注的主要是somatic mutation。
而且一直以为gvcf是把所有位点都report出来,不管那个位点有没有突变与否,只需要测序覆盖到了!!所以输出的文件行数会非常夸张!!!因为正常人的germline mutation在千分之一左右,所以gvcf得到的vcf会暴涨到1000倍行数。
但是简单的测试了一个坐标区间,发现事实可能不是我想象的那么简单!
$GATK HaplotypeCaller -ERC GVCF -L $bed -R $GENOME -I $bam --dbsnp $DBSNP -O tmp.vcf
我使用的gatk版本信息如下:
Using GATK jar /home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false
-Dsamjdk.use_async_io_write_samtools=true
-Dsamjdk.use_async_io_write_tribble=false
-Dsamjdk.compression_level=2
-Xmx20G -Djava.io.tmpdir=./
-jar /home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar HaplotypeCaller
USAGE: HaplotypeCaller [arguments]
Call germline SNPs and indels via local re-assembly of haplotypes
Version:4.0.3.0
我测试的这个坐标区别是 1217 bp
,在gatk的运行日志也可以看到
IntervalArgumentCollection - Processing 1217 bp from intervals
2249 read(s) filtered by: MappingQualityReadFilter
那么,按照道理,出来的vcf应该是1217行,结果:
grep -v "^#" tmp.vcf|wc
119 1190 9875
那么,为什么会缺失这么多呢?我首先想到的是我给的坐标区间大部分并没有被这个bam文件所覆盖,也就是说,没有测序到。
所以我用下面的命令检验:
samtools mpileup -r chr1:68940-70157 $bam |wc
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
1218 7308 581882
发现我给的坐标区间都被覆盖了,然后我想到了可能是很多reads测序质量太差被过滤了,所以我看了看质量
samtools view $bam chr1:68940-70157|wc
2896 57920 1475460
samtools view $bam -q 20 chr1:68940-70157|wc
651 13020 334253
看起来,测序质量高的reads不多,因为我测试坐标的特殊性。然后我看了看gvcf文件的前五列:
grep -v "^#" tmp.vcf|cut -f 1-5|head
chr1 68941 . A <NON_REF>
chr1 69072 . G <NON_REF>
chr1 69073 . A <NON_REF>
chr1 69083 . A <NON_REF>
chr1 69090 . T <NON_REF>
chr1 69101 . A <NON_REF>
chr1 69130 . C <NON_REF>
chr1 69159 . G <NON_REF>
chr1 69184 . G <NON_REF>
chr1 69185 . T <NON_REF>
有些坐标跳跃了,也有些是连续的,再跟mpileup对应一下,如下:
samtools mpileup -r chr1:68940-70157 $bam |cut -f 1,2,4,5|head
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1 68940 35 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T
chr1 68941 36 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1 68942 36 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
chr1 68943 36 A$AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1 68944 36 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G
chr1 68945 37 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1 68946 38 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1 68947 38 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1 68948 40 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G^!G
chr1 68949 41 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T
实在是太奇怪了,那些被跳跃的位点,的确有测序深度信息呀!
而且,虽然 -q 20 -Q 20 这样的reads过滤可以大幅度丢失reads,但其实并不会丢失坐标信息:
[jianmingzeng@jade germline]$ samtools mpileup -q 20 -Q 20 -r chr1:68940-70157 $bam |head
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1 69072 N 1 ^:G ?
chr1 69073 N 2 A^:A C?
chr1 69074 N 2 GG >>
chr1 69075 N 2 TT AA
chr1 69076 N 2 GG ??
chr1 69077 N 2 AA AC
chr1 69078 N 2 AA BB
chr1 69079 N 2 AA CB
chr1 69080 N 2 CC AA
chr1 69081 N 2 GG 99
[jianmingzeng@jade germline]$ samtools mpileup -r chr1:68940-70157 $bam |head |cut -f 1-5
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000
chr1 68940 N 35 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!T
chr1 68941 N 36 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1 68942 N 36 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
chr1 68943 N 36 A$AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1 68944 N 36 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G
chr1 68945 N 37 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1 68946 N 38 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA^!A
chr1 68947 N 38 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
chr1 68948 N 40 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^!G^!G
chr1 68949 N 41 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT^!Tl
不过还没等我仔细分析个中缘由,XZG就直接甩给了我一个链接,里面很清楚写到
With
GVCF
, you get a gVCF with individual variant records for variant sites, but the non-variant sites are grouped together into non-variant block records that represent intervals of sites for which the genotype quality (GQ) is within a certain range or band.
block挺好的呀,节省空间,要真的是bp分辨率记录,那么文件会大的可怕! 不过没有突变的地方既然被折叠了,也就失去了该位点的测序深度信息了。
不过仔细想想,GATK团队应该不至于没考虑到这一点,可能是有着同样测序深度的地方才会被折叠!问题在于我选取了前五列查看信息,其实应该看全部:
grep -v "^#" tmp.vcf|cut -f 1-8|head
chr1 68941 . A <NON_REF> . . END=69071
chr1 69072 . G <NON_REF> . . END=69072
chr1 69073 . A <NON_REF> . . END=69082
chr1 69083 . A <NON_REF> . . END=69089
chr1 69090 . T <NON_REF> . . END=69100
chr1 69101 . A <NON_REF> . . END=69129
chr1 69130 . C <NON_REF> . . END=69158
chr1 69159 . G <NON_REF> . . END=69183
chr1 69184 . G <NON_REF> . . END=69184
chr1 69185 . T <NON_REF> . . END=69198
折叠的非常巧妙。
具体GATK的gvcf可以阅读:https://software.broadinstitute.org/gatk/documentation/article.php?id=3893
https://gatkforums.broadinstitute.org/gatk/discussion/4017/what-is-a-gvcf-and-how-is-it-different-from-a-regular-vcf