你以为是参考基因组问题结果确实bed文件问题
最近配置了一台专供肿瘤外显子数据分析新服务器(64核512G内存32T硬盘),部署GATK的小鼠流程,发现总是运行到gvcf步骤就挂了,百思不得其解,只好拆开每个步骤慢慢探索。
首先不得不承认GATK 参考基因组很恶心
假如你使用的是这样的染色体顺序参考基因组:
>chr10
>chr11
>chr12
>chr13
>chr14
>chr15
>chr16
>chr17
>chr18
>chr19
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
>chr7
>chr8
>chr9
>chrM
>chrX
>chrY
那么你得到的bam文件是:
HL-WES-01.855229 163 chr10 3100373
HL-WES-01.855229 83 chr10 3100507
HL-WES-01.37500662 99 chr10 3100509
HL-WES-01.37500662 147 chr10 3100603
HL-WES-01.26981641 163 chr10 3100643
HL-WES-01.12556859 163 chr10 3100713
HL-WES-01.26981641 83 chr10 3100742
HL-WES-01.12556859 83 chr10 3100781
HL-WES-01.13955634 99 chr10 3101044
HL-WES-01.22796103 177 chr10 3101165
可以看到,染色体上面的坐标是排序的,但是染色体本身的排序方式,与参考基因组等同,所以没毛病
这个时候的代码也很简单是:
bwa mem -t 5 -M -R "@RG\tID:$sample\tSM:$sample\tLB:WES\tPL:Illumina" $INDEX $fq1 $fq2 1>$sample.sam
$GATK --java-options "-Xmx25G -Djava.io.tmpdir=./" SortSam \
-SO coordinate -I $sample.sam -O $sample.bam #1>log.sort 2>&1
所以也就是说 GATK
的 SortSam
的 -SO coordinate
模式 默认按照 参考基因组本身的 染色体顺序。
接着看看gvcf步骤的bed文件
我是从CCDS下载的的bed文件,其染色体顺序是:
chr1
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chrX
chrY
看起来不太妙哦, 这个bed来自于代码:
wget ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current_mouse/CCDS.current.txt
cat CCDS.current.txt|perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i > exon_probe.mm10.gene.bed
awk '{print "chr"$0"\t0\t+"}' exon_probe.mm10.gene.bed > mm10.baits.bed
awk '{print $1"\t"$2-150"\t"$3+150"\t"$4"\t"$5"\t"$6}' mm10.baits.bed > mm10.baits.150bp.bed
有多种解决方案,把它的顺序调整为比对的参考基因组那样。
调整BED顺序
比如,首先按照染色体拆分
awk '{print >> ("mm10.baits.150bp.bed."$1)}' mm10.baits.150bp.bed
得到拆开后的文件
6.5M Mar 25 10:13 mm10.baits.150bp.bed
431K Mar 25 10:51 mm10.baits.150bp.bed.chr1
313K Mar 25 10:51 mm10.baits.150bp.bed.chr10
523K Mar 25 10:51 mm10.baits.150bp.bed.chr11
220K Mar 25 10:51 mm10.baits.150bp.bed.chr12
222K Mar 25 10:51 mm10.baits.150bp.bed.chr13
233K Mar 25 10:51 mm10.baits.150bp.bed.chr14
267K Mar 25 10:51 mm10.baits.150bp.bed.chr15
199K Mar 25 10:51 mm10.baits.150bp.bed.chr16
315K Mar 25 10:51 mm10.baits.150bp.bed.chr17
163K Mar 25 10:51 mm10.baits.150bp.bed.chr18
220K Mar 25 10:51 mm10.baits.150bp.bed.chr19
550K Mar 25 10:51 mm10.baits.150bp.bed.chr2
304K Mar 25 10:51 mm10.baits.150bp.bed.chr3
418K Mar 25 10:51 mm10.baits.150bp.bed.chr4
435K Mar 25 10:51 mm10.baits.150bp.bed.chr5
326K Mar 25 10:51 mm10.baits.150bp.bed.chr6
489K Mar 25 10:51 mm10.baits.150bp.bed.chr7
328K Mar 25 10:51 mm10.baits.150bp.bed.chr8
380K Mar 25 10:51 mm10.baits.150bp.bed.chr9
232K Mar 25 10:51 mm10.baits.150bp.bed.chrX
8.5K Mar 25 10:51 mm10.baits.150bp.bed.chrY
然后把它们按照自己的想法进行排序
cat mm10.baits.150bp.bed.chr10 \
mm10.baits.150bp.bed.chr11 \
mm10.baits.150bp.bed.chr12 \
mm10.baits.150bp.bed.chr13 \
mm10.baits.150bp.bed.chr14 \
mm10.baits.150bp.bed.chr15 \
mm10.baits.150bp.bed.chr16 \
mm10.baits.150bp.bed.chr17 \
mm10.baits.150bp.bed.chr18 \
mm10.baits.150bp.bed.chr19 \
mm10.baits.150bp.bed.chr1 \
mm10.baits.150bp.bed.chr2 \
mm10.baits.150bp.bed.chr3 \
mm10.baits.150bp.bed.chr4 \
mm10.baits.150bp.bed.chr5 \
mm10.baits.150bp.bed.chr6 \
mm10.baits.150bp.bed.chr7 \
mm10.baits.150bp.bed.chr8 \
mm10.baits.150bp.bed.chr9 \
mm10.baits.150bp.bed.chrX \
mm10.baits.150bp.bed.chrY > new.bed
虽然我按照参考基因组重新排序了,但是GATK的HaplotypeCaller -ERC GVCF
仍然是报错的,很明显,而且这个报错非常隐晦, 跟bed文件本身 风马牛不相及
11:00:27.459 INFO HaplotypeCaller - Initializing engine
11:00:28.229 INFO FeatureManager - Using codec BEDCodec to read file file:///home/fhscancer/jimmy/tnbc/wes/germline/new.bed
11:00:29.545 INFO IntervalArgumentCollection - Processing 86717910 bp from intervals
11:00:29.785 INFO HaplotypeCaller - Shutting down engine
[March 25, 2019 11:00:29 AM CST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2360868864
java.lang.NullPointerException
at java.util.ComparableTimSort.countRunAndMakeAscending(ComparableTimSort.java:325)
at java.util.ComparableTimSort.sort(ComparableTimSort.java:202)
at java.util.Arrays.sort(Arrays.java:1312)
at java.util.Arrays.sort(Arrays.java:1506)
如果是正确运行,应该是:
11:12:28.567 INFO FeatureManager - Using codec BEDCodec to read file file:///home/fhscancer/jimmy/tnbc/wes/germline/tmp.bed
11:12:28.856 INFO IntervalArgumentCollection - Processing 4060302 bp from intervals
11:12:28.930 INFO HaplotypeCaller - Done initializing engine
11:12:29.112 INFO HaplotypeCallerEngine - Standard Emitting and Calling confidence set to 0.0 for reference-model confidence output
11:12:29.113 INFO HaplotypeCallerEngine - All sites annotated with PLs forced to true for reference-model confidence output
11:12:31.631 INFO NativeLibraryLoader - Loading libgkl_utils.so from jar:file:/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_utils.so
11:12:31.831 INFO NativeLibraryLoader - Loading libgkl_pairhmm_omp.so from jar:file:/home/jianmingzeng/biosoft/GATK/gatk-4.0.3.0/gatk-package-4.0.3.0-local.jar!/com/intel/gkl/native/libgkl_pairhmm_omp.so
11:12:32.400 WARN IntelPairHmm - Flush-to-zero (FTZ) is enabled when running PairHMM
11:12:32.401 INFO IntelPairHmm - Available threads: 12
11:12:32.401 INFO IntelPairHmm - Requested threads: 4
11:12:32.401 INFO PairHMM - Using the OpenMP multi-threaded AVX-accelerated native PairHMM implementation
11:12:32.723 INFO ProgressMeter - Starting traversal
11:12:32.723 INFO ProgressMeter - Current Locus Elapsed Minutes Regions Processed Regions/Minute
11:12:42.847 INFO ProgressMeter - chr10:3952161 0.2 80 474.1
不过,既然排除了bed染色体顺序问题,而且我去除bed文件,这个步骤就不报错,再次截取bed文件的前10行,这个流程也不报错,所以必定是bed文件里面某些地方有问题。
然后我想起来了:数据处理过程中有的是意外
最后解决方案
就是去除 CCDS.current.txt 文件里面的 非 Public那些记录即可;
grep Public CCDS.current.txt|perl -alne '{/\[(.*?)\]/;next unless $1;$gene=$F[2];$exons=$1;$exons=~s/\s//g;$exons=~s/-/\t/g;print "$F[0]\t$_\t$gene" foreach split/,/,$exons;}'|sort -u |bedtools sort -i > exon_probe.mm10.gene.bed
如果你完全没有看懂我在讲什么,那你可能需要下面的课程: