如何对多个转录组测序数据找变异呢
以前生信技能树发过这个教程:
第一次对参考基因组建索引
然后进行第一次序列比对
之后根据第一次比对得到的所有剪切位点,重新对参考基因组建立索引
再进行 STAR 二次序列比对
由于后面要用 GATK 进行 call 变异,还需要对比对结果 SAM 文件进行一些处理,这些都可以用 picard 来做,包括 使用AddOrReplaceReadGroups命令SAM头文件添加 @RG 标签,SAM 文件排序并转 BAM 格式,然后使用MarkDuplicates命令标记 duplicate,由于 STAR 软件使用的 MAPQ 标准与 GATK 不同,而且比对时会有 reads 的片段落到内含子区间,需要进行一步 MAPQ 同步和 reads 剪切,使用 GATK 专为 RNA-seq 应用开发的工具 SplitNCigarReads
进行操作,它会将落在内含子区间的 reads 片段直接切除,并对 MAPQ 进行调整。
问题描述:
GATK Best Practices for RNA-seq suggests using the STAR 2-pass alignment steps to do the mapping. It sounds very good. But in operation, maybe there are some issues. As shown in the methods, for every sample, the genome index should be generated twice, which is time-consuming for big genome such as human. Moreover, it seems that we can not do alignment for two samples at same time with the same genome path, right? Should we copy a genome.fa for every sample in an individual place? Seems crazy for a big project.
解答:
其实可以首先把fastq批量比对后得到多个.tab后缀文件,直接cat到一起作为第二次构建index的参考即可。
比对后得到的bam文件批量走gatk的找变异流程,代码如下:
cat $1 |while read id
do
echo $id
file=$(basename $id )
sample=${file%%_*}
echo $sample
java -Djava.io.tmpdir=$TMPDIR -Xmx25g -jar $PICARD AddOrReplaceReadGroups \
I=$id O=${sample}.bam SO=coordinate RGID=${sample} RGLB=rna \
RGPL=illumina RGPU=hiseq RGSM=${sample}
java -Djava.io.tmpdir=$TMPDIR -Xmx25g -jar $PICARD MarkDuplicates \
INPUT=${sample}.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metrics REMOVE_DUPLICATES=TRUE
java -Djava.io.tmpdir=$TMPDIR -Xmx25g -jar $PICARD FixMateInformation \
INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinate
samtools index ${sample}_marked_fixed.bam
java -Djava.io.tmpdir=$TMPDIR -Xmx40g -jar $GATK -T SplitNCigarReads \
-R $GENOME -I ${sample}_marked_fixed.bam -o ${sample}_marked_fixed_split.bam \
-rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS
#--fix_misencoded_quality_scores
## --fix_misencoded_quality_scores only if phred 64
java -Djava.io.tmpdir=$TMPDIR -Xmx25g -jar $GATK -T HaplotypeCaller \
-R $GENOME -I ${sample}_marked_fixed_split.bam --dbsnp $DBSNP \
-stand_emit_conf 10 -o ${sample}_raw.vcf
done