【直播】我的基因组70:比对文件并不能完美的还原出测序文件

前面我们说到过可以用软件或者自己写脚本从已经比对到参考基因组的sam/bam格式文件提取出原始的测序fastq文件。

但是我在IGV里面检查bam文件的时候发现了一些难以理解的现象,所以趁这个机会把它们探究清楚。

bwa工具的不同版本影响大吗?

bwa对同样测序文件同样参数比对多次结果一样吗?

bwa对一个PE reads只输出两条记录吗?

bwa的-M参数是干什么的?

bwa会截断原始fastq序列吗?

首先可以找到各个版本的软件的下载地址,如下:

https://sourceforge.net/projects/bio-bwa/files/?source=navbar

## 安装软件很简单,我就不多说了

wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.8.tar.bz2

tar jxvf bwa-0.7.8.tar.bz2

cd bwa-0.7.8

make

一般公司都会用已经建立好的流程,不太会频繁更新软件,所以是VN:0.7.8-r455 版本的bwa工具,而我喜欢下载最新版,是VN:0.7.15-r1140

我制作了一个PE的fastq测序数据,如下:

## 首先是read1.fq

ST-E00142:330:H33KVALXX:5:1219:28280:57301 /1

ACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACCTGAGGTTAGGAGTTCAAGACCACCCTGGCCAACATGGTGAAACCCCATCTCTACTAAAAAACAAAAAATTAACTAGGTGTGGTGGCAGGCGCCTGTAATCCCA

+

AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJA-FJJAJJJJF<A-FJJA-<F<F-<FA7F77-F-77FA7F-7FJ<A---7AA7-F--AJ-FJJAAA<<AF-JJJFJJJJJJJFJJJJJJJ<JJJJFJJJFJJJJJJJJJJFJFFFJ

## 然后是read2.fq

ST-E00142:330:H33KVALXX:5:1219:28280:57301 /2

CCAGGAGGCAGAGGGTGCCAGTGATCCGAGATTGCGCCATTGCACTCCTTCTTAATGAAACGGCGCCCACCGCGATCTACACACGAACCCTACGCGCCGCTCTTCCGATCTACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGC

+

A-JF<)7FJAF<A7))<)<-<<JAFFAJFF--JJJFF-F-JF77F77-----A7--77-777F-7--7J7A7-7FF-JJ7<------7777A7----7F-77-7--<---7J<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAA

你们需要把两个fastq文件用上面的序列自己制作好;

然后测试两个版本的bwa结果,代码如下:

/home/jianmingzeng/biosoft/bwa/bwa-0.7.15/bwa mem  -M  /home/jianmingzeng/reference/index/bwa/hg19 read1.fq read2.fq 1>bwa-0.7.15.sam

/home/jianmingzeng/biosoft/bwa/bwa-0.7.8/bwa mem  -M  /home/jianmingzeng/reference/index/bwa/hg19 read1.fq read2.fq 1>bwa-0.7.8.sam

下面的内容我虽然复制粘贴在这里,但是我建议你弄到notepad++等编辑器里面仔细观看,最重要的是,你自己走一般这个过程,不然你根本不知道我在说什么。

VN:0.7.15-r1140比对结果是:

@PGID:bwaPN:bwaVN:0.7.15-r1140CL:/home/jianmingzeng/biosoft/bwa/bwa-0.7.15/bwa mem -M /home/jianmingzeng/reference/index/bwa/hg19 read1.fq read2.fq

ST-E00142:330:H33KVALXX:5:1219:28280:5730165chr1134633260150M=1346519188ACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACCTGAGGTTAGGAGTTCAAGACCACCCTGGCCAACATGGTGAAACCCCATCTCTACTAAAAAACAAAAAATTAACTAGGTGTGGTGGCAGGCGCCTGTAATCCCAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJA-FJJAJJJJF<A-FJJA-<F<F-<FA7F77-F-77FA7F-7FJ<A---7AA7-F--AJ-FJJAAA<<AF-JJJFJJJJJJJFJJJJJJJ<JJJJFJJJFJJJJJJJJJJFJFFFJNM:i:0MD:Z:150AS:i:150XS:i:92

ST-E00142:330:H33KVALXX:5:1219:28280:57301129chr113465194448M102S=1346332-188CCAGGAGGCAGAGGGTGCCAGTGATCCGAGATTGCGCCATTGCACTCCTTCTTAATGAAACGGCGCCCACCGCGATCTACACACGAACCCTACGCGCCGCTCTTCCGATCTACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCA-JF<)7FJAF<A7))<)<-<<JAFFAJFF--JJJFF-F-JF77F77-----A7--77-777F-7--7J7A7-7FF-JJ7<------7777A7----7F-77-7--<---7J<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAANM:i:1MD:Z:14T33AS:i:43XS:i:35SA:Z:chr1,120410765,-,42M108S,0,0;XA:Z:chr3,+57932288,17M1I30M102S,2;

ST-E00142:330:H33KVALXX:5:1219:28280:57301401chr1120410765042M108H=1346332-119064475GCCCACCTTGGCCTCCCAAAGTGCTGGGATTACAGGCGTAGAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<J7--NM:i:0MD:Z:42AS:i:42XS:i:42SA:Z:chr1,1346519,+,48M102S,44,1;

VN:0.7.8-r455比对结果是:

@PGID:bwaPN:bwaVN:0.7.8-r455CL:/home/jianmingzeng/biosoft/bwa/bwa-0.7.8/bwa mem -M /home/jianmingzeng/reference/index/bwa/hg19 read1.fq read2.fq

ST-E00142:330:H33KVALXX:5:1219:28280:5730165chr1134633260150M=1346519188ACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGCAGGCGGATCACCTGAGGTTAGGAGTTCAAGACCACCCTGGCCAACATGGTGAAACCCCATCTCTACTAAAAAACAAAAAATTAACTAGGTGTGGTGGCAGGCGCCTGTAATCCCAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJA-FJJAJJJJF<A-FJJA-<F<F-<FA7F77-F-77FA7F-7FJ<A---7AA7-F--AJ-FJJAAA<<AF-JJJFJJJJJJJFJJJJJJJ<JJJJFJJJFJJJJJJJJJJFJFFFJNM:i:0MD:Z:150AS:i:150XS:i:0

ST-E00142:330:H33KVALXX:5:1219:28280:57301129chr113465194448M102S=1346332-188CCAGGAGGCAGAGGGTGCCAGTGATCCGAGATTGCGCCATTGCACTCCTTCTTAATGAAACGGCGCCCACCGCGATCTACACACGAACCCTACGCGCCGCTCTTCCGATCTACGCCTGTAATCCCAGCACTTTGGGAGGCCAAGGTGGGCA-JF<)7FJAF<A7))<)<-<<JAFFAJFF--JJJFF-F-JF77F77-----A7--77-777F-7--7J7A7-7FF-JJ7<------7777A7----7F-77-7--<---7J<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFAANM:i:1MD:Z:14T33AS:i:43XS:i:35SA:Z:chr1,120410765,-,42M108S,0,0;

ST-E00142:330:H33KVALXX:5:1219:28280:57301401chr1120410765042M108H=1346332-119064475GCCCACCTTGGCCTCCCAAAGTGCTGGGATTACAGGCGTAGAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ<J7--NM:i:0MD:Z:42AS:i:42XS:i:42SA:Z:chr1,1346519,+,48M102S,44,1;

我们都知道sam文件的第2列是flag,最正常应该是99和147,但是65和129是还算正常的PE reads的比对情况,分别代表左右两端

但是 401 这个东西我以前没有留意,查了一下,代表的是 not primary alignment

这里又涉及到了那个-M参数的意义:

The BWA-MEM algorithm performs local alignment. It may produce multiple primary alignments for different part of a query sequence. This is a crucial feature for long sequences. However, some tools such as Picard’s markDuplicates does not work with split alignments. One may consider to use option -M to flag shorter split hits as secondary.

而且里面还涉及到了H/S这两种比对结果:

我们也知道比对结果里面的H和S分别代表hard和soft的clipping,但是知道里面的具体含义的可能不多,如果是H模式,那么fastq会被截断后比对,这样bam文件里面看到的fastq就是一个部分序列,所以就不可能从bam文件里面还原出fastq序列啦。如果是S的话,虽然被截断的序列也是比对不说,但是在bam里面仍然会出现完整的fastq序列。

我这里已经回到了最开始我提出来的5个问题,我知道一般人看不懂!

我也没指望你们看懂,只有跟着做,动手的人才能看懂,就这样了,O(∩_∩)O谢谢

(0)

相关推荐