5 一步法找基因变异流程
1 先查看sam文件
随机选择3个
$ samtools mpileup SRR8517854.bam |head -95|tail -3
[mpileup] 1 samples in 1 input files
chr1 10105 N 8 AAAAcAAA kuuu>6uK
chr1 10106 N 10 CCCCcCCCC^$c uuuukAuuAK
chr1 10107 N 10 CCCCcCCCCc upukaFfuKF
再看另外一个:
$ samtools mpileup SRR8517854.bam |head -168944|tail -3
[mpileup] 1 samples in 1 input files
chr1 909515 N 1 c K
chr1 909516 N 1 c K
chr1 909517 N 1 g K
关于samtools mpileup的具体用法和结果解释请看
2 最简单的找变异流程
align文件夹
ref=/mnt/f/kelly/bioTree/server/wesproject/hg38/Homo_sapiens_assembly38.fasta
time samtools mpileup -ugf $ref *.bam|bcftools call -vmO z -o wxs_out.vcf.gz
less -S wxs_out.vcf.gz
3载入IGV查看
3.1先构建索引
批量构建
ls *.bam|xargs -i samtools index {}
如果是服务器需要下载到本地查看
4 去除PCR重复
需要samtools的4个命令
align文件夹
samtools markdup -r SRR7696207.bam SRR7696207.rm.bam
[markdup] error: no ms score tag. Please run samtools fixmate on file first.
[markdup] error: no ms score tag. Please run samtools fixmate on file first.
提示先fixmate
$ samtools fixmate SRR7696207.bam SRR7696207.fixmate.bam
[bam_mating_core] ERROR: Coordinate sorted, require grouped/sorted by queryname.
提示先要sort,并且要以queryname进行sort
$ samtools sort -n -o SRR7696207.sort.bam SRR7696207.bam
$ samtools sort -n -o SRR7696207.namesort.bam SRR7696207.bam
[bam_sort_core] merging from 25 files and 1 in-memory blocks...
接下来就可以逆回去了,也就是是正确的顺序(本篇最下方有说明)
$ samtools sort -n -o SRR7696207.namesort.bam SRR7696207.bam
$ samtools fixmate -m SRR7696207.namesort.bam SRR7696207.fixmate.bam
$ samtools sort -o SRR7696207.positionsort.fixmat.bam SRR7696207.fixmate.bam
$ samtools markdup -r SRR7696207.positionsort.fixmat.bam SRR7696207.rm.bam
看下以上的文件结构和大小
├── [ 136] 1
├── [ 136] 2
├── [ 0] config
├── [ 42K] markdup.bam
├── [ 22G] SRR7696207.bam
├── [5.4G] SRR7696207.fixmate.bam
├── [5.2G] SRR7696207.namesort.bam
├── [4.0G] SRR7696207.positionsort.fixmat.bam
├── [3.8G] SRR7696207.rm.bam
├── [2.6G] SRR8517853.bam
├── [3.4G] SRR8517854.bam
└── [6.9G] SRR8517856.bam
最后查看下rm.bam
samtools view SRR7696207.rm.bam |less -S
SRR7696207.27107225 2193 chr1 10024 0 86M64H chr5 10324 0 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
SRR7696207.27728977 99 chr1 10025 17 91M18S = 10025 91 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
SRR7696207.27728977 147 chr1 10025 17 91M18S = 10025 -91 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA
SRR7696207.4339191 163 chr1 10027 0 69M = 10039 123 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
SRR7696207.12487651 163 chr1 10028 0 86M = 10028 81 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
SRR7696207.9370306 83 chr1 10028 0 116M = 10016 -128 CCCTAACCCTAACCCTAAACCTAACCCTAACCCTAACC
SRR7696207.12487651 83 chr1 10028 0 69S81M = 10028 -81 TTTCTAGCAGTGGACTGCATACGTGTTCGCATACTGTG
SRR7696207.18904916 99 chr1 10030 0 81M69S = 10030 79 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
SRR7696207.18904916 147 chr1 10030 0 79M = 10030 -79 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT
SRR7696207.4339191 83 chr1 10039 0 111M7S = 10027 -123 ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC
SRR7696207.20389847 99 chr1 10040 0 74M = 10040 74 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACC
SRR7696207.20389847 147 chr1 10040 0 76S74M = 10040 -74 CATATTTTTGTTTTTTTTAATGTTACGGCGACCACCGA
看rm后的是否有差异
samtools flagstat SRR7696207.bam
samtools flagstat SRR7696207.rm.bam
结果如下
$ samtools flagstat SRR7696207.bam
55398860 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
372636 + 0 supplementary
0 + 0 duplicates
55374129 + 0 mapped (99.96% : N/A)
55026224 + 0 paired in sequencing
27513112 + 0 read1
27513112 + 0 read2
54512924 + 0 properly paired (99.07% : N/A)
54978908 + 0 with itself and mate mapped
22585 + 0 singletons (0.04% : N/A)
330146 + 0 with mate mapped to a different chr
252082 + 0 with mate mapped to a different chr (mapQ>=5)
$ samtools flagstat SRR7696207.rm.bam
52114377 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
372636 + 0 supplementary
0 + 0 duplicates
52089646 + 0 mapped (99.95% : N/A)
51741741 + 0 paired in sequencing
25868532 + 0 read1
25873209 + 0 read2
51246880 + 0 properly paired (99.04% : N/A)
51699203 + 0 with itself and mate mapped
17807 + 0 singletons (0.03% : N/A)
319884 + 0 with mate mapped to a different chr
242709 + 0 with mate mapped to a different chr (mapQ>=5)
可以再试试-S
参数
samtools markdup -S SRR7696207.sam SRR7696207.mk.bam
补充:samtoolsmarkdup
操作的正确顺序
- The first sort can be omitted if the file is already name ordered
samtools sort -n -o namesort.bam example.bam
- Add ms and MC tags for markdup to use later
samtools fixmate -m namesort.bam fixmate.bam
- Markdup needs position order
samtools sort -o positionsort.bam fixmate.bam
- Finally mark duplicates
samtools markdup positionsort.bam markdup.bam
以上是简单的找变异流程,并不完善。下面是GATK找变异。
赞 (0)