小鼠全基因组数据分析
文章是:2017 Unexpected mutations after CRISPR–Cas9 editing in vivo
We performed WGS on a CRISPR–Cas9-edited mouse to identify all off-target mutations and found an unexpectedly high number of SNVs compared with the widely accepted assumption that CRISPR causes mostly indels at regions homologous to the sgRNA.
发表于 Nature Methods volume14, pages547–548 (2017) 作者已经撤稿,但是数据仍然在!
This article was retracted on 27 April 2018
Updated online 14 June 2017
Corrected online 25 July 2017
Retracted online 30 March 2018
DNA was isolated from two CRISPR-repaired mice (F03 and F05) and one uncorrected control
数据上传了:Bioproject (accession PRJNA382177). Sequencing data available at SRA:
FVB mouse (accession SRR5450998)
F03 mouse (accession SRR5450997),
F05 mouse (accession SRR5450996).
有点类似于肿瘤外显子的数据分析流程:
As additional controls, each of the variants was compared with the FVB/NJ genome in the mouse dbSNP database (v138), and each of the SNVs was also compared with all 36 strains in the Mouse Genome Project (v3).
首先下载数据
根据作者给出的ID号
下载:
cat srr.list |while read id;do (nohup ~/biosoft/sratoolkit/sratoolkit.2.8.2-1-centos_linux64/bin/prefetch $id -X 100G & );done
下载得到的sra文件如下:
67G Jun 29 18:58 SRR5450996.sra
69G Jun 29 18:51 SRR5450997.sra
39G Jun 29 18:15 SRR5450998.sra
sra转换为fastq后如下:
45G Jun 30 13:29 control_1.fastq.gz
47G Jun 30 13:29 control_2.fastq.gz
46G Jun 30 13:26 F03_1.fastq.gz
48G Jun 30 13:26 F03_2.fastq.gz
27G Jun 30 09:24 F05_1.fastq.gz
28G Jun 30 09:24 F05_2.fastq.gz
简单的使用trim_galore进行过滤后如下;
43G Jul 1 10:22 control_1_val_1.fq.gz
44G Jul 1 10:22 control_2_val_2.fq.gz
43G Jul 1 10:29 F03_1_val_1.fq.gz
45G Jul 1 10:29 F03_2_val_2.fq.gz
26G Jul 1 03:18 F05_1_val_1.fq.gz
27G Jul 1 03:18 F05_2_val_2.fq.gz
看起来没有啥太多数据被过滤,说明作者此次测序数据质量还不错!
小鼠WGS数据分析准备工作
一般来说,可以选择最新版小鼠参考基因组(mm10)了,如果你实在有其它需求,也可以自行选择其它版本。
遗憾的是GATK官网不提供小鼠相关资源:ftp://ftp.broadinstitute.org/bundle
所以只能去illumina官网下载咯: https://support.illumina.com/sequencing/sequencing_software/igenome.html
mkdir -p ~/biosoft/GATK/resources/bundle/mm10
cd ~/biosoft/GATK/resources/bundle/mm10
wget ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Mus_musculus/UCSC/mm10/Mus_musculus_UCSC_mm10.tar.gz
nohup tar zxvf Mus_musculus_UCSC_mm10.tar.gz &
下载全部成功后,应该如下:
这个 Mus_musculus_UCSC_mm10.tar.gz &
压缩包里面的数据文件实在是太多,我没办法列出来秀给大家了。
如果有要做 RealignerTargetCreator 需要下载dbsnp文件,参考: https://www.biostars.org/p/182917/
这个教程里面提到的文件非常之多,可以ftp登录看看,用户名就用guest即可。
cd ~/biosoft/GATK/resources/bundle/mm10
mkdir dbsnp
cd dbsnp
ftp ftp-mouse.sanger.ac.uk
cd current_indels/strain_specific_vcfs/
ls -lh
lcd ./
prompt off
mget **
可以下载得到个小鼠品系的vcf文件如下:
225.0M May 1 2015 129P2_OlaHsd.mgp.v5.snps.dbSNP142.vcf.gz
201.7M May 1 2015 129S1_SvImJ.mgp.v5.snps.dbSNP142.vcf.gz
206.7M May 1 2015 129S5SvEvBrd.mgp.v5.snps.dbSNP142.vcf.gz
190.8M May 1 2015 A_J.mgp.v5.snps.dbSNP142.vcf.gz
193.9M May 1 2015 AKR_J.mgp.v5.snps.dbSNP142.vcf.gz
177.7M May 1 2015 BALB_cJ.mgp.v5.snps.dbSNP142.vcf.gz
165.8M May 1 2015 BTBR_T__Itpr3tf_J.mgp.v5.snps.dbSNP142.vcf.gz
199.1M May 1 2015 BUB_BnJ.mgp.v5.snps.dbSNP142.vcf.gz
167.2M May 1 2015 C3H_HeH.mgp.v5.snps.dbSNP142.vcf.gz
194.8M May 1 2015 C3H_HeJ.mgp.v5.snps.dbSNP142.vcf.gz
16.5M May 1 2015 C57BL_10J.mgp.v5.snps.dbSNP142.vcf.gz
6.8M May 1 2015 C57BL_6NJ.mgp.v5.snps.dbSNP142.vcf.gz
101.6M May 1 2015 C57BR_cdJ.mgp.v5.snps.dbSNP142.vcf.gz
100.2M May 1 2015 C57L_J.mgp.v5.snps.dbSNP142.vcf.gz
108.4M May 1 2015 C58_J.mgp.v5.snps.dbSNP142.vcf.gz
749.4M May 1 2015 CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
198.0M May 1 2015 CBA_J.mgp.v5.snps.dbSNP142.vcf.gz
192.9M May 1 2015 DBA_1J.mgp.v5.snps.dbSNP142.vcf.gz
197.9M May 1 2015 DBA_2J.mgp.v5.snps.dbSNP142.vcf.gz
196.8M May 1 2015 FVB_NJ.mgp.v5.snps.dbSNP142.vcf.gz
206.4M May 1 2015 I_LnJ.mgp.v5.snps.dbSNP142.vcf.gz
216.4M May 1 2015 KK_HiJ.mgp.v5.snps.dbSNP142.vcf.gz
234.6M May 1 2015 LEWES_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
206.0M May 1 2015 LP_J.mgp.v5.snps.dbSNP142.vcf.gz
698.2M May 1 2015 MOLF_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
204.6M May 1 2015 NOD_ShiLtJ.mgp.v5.snps.dbSNP142.vcf.gz
201.3M May 1 2015 NZB_B1NJ.mgp.v5.snps.dbSNP142.vcf.gz
202.8M May 1 2015 NZO_HlLtJ.mgp.v5.snps.dbSNP142.vcf.gz
212.4M May 1 2015 NZW_LacJ.mgp.v5.snps.dbSNP142.vcf.gz
728.8M May 1 2015 PWK_PhJ.mgp.v5.snps.dbSNP142.vcf.gz
203.2M May 1 2015 RF_J.mgp.v5.snps.dbSNP142.vcf.gz
181.4M May 1 2015 SEA_GnJ.mgp.v5.snps.dbSNP142.vcf.gz
1.5G May 1 2015 SPRET_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
197.1M May 1 2015 ST_bJ.mgp.v5.snps.dbSNP142.vcf.gz
271.7M May 1 2015 WSB_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
267.7M May 1 2015 ZALENDE_EiJ.mgp.v5.snps.dbSNP142.vcf.gz
这些vcf文件的理解,需要对小鼠这个实验动物背景有一点了解,实际上这个时候我们需要的vcf文件应该是来自于dbSNP数据库的,需要需要的是dbsnp的rs ID号。 ftp://ftp.ncbi.nih.gov/snp/
cd ~/biosoft/GATK/resources/bundle/mm10/dbsnp
wget ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz
wget ftp://ftp.ncbi.nih.gov/snp/organisms/archive/mouse_10090/VCF/00-All.vcf.gz.tbi
下载的vcf文件也需要仔细理解哦:
##fileformat=VCFv4.0
##source=dbSNP
##dbSNP_BUILD_ID=150
##reference=GCF_000001635.24
##variationPropertyDocumentationUrl=ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=VP,Number=1,Type=String,Description="Variation Property. Documentation is at ftp://ftp.ncbi.nlm.nih.gov/snp/specs/dbSNP_BitField_latest.pdf">
##INFO=<ID=GENEINFO,Number=.,Type=String,Description="Pairs each of gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=dbSNPBuildID,Number=1,Type=Integer,Description="First dbSNP Build for RS">
##INFO=<ID=SAO,Number=1,Type=Integer,Description="Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic, 3 - Both">
##INFO=<ID=VC,Number=1,Type=String,Description="Variation Class">
GATK流程
gatk只是上游流程,
touch config.sra config.raw config.clean config.bam readme.txt
mkdir -p {sra,raw_fastq,clean_fastq,alignment,somatic,germline,cnv,logs,qc,ccf,qualimap}
mkdir -p somatic/{mutect2,varscan,muse}
mkdir -p cnv/{cnvkit,gatk4,sequenza,gistic2}
mkdir -p ccf/{ABSOLUTE,THetA2,PureCN,PyClone,BubbleTree}
mkdir -p qc/{align,clean,counts,raw,recal,trim,somatic,germline,cnv,qualimap}
比对以及GATK中间流程的数据太耗费空间,就不保留了,而且非常好内存。
首先看比对成sam后的bam文件,大小是:
138G Jul 6 23:55 control.bam
139G Jul 7 00:12 F03.bam
83G Jul 6 04:47 F05.bam
接下来 25G的内存已经是不够用了,报错如下:
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Runtime.totalMemory()=19129171968
[Sat Jul 07 05:45:29 CST 2018] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 85.82 minutes.
slurmstepd: *** JOB 164819 ON compute-2-5 CANCELLED AT 2018-07-07T05:44:49 ***
slurmstepd: Exceeded job memory limit
slurmstepd: Job 164819 exceeded memory limit (26735988 > 26214400), being killed
调整到35G的内存,终于可以成功运行,得到的bam文件如下:
123G Jul 9 20:00 control_marked_fixed.bam
125G Jul 9 21:20 F03_marked_fixed.bam
80G Jul 9 16:44 F05_marked_fixed.bam
接下来的重头戏是 GATK的BaseRecalibrator
和 HaplotypeCaller
,可以参考:https://www.jianshu.com/p/49d035b121b8
for sample in {control,F03,F05};
do
echo $sample
ls -lh ${sample}_marked_fixed.bam
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator \
-R $ref \
-I ${sample}_marked_fixed.bam \
--known-sites $snp \
--known-sites $indel \
-O ${sample}_recal.table \
1>${sample}_log.recal
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" ApplyBQSR \
-R $GENOME \
-I ${sample}_marked_fixed.bam \
-bqsr ${sample}_recal.table \
-O ${sample}_bqsr.bam \
1>${sample}_log.ApplyBQSR
## 使用GATK的HaplotypeCaller命令
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" HaplotypeCaller \
-R $ref \
-I ${sample}_bqsr.bam \
--dbsnp $snp \
-O ${sample}_raw.vcf \
1>${sample}_log.HC
done
其实这样的shell脚本是很烂的, 因为这个小鼠全基因组数据太大,如果这样一个步骤接着一个步骤,一个样本接着一个样本的运行,大半个月就过去了。
而且上面的数据都是非常大的,可以选择小批量数据进行测试脚本:
mkdir test
samtools view -h control_marked_fixed.bam |head -100000 |samtools sort -o test/control_marked_fixed.bam -
cd test
module load java/1.8.0_91
GENOME=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/WholeGenomeFasta/genome.fa
ref=$GENOME
INDEX=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/Mus_musculus/UCSC/mm10/Sequence/BWAIndex/genome.fa
GATK=/home/jianmingzeng/biosoft/GATK/gatk-4.0.2.1/gatk
dbsnpPath=/home/jianmingzeng/biosoft/GATK/resources/bundle/mm10/dbsnp
snp1=${dbsnpPath}/129S5SvEvBrd.mgp.v5.snps.dbSNP142.vcf.gz
snp2=${dbsnpPath}/C57BL_6NJ.mgp.v5.snps.dbSNP142.vcf.gz
snp3=${dbsnpPath}/I_LnJ.mgp.v5.snps.dbSNP142.vcf.gz
indel1=${dbsnpPath}/129S5SvEvBrd.mgp.v5.indels.dbSNP142.normed.vcf.gz
indel2=${dbsnpPath}/C57BL_6NJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
indel3=${dbsnpPath}/I_LnJ.mgp.v5.indels.dbSNP142.normed.vcf.gz
sample=control
$GATK --java-options "-Xmx20G -Djava.io.tmpdir=./" BaseRecalibrator \
-R $ref \
-I ${sample}_marked_fixed.bam \
--known-sites $snp1 --known-sites $snp2 --known-sites $snp3 \
--known-sites $indel1 --known-sites $indel2 --known-sites $indel3 \
-O ${sample}_recal.table \
1>${sample}_log.recal
尤其值得注意的是染色体标记错误:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: No overlapping contigs found.
reference contigs = [chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chrM, chrX, chrY]
features contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 3, 4, 5, 6, 7, 8, 9, MT, X, Y]
也就是说我们给的vcf文件里面的染色体是没有chr这个前缀,可是我们给的参考基因组里面却有这个前缀。
GATK的gvcf模式
这里有3只小鼠需要进行基因型比较,所以gvcf模式比较合适,使用 GATK的 Joint Calling , 过滤参考:https://mp.weixin.qq.com/s/W8Vfv1WmW6M7U0tIcPtlng
因为学校服务器最近总是出问题,所以这个流程只能是半成品,待续。
生信技能树GATK4系列教程
然后是 CNV相关工具
还有vcf和maf的工具: