3000份水稻重测序数据全部公开
感觉最近接触的生物信息学知识越多,越对大数据时代的到来更有同感了。现在的研究者,其实很多都可以自己在家里做了,大量的数据基本都是公开的, 但是一个人闭门造车成就真的有限,与他人交流的思想碰撞还是蛮重要的。
https://aws.amazon.com/cn/blogs/aws/new-aws-public-data-set-3000-rice-genome/ https://aws.amazon.com/cn/public-data-sets/3000-rice-genome/ https://wiki.dnanexus.com/Featured-Projects/3000-rice-genomes
这里面列出了3000多份水稻全基因组测序数据,都共享在亚马逊云上面,是全基因组的双端测序数据,共3,024个水稻数据,比对到了五种不同的水稻参考基因组上面,而且主要是用GATK来找差异基因的。 而且,数据收集者还给出了一个snp calling的标准流程!
其采用的找变异流程如下:
SNP Pipeline Commands
1. Index the reference genome using bwa index
/software/bwa-0.7.10/bwa index /reference/japonica/reference.fa
2. Align the paired reads to reference genome using bwa mem.
Note: Specify the number of threads or processes to use using the -t parameter. The possible number of threads depends on the machine where the command will run.
/software/bwa-0.7.10/bwa mem -M -t 8 /reference/japonica/reference.fa /reads/filename_1.fq.gz /reads/filename_2.fq.gz > /output/filename.sam
3. Sort SAM file and output as BAM file
java -Xmx8g -jar /software/picard-tools-1.119/SortSam.jar INPUT=/output/filename.sam OUTPUT=/output/filename.sorted.bam VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
4. Fix mate information
java -Xmx8g -jar /software/picard-tools-1.119/FixMateInformation.jar INPUT=/output/filename.sorted.bam OUTPUT=/output/filename.fxmt.bam SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE
5. Mark duplicate reads
java -Xmx8g -jar /software/picard-tools-1.119/MarkDuplicates.jar INPUT=/output/filename.fxmt.bam OUTPUT=/output/filename.mkdup.bam METRICS_FILE=/output/filename.metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=TRUE MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
6. Add or replace read groups
java -Xmx8g -jar /software/picard-tools-1.119/AddOrReplaceReadGroups.jar INPUT=/output/filename.mkdup.bam OUTPUT=/output/filename.addrep.bam RGID=readname PL=Illumina SM=readname CN=BGI VALIDATION_STRINGENCY=LENIENT SO=coordinate CREATE_INDEX=TRUE
7. Create index and dictionary for reference genome
/software/samtools-1.0/samtools faidx /reference/japonica/reference.fa
java -Xmx8g -jar /software/picard-tools-1.119/CreateSequenceDictionary.jar REFERENCE=/reference/japonica/reference.fa OUTPUT=/reference/reference.dict
8. Realign Target
java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -o /output/filename.intervals -fixMisencodedQuals -nt 8
9. Indel Realigner
java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T IndelRealigner -fixMisencodedQuals -I /output/filename.addrep.bam -R /reference/japonica/reference.fa -targetIntervals /output/filename.intervals -o /output/filename.realn.bam
10. Merge individual BAM files if there are multiple read pairs per sample
/software/samtools-1.0/samtools merge /output/filename.merged.bam /output/*.realn.bam
11. Call SNPs using Unified Genotyper
java -Xmx8g -jar /software/GenomeAnalysisTK-3.2-2/GenomeAnalysisTK.jar -T UnifiedGenotyper -R /reference/japonica/reference.fa -I /output/filename.merged.bam -o filename.merged.vcf -glm BOTH -mbq 20 --genotyping_mode DISCOVERY -out_mode EMIT_ALL_SITES