用NGSCheckMate检查两个测序数据是否来自于同一个样本

NGSCheckMate是2017年发表在Nucleic Acids Res. 的工具,链接https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5499645/ 思路很有趣,值得细看。 工具在GitHub里面:https://github.com/parklab/NGSCheckMate

Here, we describe NGSCheckMate, a user-friendly software package for verifying sample identities from FASTQ, BAM or VCF files.

应用范围非常广,包括各种NGS数据,而且测序深度要求很低:

Our evaluation shows that NGSCheckMate is effective for a variety of data types, including exome sequencing, whole-genome sequencing, RNA-seq, ChIP-seq, targeted sequencing and single-cell whole-genome sequencing, with a minimal requirement for sequencing depth (>0.5X).

原理很简单,就是把所有待分析的样本的SNP位点找到,根据已知有广泛差异的位点来对这些样本进行聚类,看看是否同一个样本的不同测序数据会聚集在一起。

这个需求早在3年前,我们公司美国总部的工程师就写过perl程序来做类似的事情,但是公司一般不会公布软件来发表文章,可惜我那个时候也没有什么科研思维,但也没有想到,居然是直到2017年才有人发表,这么迫切的需求才被解决,不可思议。

参考snp位点

这个非常重要,不管是何种NGS数据,测定的区域都很广泛,需要指定特定的,能把不同样本区分开的某些SNP位点,作者只给出了人类hg19的的,如下:

Among the SNP set (version 138) downloaded from dbSNP, we selected 21 067 exonic SNPs that are variable across individuals to construct a set of informative features for individual identification. Specifically, we calculated the VAF of every SNP in the dbSNP set using 40 germline WGS profiles from TCGA stomach cancer patients and selected SNPs whose median absolute deviation of the SNP VAF across samples is larger than zero. The resulting 21 067 exonic SNPs served as a reference set to measure VAF correlation between input files.

软件下载地址里面有这个文件。

算法:kmer方法

可以说算法是很简单的,

The alignment-free method is designed to obtain read counts for each SNP by scanning reads in FASTQ files to search for a k-mer sequence that spans the SNP locus either at the center or at one of the two ends (k = 21 for the current version). Both forward and reverse complementary sequences were included in the k-mer set. To ensure that each k-mer uniquely represents an SNP and its allele, we exclude SNPs for which the k-mers with the reference allele do not uniquely map to the reference genome or for which the k-mers with an alternative allele map to the reference genome. We used only perfect matches. Each k-mer (the hash key) derived from the remaining SNPs is stored in a hash table and points to the read count of the SNP and its allele type (the hash value).

安装及配置软件

首先是从GitHub上面下载并且安装

mkdir -p ~/biosoft/ngscheckmate
cd ~/biosoft/ngscheckmate
git clone https://github.com/parklab/NGSCheckMate.git
## set NCM_HOME according to you shell environment
## for example, when using bash, add the following in your .bashrc
vi ~/.bashrc
#export NCM_HOME=c
source ~/.bashrc
echo $NCM_HOME

一般我们都是从bam文件开始来使用这个程序,所以需要修改配置文件

vi /home/jianmingzeng/biosoft/ngscheckmate/NGSCheckMate/ncm.conf

因为 samtools 和bcftools一般都是放在环境变量,所以不需要修改,主要是看 参考基因组的问题。

那么运行的方法如下:

source activate py27
python ~/biosoft/ngscheckmate/NGSCheckMate/ncm.py  -B -f \
-l bam_path.txt \
-bed ~/biosoft/ngscheckmate/NGSCheckMate/SNP/SNP_GRCh37_hg19_wChr.bed  -O "./"

hg38版本问题

需要进行转换, 我以前博客讲解过,很简单的:

http://www.bio-info-trainee.com/1413.html

http://www.bio-info-trainee.com/990.html

我这里就选择 CrossMap

source activate py27
pip install CrossMap
python ~/.local/bin/CrossMap.py --help
cd /home/jianmingzeng/biosoft/ngscheckmate/NGSCheckMate/SNP
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
python ~/.local/bin/CrossMap.py  bed hg19ToHg38.over.chain.gz SNP_GRCh37_hg19_wChr.bed|cut -f 8-13>SNP_GRCh38_hg38_wChr.bed

记得版本转换要去除空行哦。

假如你运行其它物种

这个时候软件并没有提供你参考snp数据集,你可以模仿作者的思路去制作,也可以直接把多个样本的vcf文件自己制作出来,然后取共有位点。

比如下面的小鼠的:

cat  /vcf/*.variants_filtered.vcf |grep -v '^#' |cut -f 1-5 |perl -alne '{next if length($F[3])>1; next if length($F[4])>1;$end=$F[1]+1;print "$F[0]\t$F[1]\t$end\t$F[2]\t$F[3]\t$F[4]"}'|sort -u > ~/biosoft/ngscheckmate/NGSCheckMate/SNP/mm10.bed

source activate py27
nohup python /home/jianmingzeng/biosoft/ngscheckmate/NGSCheckMate/ncm.py  -B -f \
-l ~/data/project/check_mouse/bam_path.txt \
-bed /home/jianmingzeng/biosoft/ngscheckmate/NGSCheckMate/SNP/mm10.bed  -O test &

(0)

相关推荐