区间运算
昨日还是前日,朋友在python小群分享了其在stackoverflow?上面看到的关于区间运算的一些讨论...我翻了下,没太多耐心就没有看下去了....
关于区间运算,我觉得是一个很有趣的东西,
最简单的一个例子就是,给定两个区间,
(1,10),(8,20)
my @first = (1,10);
my @second = (8,20);
暂时不考虑坐标是 先大后小 的,看两个区间是否有重叠
print &isOverlap(@first,@second);
sub isOverlap{
($firstStart, $firstEnd, $secondStart, $secondEnd) = @_;
if($firstEnd < $secondStart || $firstStart > $secondEnd){
return 0; # 区间之外
}else{
return 1;
}
}
如果需要考虑重叠的比例
print &getMaxOverlapRate(@first,@second);
sub getMaxOverlapRate{
($firstStart, $firstEnd, $secondStart, $secondEnd) = @_;
# 为了条理清晰,写多了一些中间变量
my $max = getMax($firstEnd,$secondEnd);
my $min = getMin($firstStart,$secondStart);
my $totalLen = $max - $min;
my $firstLen = $firstEnd - $firstStart;
my $secondLen = $secondEnd - $secondStart;
my $mergedLen = $firstLen + $secondLen;
my $OverlapLen = $mergedLen-$totalLen;
return $firstLen<$secondLen?$OverlapLen/$firstLen:$OverlapLen/$secondLen;
}
sub getMax{
$first=shift;
$second = shift;
return $first>$second?$first:$second;
}
sub getMin{
$first=shift;
$second = shift;
return $first<$second?$first:$second;
}
很久以前做过一个重测序鉴定SV,然后提取SV对应的注释信息的...那个就只是两个区间的,当时需要判断 基因被SV覆盖的比例,来确定是否算是缺失的基因...
最近做小RNA的也有类似的问题,不过更麻烦一点,大体情况是,
小RNA预测,如果是一个样本的话,那么直接一个样本回帖预测,如果是多个样本的话,那么大多数人可能是多个样本合并之后进行预测,这样就只得到一个预测结果....而我太过聪明,所有样本分开预测,这样就出问题了,出现很多个预测结果,那么需要合并这些预测结果。
小RNA预测,和RNAseq的有参组装类似,组装出来的结果需要merge..但是merge也需要控制overlap比例,如果只是一两个base的Overlap,那个其实并没有太大意义,或许是miRNA cluster呢?
所以就大概写了一个脚本,用来合并N个smallRNA测序结果中,miRDP...的预测结果...恩虽然我已经修改可miRDP的一些判断规则和逻辑。。。毕竟....太宽松了不好....
===========================具体如下=====================
#!/usr/bin/env perl
usestrict;
my$usage="
perl $0 float sample_1_P_predictionsample_2_P_prediction [sample_3_P_prediction ...] > OutMerged_P_prediction
# 输入参数:
float 设置Overlap的比例,一般是0.8
sample_1_P_predictionsample_2_P_prediction [sample_3_P_prediction ...] 一系列预测结果
# 输出结果:
输出一张merged Table, 每一个预测出来的miR都有在最后一列有编号,相同编号的则有Overlap
# Merged条件:
如果在相同染色体上,成熟体位置的交集长度超过短的成熟体长度的80%,
那么判断为相同的miRNA,直接保留
# 写一个脚本,专门用于整合同一个参考基因组,使用miRDeep-P预测出来的miRNA
";
die$usageunless@ARGV;
my$ratio=shift;
for(@ARGV){
-s$_ordie"can not find $_\n$usage\n";
}
my$groupCounts='aaa';# 用于最终Counts的
my%chrSeen;
for(@ARGV){
my$file=$_;
openIN,'<',$fileordie"Can't read file $file\n";
while(<IN>){
chomp;
nextif$.==1;# 过滤掉头部行
my($ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias)=split/\t/,$_;
my($curStart,$curEnd)=split /\.\./,$MatureMIRregion;
if($chrSeen{$ChrID}){
# 获取所有区域该染色体上非冗余成熟miRNA的区域
my@regionRefArr=keys%{$chrSeen{$ChrID}};
# 查看是否有成熟miRNA交集超过0.8的
my$isOverlap= 0;
for(@regionRefArr){
my($start,$end,$strand,$groupID)=split/\t/,$_;
# 如果是相同一条链上的
if($strand eq $Strand){
# 如果在相同染色体上,成熟体位置的交集长度超过短的成熟体长度的80%
if(&getOverlapRate($start,$end,$curStart,$curEnd)>=$ratio){
# 直接有Overlap 就直接跳出
$isOverlap= 1;
push@{$chrSeen{$ChrID}->{"$curStart\t$curEnd\t$Strand\t$groupID"}},[$ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias,$groupID];
last;
}
}
}
if($isOverlap){
# 如果Overlap的话,就不用管了
# 因为上面已经记录了
}else{
# 如果没有Overlap的话
$groupCounts++;
push@{$chrSeen{$ChrID}->{"$curStart\t$curEnd\t$Strand\t$groupCounts"}},[$ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias,$groupCounts];
}
}else{
# 记录新的Chr上面的记录
$groupCounts++;
push@{$chrSeen{$ChrID}->{"$curStart\t$curEnd\t$Strand\t$groupCounts"}},[$ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias,$groupCounts];
}
}
closeIN;
}
# 输出Header
print"ChrID\tStrand\tRepresentativeSeqID\tSubChrID\tMatureMIRregion\tPercusorMIRregion\tMatureSeq\tPercusorSeq\tScore\tTotalFreq\tMarueFreq\tStarFreq\tStrandBias\tOverlapIDs\n";
for(sortkeys%chrSeen){
my$chrId=$_;
for(sort{(split/\t/,$a)[0]<=>(split/\t/,$b)[0]}keys%{$chrSeen{$chrId}}){
for(@{$chrSeen{$chrId}->{$_}}){
printjoin"\t",@{$_},"\n";
}
}
}
sub getOverlapRate{
my($start_1,$end_1,$start_2,$end_2)=@_;
# 这个一定是排序好的,
my$minStart=$start_1;
if($start_2<$start_1){
$minStart=$start_2;
}
my$maxEnd=$end_1;
if($end_2>$end_1){
$maxEnd=$end_2;
}
my$len_1=$end_1-$start_1;
my$len_2=$end_2-$start_2;
my$overlapLen=$end_1-$start_1+$end_2-$start_2-($maxEnd-$minStart);
# my $ratio =($end_1-$start_1+$end_2-$start_2)/($maxEnd-$minStart);
# 如果这个比值 ==1 那么就是刚好没有Overlap
# 如果这个比值 >1 那么就是 $ratio - 1 就是总体跨越区间的 Overlap 比例
# 如果这个比值 <1 那么就是没有Overlap
if($overlapLen<= 0){
return 0;
}else{
$len_2>$len_1?return$overlapLen/$len_1:return$overlapLen/$len_2;
}
}
__END__
# 获取最高频的 miRNA
perl -lane '$count{$F[-1]}->{$F[6]}++;END{for(sort keys %count){print"$_\t",(sort {$a<=>$b} keys %{$count{$_}})[0]}}'merged.prediction.raw > hookIdandSeq.ids
perl -F'\t' -lane 'if($flag){print if $seen{"$F[-1]\t$F[6]"} and$seen{"$F[-1]\t$F[6]"}++<2}else{$seen{"$F[0]\t$F[1]"}=1}$flag=1if eof(ARGV)' hookIdandSeq.ids merged.prediction.raw >merged.prediction.final