区间运算

昨日还是前日,朋友在python小群分享了其在stackoverflow?上面看到的关于区间运算的一些讨论...我翻了下,没太多耐心就没有看下去了....

关于区间运算,我觉得是一个很有趣的东西,

最简单的一个例子就是,给定两个区间,

(1,10),(8,20)

my @first = (1,10);

my @second = (8,20);

暂时不考虑坐标是 先大后小 的,看两个区间是否有重叠

print &isOverlap(@first,@second);

sub isOverlap{

($firstStart, $firstEnd, $secondStart, $secondEnd) = @_;

if($firstEnd < $secondStart || $firstStart > $secondEnd){

return 0;  # 区间之外

}else{

return 1;

}

}

如果需要考虑重叠的比例

print &getMaxOverlapRate(@first,@second);

sub getMaxOverlapRate{

($firstStart, $firstEnd, $secondStart, $secondEnd) = @_;

# 为了条理清晰,写多了一些中间变量

my $max = getMax($firstEnd,$secondEnd);

my $min = getMin($firstStart,$secondStart);

my $totalLen = $max - $min;

my $firstLen =  $firstEnd - $firstStart;

my $secondLen = $secondEnd - $secondStart;

my $mergedLen =  $firstLen + $secondLen;

my $OverlapLen = $mergedLen-$totalLen;

return $firstLen<$secondLen?$OverlapLen/$firstLen:$OverlapLen/$secondLen;

}

sub getMax{

$first=shift;

$second = shift;

return $first>$second?$first:$second;

}

sub getMin{

$first=shift;

$second = shift;

return $first<$second?$first:$second;

}

很久以前做过一个重测序鉴定SV,然后提取SV对应的注释信息的...那个就只是两个区间的,当时需要判断 基因被SV覆盖的比例,来确定是否算是缺失的基因...

最近做小RNA的也有类似的问题,不过更麻烦一点,大体情况是,

小RNA预测,如果是一个样本的话,那么直接一个样本回帖预测,如果是多个样本的话,那么大多数人可能是多个样本合并之后进行预测,这样就只得到一个预测结果....而我太过聪明,所有样本分开预测,这样就出问题了,出现很多个预测结果,那么需要合并这些预测结果。

小RNA预测,和RNAseq的有参组装类似,组装出来的结果需要merge..但是merge也需要控制overlap比例,如果只是一两个base的Overlap,那个其实并没有太大意义,或许是miRNA cluster呢?

所以就大概写了一个脚本,用来合并N个smallRNA测序结果中,miRDP...的预测结果...恩虽然我已经修改可miRDP的一些判断规则和逻辑。。。毕竟....太宽松了不好....

===========================具体如下=====================

#!/usr/bin/env perl

usestrict;

my$usage="

perl $0 float sample_1_P_predictionsample_2_P_prediction [sample_3_P_prediction ...] > OutMerged_P_prediction

# 输入参数:

float 设置Overlap的比例,一般是0.8

sample_1_P_predictionsample_2_P_prediction [sample_3_P_prediction ...]   一系列预测结果

# 输出结果:

输出一张merged Table, 每一个预测出来的miR都有在最后一列有编号,相同编号的则有Overlap

# Merged条件:

如果在相同染色体上,成熟体位置的交集长度超过短的成熟体长度的80%,

那么判断为相同的miRNA,直接保留

# 写一个脚本,专门用于整合同一个参考基因组,使用miRDeep-P预测出来的miRNA

";

die$usageunless@ARGV;

my$ratio=shift;

for(@ARGV){

-s$_ordie"can not find $_\n$usage\n";

}

my$groupCounts='aaa';# 用于最终Counts的

my%chrSeen;

for(@ARGV){

my$file=$_;

openIN,'<',$fileordie"Can't read file $file\n";

while(<IN>){

chomp;

nextif$.==1;# 过滤掉头部行

my($ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias)=split/\t/,$_;

my($curStart,$curEnd)=split /\.\./,$MatureMIRregion;

if($chrSeen{$ChrID}){

# 获取所有区域该染色体上非冗余成熟miRNA的区域

my@regionRefArr=keys%{$chrSeen{$ChrID}};

# 查看是否有成熟miRNA交集超过0.8的

my$isOverlap= 0;

for(@regionRefArr){

my($start,$end,$strand,$groupID)=split/\t/,$_;

# 如果是相同一条链上的

if($strand eq $Strand){

# 如果在相同染色体上,成熟体位置的交集长度超过短的成熟体长度的80%

if(&getOverlapRate($start,$end,$curStart,$curEnd)>=$ratio){

# 直接有Overlap 就直接跳出

$isOverlap= 1;

push@{$chrSeen{$ChrID}->{"$curStart\t$curEnd\t$Strand\t$groupID"}},[$ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias,$groupID];

last;

}

}

}

if($isOverlap){

# 如果Overlap的话,就不用管了

# 因为上面已经记录了

}else{

# 如果没有Overlap的话

$groupCounts++;

push@{$chrSeen{$ChrID}->{"$curStart\t$curEnd\t$Strand\t$groupCounts"}},[$ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias,$groupCounts];

}

}else{

# 记录新的Chr上面的记录

$groupCounts++;

push@{$chrSeen{$ChrID}->{"$curStart\t$curEnd\t$Strand\t$groupCounts"}},[$ChrID,$Strand,$RepresentativeSeqID,$SubChrID,$MatureMIRregion,$PercusorMIRregion,$MatureSeq,$PercusorSeq,$Score,$TotalFreq,$MarueFreq,$StarFreq,$StrandBias,$groupCounts];

}

}

closeIN;

}

# 输出Header

print"ChrID\tStrand\tRepresentativeSeqID\tSubChrID\tMatureMIRregion\tPercusorMIRregion\tMatureSeq\tPercusorSeq\tScore\tTotalFreq\tMarueFreq\tStarFreq\tStrandBias\tOverlapIDs\n";

for(sortkeys%chrSeen){

my$chrId=$_;

for(sort{(split/\t/,$a)[0]<=>(split/\t/,$b)[0]}keys%{$chrSeen{$chrId}}){

for(@{$chrSeen{$chrId}->{$_}}){

printjoin"\t",@{$_},"\n";

}

}

}

sub getOverlapRate{

my($start_1,$end_1,$start_2,$end_2)=@_;

# 这个一定是排序好的,

my$minStart=$start_1;

if($start_2<$start_1){

$minStart=$start_2;

}

my$maxEnd=$end_1;

if($end_2>$end_1){

$maxEnd=$end_2;

}

my$len_1=$end_1-$start_1;

my$len_2=$end_2-$start_2;

my$overlapLen=$end_1-$start_1+$end_2-$start_2-($maxEnd-$minStart);

# my $ratio =($end_1-$start_1+$end_2-$start_2)/($maxEnd-$minStart);

# 如果这个比值 ==1  那么就是刚好没有Overlap

# 如果这个比值 >1 那么就是 $ratio - 1 就是总体跨越区间的 Overlap 比例

# 如果这个比值 <1 那么就是没有Overlap

if($overlapLen<= 0){

return 0;

}else{

$len_2>$len_1?return$overlapLen/$len_1:return$overlapLen/$len_2;

}

}

__END__

# 获取最高频的 miRNA

perl -lane '$count{$F[-1]}->{$F[6]}++;END{for(sort keys %count){print"$_\t",(sort {$a<=>$b} keys %{$count{$_}})[0]}}'merged.prediction.raw > hookIdandSeq.ids

perl -F'\t' -lane 'if($flag){print if $seen{"$F[-1]\t$F[6]"} and$seen{"$F[-1]\t$F[6]"}++<2}else{$seen{"$F[0]\t$F[1]"}=1}$flag=1if eof(ARGV)' hookIdandSeq.ids merged.prediction.raw >merged.prediction.final

(0)

相关推荐