为什么sort,cut的列失效了?
有一个需求,是把peaks注释前后的文件合并起来,不想写脚本,想就用shell命令来做到。
很简单,两个文件都按照peaks的name排序,然后paste即可。
本来是想偷个懒,结果搞了大半个小时,最后还是用了perl,记录一下这个悲催经历。
我可以很肯定,文件都是以tab符号分割的列。
我都以第一列,也就是peaks的name排序,但是诡异的事情发生了。
命令如下:
sed '1d' 57030_sort_peaks.peakAnn.xls |cut -f 2-8,1 |sort -t $'\t' -k 8 |head -12 cut -f 4,7-9 57030_sort_peaks.narrowPeak.bed |sort -t $'\t' -k 1 |head -12
排序结果如下:
sed '1d' 57030_sort_peaks.peakAnn.xls |sort -t$'\t' -k1 |cut -f 1-8 |head -12 peak1000 chr16 2513739 2513891 + 4.00941 NA promoter-TSS (NM_001198569) peak1001 chr16 2777421 2777567 + 4.54262 NA promoter-TSS (NM_007108) peak1002 chr16 2904949 2905145 + 4.40547 NA Intergenic peak1003 chr16 3059613 3059759 + 4.06547 NA promoter-TSS (NR_123723) peak1004 chr16 3282982 3283229 + 6.19764 NA promoter-TSS (NM_005741) peak1005 chr16 3457701 3457880 + 4.26274 NA promoter-TSS (NM_001317097) peak1006 chr16 4412132 4412322 + 4.11112 NA intron (NR_145128, intron 3 of 27) peak1007 chr16 4425589 4425781 + 6.69260 NA promoter-TSS (NM_001135110) peak1008 chr16 4538610 4538771 + 4.44393 NA promoter-TSS (NM_001199054) peak1009 chr16 9091684 9092031 + 4.99539 NA 5' UTR (NM_014117, exon 1 of 4) peak100 chr1 66930370 66930598 + 5.48491 NA intron (NM_001077704, intron 1 of 13) peak1010 chr16 14928161 14928337 + 6.77548 NA promoter-TSS (NR_103772) cut -f 4,7-9 57030_sort_peaks.narrowPeak.bed |sort -t$'\t' -k1 |head -12 peak1000 4.00941 10.78589 6.44162 peak1001 4.54262 12.95468 8.39192 peak1002 4.40547 9.16804 5.03032 peak1003 4.06547 8.97515 4.87603 peak1004 6.19764 16.08099 11.28833 peak1005 4.26274 9.14970 5.01487 peak100 5.48491 9.93370 5.68875 peak1006 4.11112 10.51290 6.19959 peak1007 6.69260 18.52489 13.59905 peak1008 4.44393 9.86713 5.64038 peak1009 4.99539 12.14135 7.65193 peak1010 6.77548 13.81271 9.17838
我发现那个peak100
很魔性,顺序各种变化!
可以看得出来,这个排序很失败,看起来应该是并没有按照第一列排序,因为我把sort命令的-t$'\t' -k1
参数去掉后,它们的排序并没有改变。
我还特意抓取第一列测试了一下,代码如下;
sed '1d' 57030_sort_peaks.peakAnn.xls |cut -f 1| sort -k1 |head cat 57030_sort_peaks.narrowPeak.bed |cut -f 4 | sort -k1 |head
下面是正常的排序。
sed '1d' 57030_sort_peaks.peakAnn.xls |cut -f 1| sort -k1 |head peak1 peak10 peak100 peak1000 peak1001 peak1002 peak1003 peak1004 peak1005 peak1006 cat 57030_sort_peaks.narrowPeak.bed |cut -f 4 | sort -k1 |head peak1 peak10 peak100 peak1000 peak1001 peak1002 peak1003 peak1004 peak1005 peak1006
查了一些资料,发现搞不明白为什么,就算了,反正我喜欢perl,就干脆用perl排序吧!
sed '1d' 57030_sort_peaks.peakAnn.xls |perl -alne '{$h{$F[0]}=$_}END{print $h{$_} foreach sort keys %h}' |cut -f 1-8 |head -12 peak1 chr1 191368 191561 + 6.45284 NA Intergenic peak10 chr1 2652248 2652508 + 4.24389 NA intron (NM_001242672, intron 4 of 6) peak100 chr1 66930370 66930598 + 5.48491 NA intron (NM_001077704, intron 1 of 13) peak1000 chr16 2513739 2513891 + 4.00941 NA promoter-TSS (NM_001198569) peak1001 chr16 2777421 2777567 + 4.54262 NA promoter-TSS (NM_007108) peak1002 chr16 2904949 2905145 + 4.40547 NA Intergenic peak1003 chr16 3059613 3059759 + 4.06547 NA promoter-TSS (NR_123723) peak1004 chr16 3282982 3283229 + 6.19764 NA promoter-TSS (NM_005741) peak1005 chr16 3457701 3457880 + 4.26274 NA promoter-TSS (NM_001317097) peak1006 chr16 4412132 4412322 + 4.11112 NA intron (NR_145128, intron 3 of 27) peak1007 chr16 4425589 4425781 + 6.69260 NA promoter-TSS (NM_001135110) peak1008 chr16 4538610 4538771 + 4.44393 NA promoter-TSS (NM_001199054) cut -f 4,7-9 57030_sort_peaks.narrowPeak.bed |perl -alne '{$h{$F[0]}=$_}END{print $h{$_} foreach sort keys %h}' |head -12 peak1 6.45284 12.81044 8.25841 peak10 4.24389 20.70743 15.66803 peak100 5.48491 9.93370 5.68875 peak1000 4.00941 10.78589 6.44162 peak1001 4.54262 12.95468 8.39192 peak1002 4.40547 9.16804 5.03032 peak1003 4.06547 8.97515 4.87603 peak1004 6.19764 16.08099 11.28833 peak1005 4.26274 9.14970 5.01487 peak1006 4.11112 10.51290 6.19959 peak1007 6.69260 18.52489 13.59905 peak1008 4.44393 9.86713 5.64038
虽然代码有点长,但也比耽搁大半个小时要划算。
读者朋友们可以帮忙看看这是为什么。
直接复制文章里面的数据就可以作为测试数据啦。
虽然我看了下面这个sort命令的解释,但是仍然没搞明白为什么~ sort命令的k选项大讨论
我还是觉得这些文件是有问题的,因为cut命令也没办法按照tab分隔符正常的取第几列。
但是最后也没找出来问题出在哪里。