I have a BED file with some elements that are located in overlapped intervals. I added unique IDs, and length in the 4th and 5th columns, respectively.

cat sorted_intervals.bed
NC_007117.7   52869911  52875049    NC_007117.7.27   5138
NC_007117.7   52869911  52870819    NC_007117.7.28   908
NC_007117.7   52869929  52870807    NC_007117.7.29   878
NC_007117.7   52869932  52870798    NC_007117.7.30   866
NC_007117.7   52869932  52870795    NC_007117.7.31   863
NC_007117.7   52869932  52870780    NC_007117.7.32   848
NC_007117.7   52869938  52870804    NC_007117.7.33   866
NC_007117.7   52869956  52870795    NC_007117.7.34   839
NC_007117.7   52874159  52875088    NC_007117.7.35   929
NC_007117.7   52874159  52875088    NC_007117.7.36   929
NC_007117.7   52874159  52875088    NC_007117.7.37   929
NC_007117.7   52874159  52875088    NC_007117.7.38   929
NC_007117.7   52874159  52875082    NC_007117.7.39   923
NC_007117.7   52874159  52875088    NC_007117.7.40   929
NC_007117.7   52874159  52875079    NC_007117.7.41   920
NC_007117.7   52874159  52875088    NC_007117.7.42   929
NC_007117.7   52874159  52875088    NC_007117.7.43   929
NC_007117.7   52874162  52875085    NC_007117.7.44   923
NC_007117.7   52874192  52875052    NC_007117.7.45   860
NC_007117.7   52874192  52875079    NC_007117.7.46   887
NC_007117.7   52874192  52875052    NC_007117.7.47   860

My goal is to find which elements are located in overlapping genomic regions and then select the one that spans the longest length. I also want to keep those elements that do not overlap with any others. To do so, in a previous analysis, I used the following command:

while read LINE ; do grep -wE "$LINE" sorted_intervals.bed | sort -k5,5nr | head -1 ; done < <(bedtools merge -i sorted_intervals.bed -c 4 -o count,collapse | awk '{print $5}' | sed 's/,/|/g')

which successfully outputs the element NC_007117.7.27 for the above example.

Now I want to repeat the same analysis and set up a cutoff value (let's say 10%) for the overlapping regions. That is, if two elements have overlap >=10%, keep the longest one; but if two elements have overlap <10%, keep both of them. Same as before, I also would like to keep elements that do not overlap with any other elements.

I have found similar questions, such as:

Bed file: merging intervals that overlap a certain percentage

multiple bed - merge regions IF overlapping more than xx percent of size

However, I am still not very sure how to do it, because when I run the following bedops command it gives me three different overlaps for the above example instead of just one as in bedtools merge. I also tried the option --fraction-either and it produced the same results. I think I might be misunderstanding something.

bedmap --count --echo-map-range --echo-map-id --fraction-both 0.1 --delim 't' sorted_intervals.bed | uniq | grep "NC_007117.7.27"
21  NC_007117.7 52869911    52875088    NC_007117.7.28;NC_007117.7.27;NC_007117.7.29;NC_007117.7.32;NC_007117.7.31;NC_007117.7.30;NC_007117.7.33;NC_007117.7.34;NC_007117.7.41;NC_007117.7.39;NC_007117.7.35;NC_007117.7.36;NC_007117.7.37;NC_007117.7.38;NC_007117.7.40;NC_007117.7.42;NC_007117.7.43;NC_007117.7.44;NC_007117.7.45;NC_007117.7.47;NC_007117.7.46
8   NC_007117.7 52869911    52875049    NC_007117.7.28;NC_007117.7.27;NC_007117.7.29;NC_007117.7.32;NC_007117.7.31;NC_007117.7.30;NC_007117.7.33;NC_007117.7.34
14  NC_007117.7 52869911    52875088    NC_007117.7.27;NC_007117.7.41;NC_007117.7.39;NC_007117.7.35;NC_007117.7.36;NC_007117.7.37;NC_007117.7.38;NC_007117.7.40;NC_007117.7.42;NC_007117.7.43;NC_007117.7.44;NC_007117.7.45;NC_007117.7.47;NC_007117.7.46      

Does anyone know what would be the best way to proceed?

Many thanks,

EDIT: I don't think I need the two elements to have a reciprocal overlapping percentage of 10%, just the smallest one among the two being compared. That is, if the smallest of the two elements has an overlap that is less than 10%, then both elements should be kept.

Source link