3 hours ago by Rubal, Germany

I have lists of contig coordinates for several assemblies and would like to create a BED-format mask to exclude variants in the first and last 1 Mb of each contig.

Example lines from the contig files:

Chr1 1 123000000
Chr2 1 11435255
AEG1.2 1 2335

I could do something simple with awk, like this:

awk 'BEGIN{OFS="\t"} {print $1, $2 + 1000000, $3 - 1000000}' contig.bed > filter_ends.bed

This would be a positive mask of regions to keep, and I'd prefer a negative mask (though that's not essential).
It would also not behave properly for contigs shorter than 2,000,000 bp: it would return nonexistent or negative coordinates. For example, the AEG1.2 line above would come out as 1000001 to -997665.

Effectively, I will be excluding those contigs anyway, because the filtering from both ends will overlap. I could do this in two steps, but as I have many assemblies to run over, does anyone know a good approach for this? For example, I suppose one could first remove the contigs shorter than 2,000,000 bp and then run the awk command.
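
A single-pass version of that idea might look like the sketch below. It writes a negative mask directly: two flank intervals for long contigs, and the whole contig when the two flanks would meet or overlap. The function name `make_end_mask`, the `flank` variable, and the file names in the loop are hypothetical. It also assumes the input coordinates are 1-based and inclusive (as the `1` starts in the example lines suggest) and converts them to BED's 0-based, half-open convention; if the input is already 0-based, the `- 1` terms would need to be dropped.

```shell
# Hypothetical helper: print a negative (exclusion) mask for one contig file.
# Assumed input: whitespace-separated name, start (1-based), end (inclusive).
# Usage: make_end_mask FILE [FLANK_BP]   (flank size defaults to 1000000)
make_end_mask() {
    awk -v flank="${2:-1000000}" 'BEGIN { OFS = "\t" }
    NF >= 3 {
        len = $3 - $2 + 1
        if (len <= 2 * flank) {
            # The two flanks meet or overlap, so mask the whole contig.
            print $1, $2 - 1, $3
        } else {
            # BED start is 0-based, BED end is exclusive.
            print $1, $2 - 1, $2 - 1 + flank   # first flank
            print $1, $3 - flank, $3           # last flank
        }
    }' "$1"
}

# Example loop over many assemblies (file naming is an assumption):
# for f in *.contigs.txt; do
#     make_end_mask "$f" > "${f%.txt}.end_mask.bed"
# done
```

Because short contigs are masked in full rather than dropped, no separate pre-filtering step should be needed, and the output never contains negative or inverted intervals under the assumptions above.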

Thanks in advance for your suggestions.
