gravatar for RBright21

3 hours ago by

When using bedtools genomecov max function I get values reported for some chromosomes/reference sequences which are above the max depth set. This is not consistent between files and always occurs on the last base or two of the reference genome. The number of chromosomes affected varies between bam files.

For example using code:

bedtools genomecov -ibam In.bam -bga -max 1 > out.txt

for this particular file I got all 734562 lines reported as either 0 or 1 (covered vs uncovered) with the exception of 7 lines:

NC_011203.1 9997    9998    1
NC_011203.1 9998    9999    1
NC_011203.1 9999    10000   1
MH558113.1  35932   35933   7
AC_000008.1 35937   35938   8
NC_001405.1 35936   35937   8
AC_000017.1 36000   36001   9
HQ003817.1  35817   35818   9
HQ413315.1  35758   35759   9
MH121097.1  35750   35752   92

I have looked at the bam on IGV and there are 92 aligned reads at bases 35750 to 35752 for MH121097.1and the reference genome is 35752bp in length.

Does anyone know how I can fix this or where I can get some help? Should I report as an issue on GitHub (I was wary of doing so incase It was something I was doing wrong)?

For context: my bam file contains reads aligned against 104 different reference sequences using bowtie2 all. I am trying to find a way to write a script which could easily tell me which of these reference genomes are completely covered by my reads to look for mixed infections without having to do it manually. I think the way I am trying to do it is quite convoluted though and I may be missing a simpler solution. My steps are:

  1. bedtools genomecov -ibam In.bam -bga -max 1 > out.txt
  2. awk -F"t" '!seen[$1, $4]++' out.txt > awk.txt awk to keep lines which are unique in both the chromosome name and the depth - so keep chromosome name with coverage =0 and chromosome name with coverage =1 meaning I am left with one instance of covered and one instance of an area uncovered
  3. awk '{print $1}' awk.txt > awk2.txt remove all columns except the chromosome name
  4. uniq -u awk2.txt > uniq.txt use uniq to identify unique chromosome names (those which only have an entry with a 1 representing entirely covered genomes)

This works very well when I don't get these odd max depths which are over the defined max value of 1 but if anyone can think of a simpler way I can achieve this then I would also be very grateful for your help


Source link