I am calculating coverage over the CDS from a bam file using samtools depth and Rsamtools pileup. I expected the result to be the same, or at least very similar, but that hasn't quite been the case.

With samtools, I am counting bases covered by at least one read like so:

grep $gene_id $gff | grep CDS | gff2bed | 
samtools depth -aa -Q 10 -b - $bam_path | 
cut -f3 | grep -v 0 | wc -l

The output, prior to the line starting cut... looks like this:

❯ head ~/Desktop/tmp/coverage_test_out.txt                                                                                     
CP022326.1  650760  5
CP022326.1  650761  5
CP022326.1  650762  5
CP022326.1  650763  5
CP022326.1  650764  5
CP022326.1  650765  5
CP022326.1  650766  5
CP022326.1  650767  5
CP022326.1  650768  12
CP022326.1  650769  12
...

With Rsamtools::pileup, I am doing the following:

#'
#' @param bamfile_path path to a bam file 
#' @param gr a GRanges object with the range(s) of interest
#' @param strandedness one of c("stranded", "unstranded"). NOTE: forward only strand NOT currently configured
#' @param quality_threshold default 10L, equivalent to current HTSeq default
#'
#' @return number of covered bases in CDS
#'
calculateCoverage_test = function(bamfile_path, gr, strandedness, quality_threshold=10L,
                             coverage_threshold=1, lts_align_expr_prefix=Sys.getenv("LTS_ALIGN_EXPR_PREFIX"),
                             bamfile_suffix='"_sorted_aligned_reads_with_annote.bam"'){

  bamfile_index = paste0(bamfile_path, ".bai")

  p_param <- PileupParam(min_mapq = quality_threshold,
                         min_nucleotide_depth = coverage_threshold,
                         distinguish_strands=TRUE,
                         distinguish_nucleotides=FALSE)

  # create the scanBamParam object.
  gene_strand = as.character(unique(data.frame(gr)$strand))
  minus_strand_flag = NA
  if (strandedness=='reverse' & gene_strand == '+'){
    minus_strand_flag = TRUE
  } else if (strandedness=='reverse' & gene_strand == '-'){
    minus_strand_flag = FALSE
  }

  sbp = ScanBamParam(which=gr,
                     flag=scanBamFlag(isMinusStrand=minus_strand_flag,
                                      isSecondaryAlignment=FALSE,
                                      isNotPassingQualityControls=FALSE,
                                      isSupplementaryAlignment=FALSE))

  coverage_df = pileup(bamfile_path,
                       index = bamfile_index,
                       scanBamParam = sbp,
                       pileupParam = p_param)

# take only distinct positions, and create a boolean vector 
# based on whether the count for a given position is > 0. 
# Sum over the boolean vector
  sum(coverage_out %>% 
            distinct(pos, .keep_all=TRUE) %>% 
            pull(count) > 0)

coverage_df, before the sum() cmd, looks like this:

> head(coverage_out)
    seqnames    pos strand count              which_label
1 CP022326.1 650760      -     5 CP022326.1:650760-650894
2 CP022326.1 650761      -     5 CP022326.1:650760-650894
3 CP022326.1 650762      -     5 CP022326.1:650760-650894
4 CP022326.1 650763      -     5 CP022326.1:650760-650894
5 CP022326.1 650764      -     5 CP022326.1:650760-650894
6 CP022326.1 650765      -     5 CP022326.1:650760-650894

In both cases, I am then dividing by the length of the gene. Thankfully, the gene length does come out the same from both tools.

However, the RSamtools method returns 167 more covered bases than the samtools depth command. That is a difference between 98% covered and 89% covered, respectively. In looking at the browser, it is hard to tell -- this gene is supposed to be knocked out, and obviously is not. The difference between 89% and 98% covered is pretty close, to my eye. That said, I'd like to know what I'm doing that is causing the difference.

If anyone has any suggestions or ideas, I would appreciate it.



Source link