Hi,
I was wondering if there is any data set that I can use as ground truth to compare which tool behaves better at counting the number of reads within a BAM file for instance. So far, I've ran bcftools mpileup
which is supposedly only a wrapper for samtools mpileup
(which I also ran) and a third tool called bam-readcount
(github.com/genome/bam-readcount).
I ran all three tools on the same data set making sure that the quality thresholds/filtering parameters are the same for all three tools. However, when I look at the results and compare the read counts for the same chromosome position (i.e., unique ID based on chr_pos_base
) I see that the counts reported highly vary.
I've attached some boxplots here which show the absolute difference in read counts per reported position for each pair of tools tested. BRC
is bam-readcount
, BCF
is bcftools mpileup
, and SAM
is samtools mpileup
. Performed the analysis on different tissues from the same sample (X-axis). What is intriguing from those plots, is that samtools mpileup
vs. bcftools mpileup
are not agreeing as much as expected. Which is weird, because the bcftools
implementation of mpileup
should be just a wrapper according to this. For instance, the variance in the absolute difference in counts between samtools mpileup
vs. bam-readcount
is a lot smaller.
Any pointers how I can assess which tool is closest to the "truth" are appreciated.