gravatar for Marius

2 hours ago by


I was wondering if there is any data set that I can use as ground truth to compare which tool behaves better at counting the number of reads within a BAM file for instance. So far, I've ran bcftools mpileup which is supposedly only a wrapper for samtools mpileup (which I also ran) and a third tool called bam-readcount (

I ran all three tools on the same data set making sure that the quality thresholds/filtering parameters are the same for all three tools. However, when I look at the results and compare the read counts for the same chromosome position (i.e., unique ID based on chr_pos_base) I see that the counts reported highly vary.

I've attached some boxplots here which show the absolute difference in read counts per reported position for each pair of tools tested. BRC is bam-readcount, BCF is bcftools mpileup, and SAM is samtools mpileup. Performed the analysis on different tissues from the same sample (X-axis). What is intriguing from those plots, is that samtools mpileup vs. bcftools mpileup are not agreeing as much as expected. Which is weird, because the bcftools implementation of mpileup should be just a wrapper according to this. For instance, the variance in the absolute difference in counts between samtools mpileup vs. bam-readcount is a lot smaller.

Any pointers how I can assess which tool is closest to the "truth" are appreciated.


modified 2 hours ago

2 hours ago


Source link