gravatar for scottbrainard

2 hours ago by

I want to remove sites from a VCF file with an overabundance or an underabundance of heterozygous genotypes. I have 8940 markers to start with:

bcftools view unfiltered.bcf.gz -H | wc -l # 8940

When I exclude sites with a frequency of hets > 0.9, this number goes down a bit:

bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") > 0.9' | bcftools view -H | wc -l #8905

When I exclude sites with a frequency of hets < 0.1, suddenly, almost every marker is removed:

bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") < 0.1' | bcftools view -H | wc -l #9

This does not reflect the actual distribution of heterozygous frequencies in this population (validated using bcftools stats). Indeed, when I perform what to my mind is an equivalent command (i.e., include only sites with a frequency of hets >= 0.1) I get a more reasonable reduction:

bcftools filter unfiltered.bcf.gz -i 'F_PASS(GT="het") >= 0.1' | bcftools view -H | wc -l #8909

Any explanation for the difference in output between these later two commands would be appreciated.



Source link