I want to remove sites from a VCF file with an overabundance or an underabundance of heterozygous genotypes. I have 8940 markers to start with:
bcftools view unfiltered.bcf.gz -H | wc -l # 8940
When I exclude sites with a frequency of hets > 0.9, this number goes down a bit:
bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") > 0.9' | bcftools view -H | wc -l #8905
When I exclude sites with a frequency of hets < 0.1, suddenly, almost every marker is removed:
bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") < 0.1' | bcftools view -H | wc -l #9
This does not reflect the actual distribution of heterozygous frequencies in this population (validated using bcftools stats
). Indeed, when I perform what to my mind is an equivalent command (i.e., include only sites with a frequency of hets >= 0.1) I get a more reasonable reduction:
bcftools filter unfiltered.bcf.gz -i 'F_PASS(GT="het") >= 0.1' | bcftools view -H | wc -l #8909
Any explanation for the difference in output between these later two commands would be appreciated.