After splitting multiallelic variants in my human multisample exonic germline VCF, the newly generated file contains many sites with '*' as the alternate allele. The command I used is:
bcftools norm --check-ref w -f GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna -m -any exonic_variants.vcf.gz >bcftools_norm.vcf
I am seeing this because '*' denotes a spanning deletion (gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-#article-comments), and the input VCF file (generated using GATK) has this:
chr1 2503910 . A C,*
and it got split into:
chr1 2503910 . A C
chr1 2503910 . A *
My question is how do I treat this scenario? Should I just remove sites with a '*' in the alternate allele? What is the best practice here?
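If dropping these records turns out to be acceptable, a minimal sketch of the filter I have in mind is below. It assumes an uncompressed, tab-delimited VCF that has already been split, so the ALT column (column 5) holds exactly one allele per record. The function name drop_star_alt is made up for illustration, and I use plain awk here rather than a bcftools filter expression.

```shell
# Hypothetical helper: drop records whose ALT column is exactly '*'.
# Header lines (starting with '#') are always kept.
# Assumes the VCF was already split, so ALT holds a single allele per record.
drop_star_alt() {
  awk -F'\t' '/^#/ || $5 != "*"' "$@"
}

# Demo on the two split records from above (tab-separated);
# only the "A C" record is printed.
printf 'chr1\t2503910\t.\tA\tC\nchr1\t2503910\t.\tA\t*\n' | drop_star_alt

# On the real file this would be:
#   drop_star_alt bcftools_norm.vcf > bcftools_norm.no_star.vcf
```

Note that this only matches after splitting: an unsplit record with ALT 'C,*' would pass through unchanged.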
My usual go-to approach is to concentrate only on high-quality biallelic variants (SNVs) without normalising, as multiallelic sites are generally considered to be sequencing errors (unless I want to study genetic mosaicism). Since that's not my aim in the current study, is it advisable to skip normalisation and move directly to variant filtration? Since I also have indels in the current study, I could consider only biallelic indels (-v indels -m2 -M2), which removes these sites with '*'.
PS: I am using the latest version of bcftools (v1.11).