Thanks very much for providing the VCF that you're using. For others, it's

This VCF is corrupt and does not conform to the VCF specification. It has the following issues:

  • whitespace in 'INFO' column
  • contig '2' not defined in header
  • 'FORMAT/GL' should be declared as Number
  • 'FORMAT/PP' not defined in header
  • 'FORMAT/BD' not defined in header
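
If you want to see the undeclared 'FORMAT' keys for yourself without relying on a validator, here is a rough sketch using only awk and sort. The toy VCF below is made up for illustration (it is not taken from the GEUVADIS file), and the variable names are my own.

```shell
# Hypothetical toy VCF: GT is declared in the header, PP and BD are not.
tmpdir=$(mktemp -d)
{
  printf '##fileformat=VCFv4.1\n'
  printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
  printf '2\t100\t.\tA\tG\t.\tPASS\tAF=0.1\tGT:PP:BD\t0|1:0.9:1\n'
} > "$tmpdir/toy.vcf"

# Collect FORMAT IDs declared in the header, then report every key that
# appears in column 9 (FORMAT) of a record but was never declared.
undeclared=$(awk -F'\t' '
  /^##FORMAT=<ID=/ {
    id = $0
    sub(/^##FORMAT=<ID=/, "", id)   # drop everything before the ID
    sub(/[,>].*/, "", id)           # drop everything after the ID
    declared[id] = 1
    next
  }
  /^#/ { next }                     # skip the remaining header lines
  {
    n = split($9, keys, ":")
    for (i = 1; i <= n; i++)
      if (!(keys[i] in declared) && !seen[keys[i]]++) print keys[i]
  }
' "$tmpdir/toy.vcf" | sort)

echo "$undeclared"
```

On the real file you would pipe `zcat` output into the same awk program in place of the toy file.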

I was able to fix the VCF with the commands below. Unfortunately, the 'FORMAT' field is a complete mess, so I made an 'executive' decision to strip it, leaving just 'FORMAT/GT'. This loses the other per-sample information, but leaves you with a validated VCF for anything else that you may want to do.

zcat GEUVADIS.chr2.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz |
  sed 's/ damaging/_damaging/g' |
  bgzip > test.vcf.gz ;
tabix -p vcf test.vcf.gz ;
bcftools annotate -x 'FORMAT' --force test.vcf.gz -Oz > test.fixed.vcf.gz ;

This will still show the warnings relating to 'FORMAT', but --force tells bcftools to carry on despite them instead of stopping. Also, by removing the problematic 'FORMAT' tags, we avoid the subsequent segmentation fault that occurs.
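
Before running the bcftools step, it can be worth confirming that the sed substitution above really removed every space from the 'INFO' column. A minimal sketch, assuming a made-up single record (the 'POLYPHEN' key and the file names are hypothetical) and a small helper function of my own:

```shell
# Hypothetical toy record with a space inside INFO (column 8).
tmpdir=$(mktemp -d)
printf '2\t100\t.\tA\tG\t.\tPASS\tPOLYPHEN=possibly damaging\tGT\t0|1\n' \
  > "$tmpdir/toy_record.vcf"

# Same substitution as in the commands above.
sed 's/ damaging/_damaging/g' "$tmpdir/toy_record.vcf" \
  > "$tmpdir/toy_record.fixed.vcf"

# Count data lines whose INFO column still contains a space.
count_info_spaces() {
  awk -F'\t' '!/^#/ && $8 ~ / / { n++ } END { print n + 0 }' "$1"
}

before=$(count_info_spaces "$tmpdir/toy_record.vcf")
after=$(count_info_spaces "$tmpdir/toy_record.fixed.vcf")
echo "before=$before after=$after"
```

If the count is not zero on the real file, some other value in 'INFO' contains a space and needs its own substitution.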

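With the 'FORMAT' problem out of the way, you can then clear the existing ID column and annotate against dbSNP with SnpSift:

bcftools annotate -x ID test.fixed.vcf.gz -Oz > test.fixed.noID.vcf.gz ;
java -jar SnpSift.jar annotate dbSnp144.vcf test.fixed.noID.vcf.gz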

Kevin


