2 hours ago by Buxus

Greetings,

I'm trying to get a sequence for each sample in a multi-sample vcf by combining a reference sequence with the variants from the vcf.

The problem is that a few variants overlap with indels. Some are correctly (I believe) denoted by the "*" symbol, as per the VCF 4.3 specification; others are not. There are lines where the ALT allele is "*" but the record does not seem to overlap with any other variant. Below is an example of such an entry: the variant at position 14586 is called as "*" even though it overlaps neither the previous nor the subsequent variant in this sample.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  CAR10
MT  14559   .   TAAA    TA,T,TAA    4.35685e+06 .   AC=1,0,0
MT  14586   .   AATATATATATATATATATATAT AATATATAT,AATATATATAT,*,AATATATATATATAT,AATATATATATATATAT,AATAT 881243  .   AC=0,0,1,0,0,0
MT  15129   .   C   A   2.69538e+06 .   AC=1
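To make the "no overlap" claim concrete, here is a minimal Python sketch that looks for "*" records not covered by any upstream deletion. The records are hard-coded from the three lines above; the function names are my own, and it assumes deletions are left-anchored (ALT is a prefix of REF) and that AC reflects the alleles this sample carries.

```python
# Records copied from the example above: (POS, REF, ALT alleles, AC per ALT allele).
records = [
    (14559, "TAAA", ["TA", "T", "TAA"], [1, 0, 0]),
    (14586, "AATATATATATATATATATATAT",
     ["AATATATAT", "AATATATATAT", "*", "AATATATATATATAT",
      "AATATATATATATATAT", "AATAT"],
     [0, 0, 1, 0, 0, 0]),
    (15129, "C", ["A"], [1]),
]


def deleted_span(pos, ref, alt):
    """Return the 1-based inclusive (start, end) of reference bases removed
    by a deletion allele, or None if the allele is not a deletion.
    Assumption: the deletion is left-anchored, i.e. alt is a prefix of ref."""
    if alt == "*" or len(alt) >= len(ref):
        return None
    return (pos + len(alt), pos + len(ref) - 1)


def unexplained_star_records(records):
    """Return the positions of records that carry '*' (AC > 0) although no
    earlier called deletion overlaps the record's REF span."""
    spans = []        # deleted spans of alleles with AC > 0 seen so far
    unexplained = []
    for pos, ref, alts, acs in records:
        rec_end = pos + len(ref) - 1
        carries_star = any(a == "*" and n > 0 for a, n in zip(alts, acs))
        covered = any(start <= rec_end and end >= pos for start, end in spans)
        if carries_star and not covered:
            unexplained.append(pos)
        for alt, n in zip(alts, acs):
            span = deleted_span(pos, ref, alt) if n > 0 else None
            if span is not None:
                spans.append(span)
    return unexplained


print(unexplained_star_records(records))
```

With these three records the only called deletion (TAAA to TA at 14559) removes bases 14561-14562, so nothing reaches 14586 and that record is reported.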

Is this a problem with the vcf file or am I misunderstanding the format?

I'm using bcftools to do the following:

  1. subset the multi-sample vcf to get a single-sample vcf (bcftools view)
  2. normalise (bcftools norm)
  3. combine reference fasta and the vcf (bcftools consensus)

bcftools consensus skips the variants that overlap with the previous variant. I believe this is done by comparing the position of a variant with the end position of the previous variant.

In the case above it does not skip the variant at position 14586 but instead writes a "*" symbol into the output fasta file. Should it just use the reference instead?
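The behaviour I have in mind can be sketched in Python as follows. This is not bcftools' actual code, only my understanding of the overlap rule plus one possible answer to the question above (treat "*" as "keep the reference"). The function name `apply_variants` and the toy reference are made up for illustration.

```python
def apply_variants(reference, variants):
    """Apply sorted (pos, ref, alt) variants (1-based pos) to a reference string.

    A variant is skipped when it starts before the end of the previously
    applied variant. A '*' allele is treated as 'leave the reference as is',
    which is an assumption of this sketch, not documented bcftools behaviour.
    Returns (consensus sequence, list of skipped positions)."""
    pieces = []
    skipped = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for pos, ref, alt in variants:
        start = pos - 1
        if start < cursor:
            # Overlaps the previously applied variant: skip it.
            skipped.append(pos)
            continue
        if reference[start:start + len(ref)] != ref:
            raise ValueError(f"REF mismatch at position {pos}")
        if alt == "*":
            # Keep the reference bases; they are copied by a later slice.
            continue
        pieces.append(reference[cursor:start])
        pieces.append(alt)
        cursor = start + len(ref)
    pieces.append(reference[cursor:])
    return "".join(pieces), skipped


# Toy data: a deletion, a SNP overlapping it, a '*' record and a plain SNP.
toy_reference = "ACGTACGTAC"
toy_variants = [(2, "CGT", "C"), (3, "G", "T"), (6, "C", "*"), (8, "T", "A")]
sequence, skipped = apply_variants(toy_reference, toy_variants)
print(sequence, skipped)
```

In the toy run the SNP at position 3 is skipped because it falls inside the deletion applied at position 2, and the "*" record at position 6 leaves the reference base untouched instead of putting "*" into the sequence.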
