Filtering variants for generation of consensus sequence



I used bcftools to generate a VCF file with a command like this:

bcftools mpileup -f reference.fa -d 8000 alignments.bam | bcftools call -mv -Ov -o calls.vcf

Now I'd like to use bcftools consensus with a genomic reference to generate a consensus sequence. I am not interested in calling rare variants. This will be done for the whole genome, and I can tolerate some level of false positives. My VCF file contains information from several samples (multiple BAM files were used), but I only want to calculate the consensus for all samples together. I expect the samples to be identical in terms of genotype, and I am not trying to find differences between them in this experiment (multiple samples are used only to increase depth).

My question is about suggested starting parameters for filtering the VCF file before generating the consensus with bcftools consensus.

I am using the command below, from the bcftools manual, as a starting point, but I don't fully understand the meaning of the expressions. For example, what are RPB and DV?

bcftools filter -sLowQual -g3 -G10 \
    -e'%QUAL<10 || (RPB<0.1 && %QUAL<15) || (AC<2 && %QUAL<15) || %MAX(DV)<=3 || %MAX(DV)/%MAX(DP)<=0.3' calls.vcf

My criteria for what I would like to include in my consensus are as follows:

  1. A depth of at least 10 (this refers to the cumulative depth from all samples, so
    if I have 10 samples, each contributing only DP=1 at a position, this criterion would be met)
  2. A variant frequency of at least 0.7
  3. Some quality filtering (not sure what would be a good value to start).
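
To make these criteria concrete, here is a minimal sketch of the filter I have in mind, written in plain awk rather than as a bcftools expression. It assumes each record carries an INFO/DP4 tag (ref-forward, ref-reverse, alt-forward and alt-reverse high-quality base counts for the site, pooled over all samples), so the depth it sees can be lower than the raw DP. The function name filter_for_consensus and the QUAL cutoff of 20 in the usage line are my own placeholders, not recommendations from the manual.

```shell
# Hypothetical helper: pass header lines through and keep only records that
# meet all three criteria. Reads VCF text on stdin, writes VCF text on stdout.
#   $1 = minimum pooled depth, $2 = minimum variant fraction, $3 = minimum QUAL
# Assumption: depth and variant fraction are derived from INFO/DP4.
filter_for_consensus() {
    awk -F'\t' -v min_dp="$1" -v min_af="$2" -v min_qual="$3" '
        /^#/ { print; next }    # header lines are kept unchanged
        {
            # Records without a DP4 tag cannot be evaluated, so they are dropped.
            if (!match($8, /DP4=[0-9]+,[0-9]+,[0-9]+,[0-9]+/)) next
            split(substr($8, RSTART + 4, RLENGTH - 4), c, ",")
            depth = c[1] + c[2] + c[3] + c[4]    # pooled high-quality depth
            alt   = c[3] + c[4]                  # reads supporting any ALT allele
            if ($6 != "." && $6 + 0 >= min_qual &&
                depth > 0 && depth >= min_dp && alt / depth >= min_af)
                print
        }'
}

# Example (commented out so the block does not require calls.vcf to exist):
# filter_for_consensus 10 0.7 20 < calls.vcf > filtered.vcf
```

At multi-allelic sites DP4 lumps all ALT alleles together, so this sketch would overestimate the frequency of any single allele there.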

It would be great to have some tips from you on how to set up this filtering.







