One of the recurring questions on biostars is "How can I create a consensus sequence from my bam file?" A variation of these question is "How to get fasta out of bam file?".
The rational behind this question is usually, to get the sequence for a given region of interest of my sample, so that one can do a multiple alignment between different samples/species.
The answer to this question is a simple two-step workflow:
- Call variants
- Include the variants into the reference sequence
Do the variant calling step with your favorite program/workflow, e.g. with
$ bcftools mpileup -Ou -f ref.fa input.bam | bcftools call -Ou -mv | bcftools norm -f ref.fa -Oz -o output.vcf.gz
In the end, one needs a valid, sorted
vcf file which is compressed with
bgzip and indexed with
I also recommend to normalize your vcf file (In the above command this is done by
vcf isn't already compressed you can do this with:
$ bgzip -c output.vcf > output.vcf.gz
And indexing is done by:
$ tabix output.vcf.gz
Now we are ready to create our consensus sequence. The tool of choice is bcftools consensus.
If one like to create a complete new reference genome, based on the called variants, this is done by:
$ bcftools consensus -f ref.fa output.vcf.gz > out.fa
If you are interested in only a given region:
$ samtools faidx ref.fa 8:11870-11890 | bcftools consensus output.vcf.gz -o out.fa
You can also create consensus sequences for multiple regions, if you provide a
bed file to
$ samtools faidx -r regions.bed ref.fa | bcftools consensus output.vcf.gz -o out.fa
Have fun 🙂
Edit by GenoMax: Added a missing
- in first
bcftools norm command.