gravatar for lordbleys

2 hours ago by


I am assembling a virus genome from IonTorrent reads based on a reference fasta sequence, using a combination of Minimap2, Samtools, bcftools and such.

The problem is: my final consensus fasta sequence is always named after the reference sequence.

I want to have the name of the fasta sequence to be the one of the original sequence file, not the reference.

Here is the script I use:

# align reads to reference sequence with minimap2
minimap2 -ax sr reference.fasta SAMPLE1_trimmed.fastq > SAMPLE1_aln.sam

# Convert sam to bam
samtools view -S -b SAMPLE1_aln.sam > SAMPLE1_aln.bam

# Sort the alignment
samtools sort SAMPLE1_aln.bam -o SAMPLE1_aln_sorted.bam

# Get consensus fastq file
samtools mpileup -uf reference.fasta SAMPLE1_aln_sorted.bam | bcftools call -c | vcf2fq > SAMPLE1_consensus.fastq

# Convert .fastq to .fasta
seqtk seq -aQ64 -q13 -n N SAMPLE1_consensus.fastq > SAMPLE1.fasta

In this example, the fasta sequence in the SAMPLE1.fasta file should be named ">SAMPLE1", but it is always named ">reference".

I checked quite a few forums and the manuals of each software, but I can't figure out what to do. Indeed most of the questions relating to this problem deal with multiple sequences inside an alignment (chromosomes, multiple samples, etc). In my case I just have 1 sample, 1 sequence (the final consensus from the alignment) so I don't know where the name is fetched. And since I am running this job in parallel for multiple samples (each producing a unique final sequence), I want to find a solution more elegant than simply renaming the sequences 1 by 1...

Help me Obi-Wan Kenobi, you are my only hope.


Source link