Posted 9 hours ago by m.radz

I need to extract the headers from a multi-FASTA file that is used as a database for metagenomic analysis. The FASTA file is structured as follows:

>NW_002197112.1 Penicillium marneffei ATCC 18224 scf_1107713384177, whole genome shotgun sequence
GCCTTAAAATGCCGCTTCCCAGATCTGCGCCGAAGAGCAATCCATCTCCTCTCCAGCCCCAATGCAGCAACTGCTAACGG
CAGTGCGACGTGCGGGGTGAATTTCAGCGGTTGCTATCGACTTGTGCCATCGCAGCGTTTTCGCGTCCACGGTCGCCGCC
GCATGCTCCATGCACGATATGGCTGGTCGGATGCTAGTTGTGCTC

>NW_002197111.1 Penicillium marneffei ATCC 18224 scf_1107713383857, whole genome shotgun sequence
TACTGCTTTGTGGAACATCGCCCTTGTGGAGATCTCCCTCACGCTGGATGTTGAAAGACGCAGAACAGTTGGCACAGCCA
ATTTAGAATGCCTGATCAAGACGCATCGCCACATCCAGGCAGGTGCGATTCCTCTCTTATAAATAAATATTTTCAACGGC
ATCTGGAGAACTCATCAACTTGCAGTTGCTCATCATTATCTCGGTCAT

I want to extract only the identifiers and the taxonomy and write them to a taxonomy.txt file structured as shown below, with the taxonomic levels separated by semicolons and the identifier in the final column:

Saccharomycetaceae;Kluyveromyces;lactis;CR382121.1

Saccharomycetaceae;Kluyveromyces;lactis;CR382122.1
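One possible starting point is the Python sketch below. It assumes every header looks like the ones above: the accession first, then genus and species as the next two whitespace-separated words. The family is not present in these headers, so the sketch takes it from a hypothetical `family_by_genus` dictionary that would have to be filled in from another source (for example, a taxonomy database), and falls back to the placeholder `unclassified`. The function names `parse_header` and `fasta_to_taxonomy` are made up for illustration.

```python
def parse_header(line, family_by_genus=None):
    """Convert one FASTA header line into 'Family;Genus;species;accession'.

    Assumes the layout '>ACCESSION Genus species ...'. The family is looked
    up in family_by_genus (a hypothetical, user-supplied dict) because the
    header itself does not contain it.
    """
    fields = line.lstrip(">").split()
    if len(fields) < 3:
        raise ValueError(f"Unexpected header layout: {line!r}")
    accession, genus, species = fields[0], fields[1], fields[2]
    family = (family_by_genus or {}).get(genus, "unclassified")
    return ";".join([family, genus, species, accession])


def fasta_to_taxonomy(fasta_path, out_path, family_by_genus=None):
    """Write one taxonomy line per FASTA record; sequence lines are skipped."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                out.write(parse_header(line, family_by_genus) + "\n")
```

A call such as `fasta_to_taxonomy("database.fasta", "taxonomy.txt", {"Kluyveromyces": "Saccharomycetaceae"})` (the input file name is a placeholder) would then write one line per record. Headers whose organism name does not sit in the second and third fields would need extra handling.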


