I need to extract the headers from a multi-FASTA file that is used as a database for metagenomic analysis. The structure of the FASTA file is as follows:
>NW_002197112.1 Penicillium marneffei ATCC 18224 scf_1107713384177, whole genome shotgun sequence
GCCTTAAAATGCCGCTTCCCAGATCTGCGCCGAAGAGCAATCCATCTCCTCTCCAGCCCCAATGCAGCAACTGCTAACGG
CAGTGCGACGTGCGGGGTGAATTTCAGCGGTTGCTATCGACTTGTGCCATCGCAGCGTTTTCGCGTCCACGGTCGCCGCC
GCATGCTCCATGCACGATATGGCTGGTCGGATGCTAGTTGTGCTC
>NW_002197111.1 Penicillium marneffei ATCC 18224 scf_1107713383857, whole genome shotgun sequence
TACTGCTTTGTGGAACATCGCCCTTGTGGAGATCTCCCTCACGCTGGATGTTGAAAGACGCAGAACAGTTGGCACAGCCA
ATTTAGAATGCCTGATCAAGACGCATCGCCACATCCAGGCAGGTGCGATTCCTCTCTTATAAATAAATATTTTCAACGGC
ATCTGGAGAACTCATCAACTTGCAGTTGCTCATCATTATCTCGGTCAT
From each header I need only the sequence identifier and the taxonomy, written to a taxonomy.txt file with the structure shown below: one line per sequence, taxonomic levels separated by semicolons, and the sequence identifier in the final column:
Saccharomycetaceae;Kluyveromyces;lactis;CR382121.1
Saccharomycetaceae;Kluyveromyces;lactis;CR382122.1
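
To make the target concrete, here is a minimal Python sketch of the transformation, not a definitive solution. It assumes every header follows the pattern shown above, with the identifier as the first whitespace-separated token and the genus and species as the second and third. The headers do not contain the family, so that level has to come from somewhere else: `family_of` is a hypothetical genus-to-family dictionary that you would fill in yourself, and `"unclassified"` is a placeholder I chose for genera missing from it. The function names are also mine.

```python
def parse_header(line):
    """Return (identifier, genus, species) from one FASTA header line.

    Assumes the layout '>ID Genus species ...'; missing fields become 'unknown'.
    """
    fields = line[1:].split()
    fields += ["unknown"] * (3 - len(fields))  # pad short headers
    return fields[0], fields[1], fields[2]


def taxonomy_lines(handle, family_of=None):
    """Yield 'family;genus;species;identifier' for every header in handle.

    handle is any iterable of lines (an open file works).
    family_of is a hypothetical genus -> family lookup supplied by the caller.
    """
    family_of = family_of or {}
    for line in handle:
        if not line.startswith(">"):
            continue  # skip sequence lines
        identifier, genus, species = parse_header(line)
        family = family_of.get(genus, "unclassified")  # placeholder value
        yield ";".join([family, genus, species, identifier])


def write_taxonomy(fasta_path, out_path, family_of=None):
    """Read a multi-FASTA file and write one taxonomy line per sequence."""
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for row in taxonomy_lines(fasta, family_of):
            out.write(row + "\n")
```

Calling `write_taxonomy("db.fasta", "taxonomy.txt", {"Kluyveromyces": "Saccharomycetaceae"})` (the file names are placeholders) would write lines in the format above for any Kluyveromyces headers. The Penicillium headers in my sample would get `unclassified` in the first column until their genus is added to the dictionary.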