I need to extract accessory gene sequences (both DNA and amino acid) from Roary pangenome output. I have the locus_tag list and the corresponding gbk and gff files, which were generated with the Prokka pipeline. Is there a way, or an existing tool, to extract both the amino acid and DNA sequences from the gbk or gff files?
The Roary accessory gene locus_tag list and samples of the corresponding strain gbk and gff files are shown below.
locus_tag list.csv
locus_tag/Pcissicola19
xynB_1 BGDHLHFA_02833
smpB BGDHLHFA_01427
Pcissicola19.gbk
     gene            complement(39965..40852)
                     /gene="xynB_3"
                     /locus_tag="BGDHLHFA_02833"
     CDS             complement(39965..40852)
                     /gene="xynB_3"
                     /locus_tag="BGDHLHFA_02833"
                     /EC_number="3.2.1.37"
                     /inference="ab initio prediction:Prodigal:002006"
                     /inference="similar to AA sequence:UniProtKB:P36906"
                     /codon_start=1
                     /transl_table=11
                     /product="Beta-xylosidase"
                     /protein_id="Prokka:BGDHLHFA_02833"
                     /translation="MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASL
                     DAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQG
                     LVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLS
                     DAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQ
                     LQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEI
                     NVPMRSNDVVLLTLEPAAR"
Pcissicola19.gff
ID=BGDHLHFA_02833_gene;Name=xynB_3;gene=xynB_3;locus_tag=BGDHLHFA_02833
gnl|Prokka|BGDHLHFA_249 Prodigal:002006 CDS 39965 40852 . - 0 ID=BGDHLHFA_02833;Parent=BGDHLHFA_02833_gene;eC_number=3.2.1.37;Name=xynB_3;gene=xynB_3;inference=ab initio prediction:Prodigal:002006,similar to AA sequence:UniProtKB:P36906;locus_tag=BGDHLHFA_02833;product=Beta-xylosidase;protein_id=gnl|Prokka|BGDHLHFA_02833
For reference, my dataset includes both draft and complete genomes.
The expected DNA and amino acid sequence outputs are given below, respectively:
>BGDHLHFA_02833
tcagcgcgccgccggctccagcgtcagcagcaccacatcgttgctgcgcatcggcacgttgatctcgaccacgccatcggcgcccacacgcacacgccgatcctgttcgggcatcgtgcgtggcctgtcgcagctgcgtcaactggcgcggcgccaggtccttgggcatgcccatgtcgatgtacagcgacaacgggtcgttacgccgatagccggtcttgcgcacctgcagctggtacgtgccggcaggcacatgggtcatgcgcatgcgcagcggcgcgctgtcggtggcgggcacctgtttggtgtagaacggcgtattgctcaccgcctgcatgggctgctgccaattccacaccagtgcggcgacgcgcgtgccgtccactgcggcgagggaatgtgcgtcgctcagcggcacatcgcggcccttgagcgcatgcaagtacttgtaagcgaaccaggccggtttgcgaatgccttcgcgattcatcagcccaaacccgccgtggaagggcgtgggcggtgggccgggttcttcgaacagatcggtatagtccagtaactcatgccctgcaccaggccctgcacctgcttgagcttggtcaggatgtacggcgcgctgatgtaactgtcgtggacgaaatcgcgcggcgtatagctgctgctccactgggtgaagtacagcggcaggttgggaaatggcgaggcctggatctgcgcgcgcacgcgtcgcacatcgccgacgatggcatccagagatgcggacagcttggtgtcctgcttgccgttctcatcgagaaacccgccatccacgccataggtatgcgtggtgacgaagtcgatcggcagtttgtgcttggcaacgaaggccagcagttccggcac
>BGDHLHFA_02833
MPELLAFVAKHKLPIDFVTTHTYGVDGGFLDENGKQDTKLSASLDAIVGDVRRVRAQIQASPFPNLPLYFTQWSSSYTPRDFVHDSYISAPYILTKLKQVQGLVQGMSYWTYTDLFEEPGPPPTPFHGGFGLMNREGIRKPAWFAYKYLHALKGRDVPLSDAHSLAAVDGTRVAALVWNWQQPMQAVSNTPFYTKQVPATDSAPLRMRMTHVPAGTYQLQVRKTGYRRNDPLSLYIDMGMPKDLAPRQLTQLRQATHDAPEQDRRVRVGADGVVEINVPMRSNDVVLLTLEPAAR
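To make the request concrete, here is a minimal sketch of the kind of extraction I have in mind, using only the Python standard library. It assumes the Prokka gff file carries the contig sequences in a trailing ##FASTA section and that the columns are tab-separated. The function names (read_prokka_gff, cds_sequence, translate, extract_sequences) are hypothetical, and the handling of the first codon is a simplification, so this is an illustration rather than a tested tool.

```python
# Codon table for NCBI translation table 11 (bacterial). Its amino acid
# assignments match the standard code; only the start codons differ.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_prokka_gff(handle):
    """Return ({locus_tag: (contig, start, end, strand)}, {contig: sequence})."""
    features, contigs, name, in_fasta = {}, {}, None, False
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            in_fasta = True  # everything after this marker is FASTA
        elif in_fasta:
            if line.startswith(">"):
                name = line[1:].split()[0]
                contigs[name] = []
            elif name is not None:
                contigs[name].append(line.strip())
        elif line and not line.startswith("#"):
            cols = line.split("\t")
            if len(cols) == 9 and cols[2] == "CDS":
                attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
                tag = attrs.get("locus_tag")
                if tag:
                    features[tag] = (cols[0], int(cols[3]), int(cols[4]), cols[6])
    return features, {k: "".join(v) for k, v in contigs.items()}


def cds_sequence(feature, contigs):
    """Slice the CDS (GFF coordinates are 1-based, inclusive) in coding orientation."""
    contig, start, end, strand = feature
    seq = contigs[contig][start - 1:end].upper()
    if strand == "-":
        seq = seq.translate(COMPLEMENT)[::-1]  # reverse complement
    return seq


def translate(dna):
    """Translate a complete CDS; unknown codons become X, trailing stop is dropped."""
    protein = "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))
    protein = protein.rstrip("*")
    # Assumption: the CDS is complete, so an alternative start codon
    # (e.g. GTG or TTG) is written as M.
    return "M" + protein[1:] if protein else protein


def extract_sequences(handle, locus_tags):
    """Return {locus_tag: (dna, protein)} for every requested tag found in the GFF."""
    features, contigs = read_prokka_gff(handle)
    result = {}
    for tag in locus_tags:
        if tag in features:
            dna = cds_sequence(features[tag], contigs)
            result[tag] = (dna, translate(dna))
    return result
```

With the files above, this would be called as extract_sequences(open("Pcissicola19.gff"), ["BGDHLHFA_02833", "BGDHLHFA_01427"]). Note that the sketch returns the DNA in the coding direction; for genes on the minus strand, the reverse-complement step in cds_sequence can be skipped if the forward-strand sequence is wanted instead.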