Fast parsing of GenBank files


Hi,

Does anyone know of a fast parser for GenBank files that contain hundreds of entries (e.g., all vertebrate_mammalian proteins from RefSeq)?
I've tried the readGenBank function from the R package genbankr and the gbRecord function from biofiles, and both are very slow and impractical for GenBank files around 100 MB in size.

My goal is simply to extract, for each protein, its transcript accession, gene accession, taxonomy ID, and all of its conserved domain IDs (CDDs).

genbankr does have a faster parsing function, parseGenBank, but it simply returns all features in a single array, from which it does not seem possible to map them back to their respective proteins.
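To make the goal concrete, here is a minimal sketch of the kind of streaming extraction I have in mind, written in Python rather than R and using only the standard library. It reads the file line by line and keeps one small dictionary per record, so each feature stays attached to its protein. The function name `iter_records` and the dictionary keys are hypothetical. It also assumes the usual RefSeq GenPept layout: the protein accession on the `VERSION` line, the transcript on a `DBSOURCE    REFSEQ: accession ...` line, and the taxonomy, gene and domain IDs as `/db_xref="taxon:..."`, `/db_xref="GeneID:..."` and `/db_xref="CDD:..."` qualifiers. These assumptions should be checked against the actual files.

```python
import re

# Assumed GenPept conventions (verify against your own files):
# the transcript accession appears as "DBSOURCE    REFSEQ: accession NM_..."
DBSOURCE_RE = re.compile(r"^DBSOURCE\s+REFSEQ: accession (\S+)")
# taxonomy, gene and conserved-domain IDs appear as /db_xref qualifiers
XREF_RE = re.compile(r'/db_xref="(taxon|GeneID|CDD):([^"]+)"')


def iter_records(lines):
    """Yield one dict per record from an iterable of GenPept text lines.

    `lines` can be an open file handle, so the whole file is never
    held in memory at once.
    """
    rec = None
    for line in lines:
        if line.startswith("LOCUS"):
            # A new record starts: reset the per-protein state.
            rec = {"protein": None, "transcript": None,
                   "gene_id": None, "taxon_id": None, "cdd": []}
        elif rec is None:
            continue  # skip anything before the first LOCUS line
        elif line.startswith("VERSION"):
            parts = line.split()
            if len(parts) > 1:
                rec["protein"] = parts[1]
        elif line.startswith("DBSOURCE"):
            match = DBSOURCE_RE.match(line)
            if match:
                rec["transcript"] = match.group(1)
        elif line.startswith("//"):
            # End-of-record marker: hand the finished record to the caller.
            yield rec
            rec = None
        else:
            match = XREF_RE.search(line)
            if match:
                key, value = match.groups()
                if key == "taxon":
                    rec["taxon_id"] = value
                elif key == "GeneID":
                    rec["gene_id"] = value
                elif value not in rec["cdd"]:
                    # The same CDD ID can occur on several features; keep it once.
                    rec["cdd"].append(value)
```

Passing an open file handle to `iter_records` and looping over the result would give one dictionary per protein, which could then be written out as a table and loaded into R. Something along these lines, ideally in R, is what I am looking for.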


Genbank


parsing


R
