Hi, I am really struggling to figure out how to obtain genome assembly IDs from genbank for a batch of protein accessions.
I have locally downloaded a set of several thousand bacterial genomes, which I am concatenating and using to score with a set of custom HMMs using HMMer. I have been using the _translated_cds.faa files from NCBI's FTP, when available, since they have more information in the header lines (compared to the _protein.faa files). However, neither of these have any information which links a protein hit back to the genome assembly from which it was obtained!
How can I map my hits back to the original assemblies they came from? This seems so crazy that the translated_CDS headers don't have any information about the actual genome/assembly/organism which they came from!!!
My best bet so far has been to look at each individual .faa file (before I'd concatenated them), since the assembly ID is present in the filename.
(For example: GCA_010091945.1_ASM1009194v1_translated_cds.faa >>> this supplies me with GCA_010091945.1, which I can then use to link to a table with all the metadata for the assemblies I'm searching he assembly) (built from the giant table "assembly_summary_genbank.txt" which I can find somewhere in NCBI's FTP site.)
Then I grab the prefix of the locus tag for the first header and map it to the assembly ID from the filename:
(eg, I grab "GS601_" from this header: >lcl|WVIE01000010.1_prot_NDJ17650.1_1 [locus_tag=GS601_10170] [protein=SPASM domain-containing protein] [protein_id=NDJ17650.1] [location=complement(79..1437)] [gbkey=CDS]))
(It would be better to use grep -H for all of the protein IDs of all the hits, but I can't get grep to work in a for loop, for some reason).
Does anyone have a simpler way of doing this?