gravatar for biobaker

2 hours ago by

Hello! I have been banging my head against a wall for the past 2 days trying to get a fasta file of DNA sequences from a list of protein accessions across a broad range of organisms. I have roamed biostars for questions asking the same thing (it should be a common enough task!) but all of the answers I've found don't get me quite what I want.

Here are a few of the protein accessions for reference (I have ~400 in total):

WP_011071901.1
WP_011071902.1
WP_011071903.1
WP_011789901.1
WP_014610284.1
WP_101053336.1
WP_101053338.1

I would like the DNA coding sequence associated with these proteins. Of all the ways I've tried, these seem to be the 2 most promising approaches thus far.

Approach 1: Drawn directly from this answer on a previous post, I have tried to use eutils to link a protein query with the nuccore database. However, as stated in the linked answer, this returns ALL coding sequences from the nucleotide sequence, not just the CDS for the protein in question. For example, for a single protein accession, I would write:

esearch -query "WP_011071901.1" -db protein |elink -target nuccore|efetch -format fasta_cds_na

but this returns all CDS, not just the CDS for WP_011071901.1. Also, I would want to re-work this in a way where I can do a batch extraction of all my proteins, not just 1 at a time.

Approach 2: First find start and stop positions of protein coding sequence in associated nucleotide, then get DNA sequence from there.

I have successfully extracted the nucleotide accession and the start and stop positions already by making use of the code posted here.

Then, I know I can access the nucleotide sequence one at a time by typing

efetch -db nuccore -id NC_004347.2 -format fasta -seq_start 1859348 -seq_stop 1861363

but this is extremely inefficient given the number of proteins I am working with. I can also see this becoming an issue as some of these protein accessions are associated with more than 1 genome/nucleotide. I know which Nucleotide accession I would want to use for each associated protein if that helps for this approach.

Any guidance or help would be appreciated, thank you!



Source link