Hi All,

I'm trying to pull the sequences for a list of 3-400,000 accessions from the NCBI protein database via the biopython entrez api. For my specific use case, I'm processing them in 20,000 groups of around 1000 sequences (for context, each group is an smcog, but that doesn't really matter here), but for the purposes of this question I'm keen to just get as many as possible in one go. For example, if I can find a way to extract 100,000 sequences, I could get the sequences in one go and split them into groups later.

I've heard people talking about extracting e.g. 100,000 sequences with EFetch (www.biostars.org/p/280872/), but I'm struggling to get more than 200 sequences at a time. I've tried various methods, including making a http POST as suggested by the docs, but they're all hitting similar limits beyond around 200 accessions. I think this is because there seems to be a limit on the length of URLs that can be taken by browsers (stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url-in-different-browsers), which presumably impacts the length of the URLs that can be accepted by http POST calls. But if this is true, how can someone manage to get 100,000 sequences?

My latest attempt, below, is iteratively calling EPost to make a list of query keys for a single web env, with each key corresponding to 200 accessions (the limit I've been running up against). So for a group of 1000 accessions, the final output will be a web env and a list of query keys e.g. ['1','2','3','4','5']. My question is, can I join these query keys into a single value for the query_key parameter in EFetch? This would be more efficient than going through each key in a for loop I think, and the url size being generated should be fairly low as it is a list of keys rather than 1000 accessions. I've looked here dataguide.nlm.nih.gov/eutilities/history.html, and have tried building a query key string term with %23 or # with the AND key word, but neither work (HTTP Error 500: Internal Server Error), although the link implies what I want to do is possible.

Also, please let me know if this all sounds hideously inefficient, most of this is coming from the docs so there is probably a much better way of doing this. Due to the nature of the accessions (they're from 2012 and NCBI has since done a massive re-org with non-redundant ref seqs, so the local databases may not have these accessions) and my crappy laptop, I'm trying to avoid downloading the database locally, although this might be better. I'd also really like to get this api stuff working if I can, as a learning experience, rather than give up.

This is all part of a bigger script, but the issues described above are in this bit:

#make a web environment and add query keys for each accession_list, 
#which together make accession_list_whole

webenv_assigned=False
webenv=''
query_keys=[]
server_limit=200

#accession_list_whole is a big list of accessions e.g. ['accession 1', 'accession 2',...] for >1000 accessions.   
for index in range(0,len(accession_list_whole), server_limit):     

    #make useable accession list
    accession_list=accession_list_whole[index:index+server_limit]

    #feed into epost.  
    teststr=','.join(accession_list)
    if webenv_assigned:
        #build up a list of query keys for a single web env
        epost=Entrez.epost(db="protein", id=','.join(accession_list), WebEnv=webenv) 
        search_results = Entrez.read(epost)
    else:
        #for first loop, assign a web env
        epost=Entrez.epost(db="protein", id=','.join(accession_list)) 
        search_results = Entrez.read(epost)
    webenv = search_results["WebEnv"]
    query_keys.append(search_results["QueryKey"])
    webenv_assigned=True

'''I now have a web env (MCID_5fa80e1470e80f0dc50d04fb) and a list of query keys [1,2,3,4,5].  The next bit of code feeds the web env and query keys to efetch.  I could do this in a for loop and iteratively call efetch for each query key, appending the resultant seqRecords to object_new, but if I can do this all at once i think it will be more efficient.'''

for attempt in range(5): 
    #for loop = this is just in case there's a weird server error, probably unnecessary
    try:
        object_new=[]
        ################################
        #What is the correct query_key  parameter format here?  
        #I have tied %23<key> and #<key> with AND, neither work.  
        ################################
        join='#'
        bool='+AND+'
        queryKey=join+query_keys[0]+bool
        for key in query_keys[1:-1]:
            queryKey+=join+key+bool
        queryKey+=join+query_keys[-1]
        ################################
        #--> 1+AND+#2+AND+#3+AND+#4+AND+#5+AND+#6+AND+#7+AND+#8+AND+#9
        ################################

        fetch=Entrez.efetch(
            db="protein",
            rettype='fasta',#rettype="gb", 
            retmax=len(accession_list_whole),
            webenv=webenv,
            #the annoying param:
            query_key=queryKey
            ) 
        object_new=list(SeqIO.parse(fetch, "fasta"))
        fetched=True
        break
    except Exception as e: 
        print(e)
        #HTTP Error 500: Internal Server Error
        print('str used to generate web env and query key:  ',teststr)
        print('query_key used to generate web env and query key:  ',queryKey)
        fetched=False
        continue  

#I then write the id and seq for each seqRecord in object_new to a txt file in fasta format



Source link