Hello,

To begin with, sorry for the long post. I just want to give all the information I think is relevant to the problem I am having. I am newish to bioinformatics, so I apologize if my coding is not the best. If you read all the way through, thank you! I am trying to filter a GenBank file with a list of gene ids using Biopython's SeqIO. I have done this previously with no problem. I am not sure if there is an issue with this particular GenBank file, or if new updates to Biopython have caused the issues. Here is my code below:

genelist_records = []
gb_file1 = "genbank_file.gk"

for inrecord in SeqIO.parse(open(gb_file1,"r"), "genbank"):
    name = inrecord.name
    if name in geneslist_exons:
        genelist_records.append(inrecord)

    else:
        pass

with open ("genbank_file_filtered.gb", "w") as outfile:
    SeqIO.write(genelist_records,outfile, "genbank")

Add these are the errors I receive:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-8f0d4cd86deb> in <module>
     18 
     19 with open ("genbank_file_filtered.gb", "w") as outfile:
---> 20     SeqIO.write(genelist_records,outfile, "genbank")
     21 
     22 #to get the number of gene ids that are over our threshold to make sure we have enough. (we want ~1100)

~/miniconda3/envs/training3/lib/python3.6/site-packages/Bio/SeqIO/__init__.py in write(sequences, handle, format)
    489         if format in _FormatToWriter:
    490             writer_class = _FormatToWriter[format]
--> 491             count = writer_class(fp).write_file(sequences)
    492         elif format in AlignIO._FormatToWriter:
    493             # Try and turn all the records into a single alignment,

~/miniconda3/envs/training3/lib/python3.6/site-packages/Bio/SeqIO/Interfaces.py in write_file(self, records)
    212         """
    213         self.write_header()
--> 214         count = self.write_records(records)
    215         self.write_footer()
    216         return count

~/miniconda3/envs/training3/lib/python3.6/site-packages/Bio/SeqIO/Interfaces.py in write_records(self, records)
    197         count = 0
    198         for record in records:
--> 199             self.write_record(record)
    200             count += 1
    201         # Mark as true, even if there where no records

~/miniconda3/envs/training3/lib/python3.6/site-packages/Bio/SeqIO/InsdcIO.py in write_record(self, record)
    807         """Write a single record to the output file."""
    808         handle = self.handle
--> 809         self._write_the_first_line(record)
    810 
    811         default = record.id

~/miniconda3/envs/training3/lib/python3.6/site-packages/Bio/SeqIO/InsdcIO.py in _write_the_first_line(self, record)
    616             # Must be something like NucleotideAlphabet or
    617             # just the generic Alphabet (default for fasta files)
--> 618             raise ValueError("Need a Nucleotide or Protein alphabet")
    619 
    620         # Get the molecule type

ValueError: Need a Nucleotide or Protein alphabet

I have tried to go back and assign both an alphabet and a molecular type to the original GenBank file with this code:

from Bio.Alphabet import generic_dna
from Bio.Seq import Seq


genelist_records = []
gb_file1 = "genbank_file.gk"

for inrecord in SeqIO.parse(open(gb_file1,"r"), "genbank"):
    name = inrecord.name
    sequence = Seq(str(inrecord.seq), generic_dna) 
    annotation={"molecule_type":"DNA"}
    genelist_records.append(inrecord)

with open ("Trachinotus_ovatus_filtered_genes_mol_type.gb", "w") as outfile:
    SeqIO.write(genelist_records,outfile, "genbank")

However, I get the exact same errors as above.

I have seen in the latest updates where Biopython has replaced Bio.Alphabet, so I tried adding this to my code (which I found here- biopython.org/wiki/Alphabet):

try:
    from Bio.Alphabet import Alphabet
except ImportError:
    Alphabet = "DNA"

And still get the same error. I am really confused and frustrated and am not sure what next steps to take. When looking up a record in my Genbank file, it says there is an alphabet assigned.

ID: LG1_667828-726403
Name: LG1_667828-726403
Number of features: 3
Seq('GAAATATGCTTCAGAACATAACTTCATGTGGAGAGGTGCAGTCAAAGGCATCTG...CTG', Alphabet())

Does anyone have any ideas or suggestions on how to fix this so I can write a GenBank file? Or another method I can use to filter a GenBank file with a list of LOCUS ids?

Thank you for reading and any help!



Source link