I'm trying to find the longest ORFs in a fasta file containing a few genomic sequences. I've been working with the function in the biopython documents, but am not sure I quite understand the code. Any clarification would be helpful!
- Do I need to worry about iterating through nucleotides in groups of
3? Since this function translates into proteins, I don't think I do?
- For the standard code, translation table 1), why is the amino acid L being marked as a start codon?
(www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1) I was
under the impression only M (AUG/ATG) is a start codon.
- The code given in the python documents (biopython.org/DIST/docs/tutorial/Tutorial.pdf) does not
include any statements indicating that the ORF should start at a
Met/AUG/ATG. Why is this?
- I believe I am finding the longest ORF by finding the first M and going until reaching a *. I may be missing shorter reading frames in
this example by not looking for subsequent M downstream of the first
one. But I think that is okay given the question - to find the
longest ORFs. Is this reasoning correct?
table = 1Output of code listed below as image. min_pro_len = 100 ORF_f2 =  def find_orfs(filename): for record in SeqIO.parse(filename, "fasta"): for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]: for frame in range(3): # Splits on stop codons for pro in nuc[frame:].translate(table).split("*"): # Truncate sequence to start with the start codon pro = pro[pro.find('M'):] if frame == 2: # add length of sequence to list if in frame 2 ORF_f2.append(len(pro)) if len(pro) >= min_pro_len: print("%s...%s - length %i, strand %i, frame %i" % (pro[:30], pro[-3:], len(pro), strand, frame)) print(max(ORF_f2)) find_orfs('dna.example.fasta')
Output of code:
MSTFNRLHLIRQVDLFTLKLFVSAVEEQQI...PRD - length 305, strand 1, frame 2 MQPIVLGRLEIQKVVEMETSLPIEVLFPGV...TQV - length 294, strand 1, frame 2 MASSQGTVDFIVEQMAAAGTVSARKMFGEY...RRR - length 107, strand 1, frame 2 MSSFLCSRRMRDGRAHVSQNTPGRSATWHA...PAS - length 247, strand -1, frame 2 MAVLVQNVTLSLMKVSDRGTQNFSFSRPEN...IRR - length 128, strand -1, frame 2 MHRATGFERTPATHPVGGDRERIVQRPIPV...DDQ - length 535, strand 1, frame 2 MPTDDASAASIDAGGTRNAWRPGLGAHEPD...RTG - length 299, strand 1, frame 2 MRLSHERLEKDLLSKPTTLRDSITELRRLS...VHV - length 314, strand -1, frame 2