gravatar for kmussar

3 hours ago by

Hi,

I'm trying to find the longest ORFs in a fasta file containing a few genomic sequences. I've been working with the function in the biopython documents, but am not sure I quite understand the code. Any clarification would be helpful!

My questions:

  • Do I need to worry about iterating through nucleotides in groups of
    3? Since this function translates into proteins, I don't think I do?
  • For the standard code, translation table 1), why is the amino acid L being marked as a start codon?
    (www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG1) I was
    under the impression only M (AUG/ATG) is a start codon.
  • The code given in the python documents (biopython.org/DIST/docs/tutorial/Tutorial.pdf) does not
    include any statements indicating that the ORF should start at a
    Met/AUG/ATG. Why is this?
  • I believe I am finding the longest ORF by finding the first M and going until reaching a *. I may be missing shorter reading frames in
    this example by not looking for subsequent M downstream of the first
    one. But I think that is okay given the question - to find the
    longest ORFs. Is this reasoning correct?

Code:

table = 1Output of code listed below as image. 
min_pro_len = 100
ORF_f2 = []

def find_orfs(filename):    
    for record in SeqIO.parse(filename, "fasta"):
        for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
            for frame in range(3):
                # Splits on stop codons
                for pro in nuc[frame:].translate(table).split("*"):
                    # Truncate sequence to start with the start codon
                    pro = pro[pro.find('M'):]
                    if frame == 2: 
                        # add length of sequence to list if in frame 2
                        ORF_f2.append(len(pro))
                        if len(pro) >= min_pro_len:
                            print("%s...%s - length %i, strand %i, frame %i" 
                                % (pro[:30], pro[-3:], len(pro), strand, frame))

    print(max(ORF_f2))

find_orfs('dna.example.fasta')

Output of code:

MSTFNRLHLIRQVDLFTLKLFVSAVEEQQI...PRD - length 305, strand 1, frame 2
MQPIVLGRLEIQKVVEMETSLPIEVLFPGV...TQV - length 294, strand 1, frame 2
MASSQGTVDFIVEQMAAAGTVSARKMFGEY...RRR - length 107, strand 1, frame 2
MSSFLCSRRMRDGRAHVSQNTPGRSATWHA...PAS - length 247, strand -1, frame 2
MAVLVQNVTLSLMKVSDRGTQNFSFSRPEN...IRR - length 128, strand -1, frame 2
MHRATGFERTPATHPVGGDRERIVQRPIPV...DDQ - length 535, strand 1, frame 2
MPTDDASAASIDAGGTRNAWRPGLGAHEPD...RTG - length 299, strand 1, frame 2
MRLSHERLEKDLLSKPTTLRDSITELRRLS...VHV - length 314, strand -1, frame 2

(truncated)



Source link