Hello everyone!

I have written some very simple code that generates proteins from DNA/RNA by building all six reading frames and translating each of them against the DNA/RNA codon table:


The DNA codon table I use (from the Wikipedia article about codons):

# 'M' - START, '_' - STOP

DNA_Codons = {
"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
"TGT": "C", "TGC": "C",
"GAT": "D", "GAC": "D",
"GAA": "E", "GAG": "E",
"TTT": "F", "TTC": "F",
"GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
"CAT": "H", "CAC": "H",
"ATA": "I", "ATT": "I", "ATC": "I",
"AAA": "K", "AAG": "K",
"TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
"ATG": "M",
"AAT": "N", "AAC": "N",
"CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
"CAA": "Q", "CAG": "Q",
"CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
"TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S", "AGT": "S", "AGC": "S",
"ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
"GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
"TGG": "W",
"TAT": "Y", "TAC": "Y",
"TAA": "_", "TAG": "_", "TGA": "_"
}

I have tested this on many sequences from NCBI, and my simple reading-frame generation and translation code works fine for the most part. However, I found a few odd sequences on NCBI, and this is one of them:

www.ncbi.nlm.nih.gov/nuccore/JF909299.1

>JF909299.1 Homo sapiens insulin (INS) mRNA, partial cds
CTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTC
TACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG
TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCT
GCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGC
AACTA

The expected (as per NCBI) translated sequence is this:

/translation="WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"

If we generate 6 reading frames for JF909299.1 (based on the standard codon table) we will get this:

  1. LGT_PSRSLCEPTPVRLTPGGSSLPSVRGTRLLLHTQDPPGGRGPAGGAGGAGRGPWCRQPAALGPGGVPAEAWHCGTMLYQHLLPLPAGELLQL
  2. WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
  3. GDLTQPQPL_TNTCAAHTWWKLST_CAGNEASSTHPRPAGRQRTCRWGRWSWAGALVQAACSPWPWRGPCRSVALWNNAVPASAPSTSWRTTAT
  4. _LQ_FSSW_REQMLVQHCSTMPRFCRDPSRAKGCRLPAPGPPPSSTCPTCRSSASRRVLGV_KKPRSPHTR_RASTRCEPHRCWFTKAAAGSGPQ
  5. SCSSSPAGRGSRCWYSIVPQCHASAGTPPGPRAAGCLHQGPRPAPPAPPAGPLPPGGSWVCRRSLVPRTLGRELPPGVSRTGVGSQRLRLGQVP
  6. VAVVLQLVEGADAGTALFHNATLLQGPLQGQGLQAACTRAPAQLHLPHLQVLCLPAGLGCVEEASFPAH_VESFHQV_AAQVLVHKGCGWVRSP
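For anyone who wants to reproduce the frames above, here is a minimal sketch of six-frame translation. It is not my exact code: the names `CODON_TABLE`, `reverse_complement`, `translate` and `six_frames` are made up for this example, the table is a compact rebuild of the same standard codon table (with '_' for stop), and it assumes an uppercase DNA sequence containing only A, C, G and T.

```python
# Compact standard codon table (bases ordered T, C, A, G in each position).
# '_' marks a stop codon, the same convention as the table in this post.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq, offset=0):
    """Translate codon by codon starting at offset; a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Frames 1-3 read the given strand, frames 4-6 read its reverse complement."""
    rev = reverse_complement(seq)
    return [translate(seq, o) for o in range(3)] + [translate(rev, o) for o in range(3)]


# JF909299.1, copied from the FASTA record quoted above.
INS = (
    "CTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTC"
    "TACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG"
    "TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCT"
    "GCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGC"
    "AACTA"
)

if __name__ == "__main__":
    for number, frame in enumerate(six_frames(INS), start=1):
        print(number, frame)
```

With this sketch, the second entry of `six_frames(INS)` is the frame that matches the NCBI translation.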

As we can see, the second reading frame matches the expected translation from NCBI. But that means the translation begins with TGG -> W, as if TGG had been used as a start codon. TGG/UGG (W) is not a start codon, is it?
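To show why this matters for my code, here is a rough sketch of the protein-finding step, under the assumption that a protein is only reported when it starts at 'M' and ends at a stop ('_'). The function name `find_proteins` is hypothetical and the logic is simplified. Frame 2 contains neither an 'M' nor a '_', so a finder like this reports nothing for it.

```python
def find_proteins(frame):
    """Return every stretch that starts at 'M' and ends at a stop ('_').

    The stop symbol itself is not included in the returned proteins.
    A stretch that is still open at the end of the frame is discarded.
    """
    proteins = []
    current = None  # None means we are not inside an open reading frame
    for amino_acid in frame:
        if current is None:
            if amino_acid == "M":
                current = "M"
        elif amino_acid == "_":
            proteins.append(current)
            current = None
        else:
            current += amino_acid
    return proteins


# Frame 2 of JF909299.1, as listed above.
FRAME_2 = (
    "WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED"
    "LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
)

if __name__ == "__main__":
    print(find_proteins(FRAME_2))        # no 'M' in this frame, so nothing is found
    print(find_proteins("AAMKT_GGMV_"))  # made-up toy input with two M..._ stretches
```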

I have searched the net and did not find any information about a case where TGG/UGG could be a start codon. This is supposed to be a standard Homo sapiens insulin (INS) mRNA. An online translation tool produces exactly the same result as my code (the proteins it finds are marked in red).


Can anyone please help me understand this particular case? Why does the NCBI database tell us that TGG/UGG can be a start codon instead of the standard ATG/AUG? This basically breaks my code, and I would like to update it to support this logic as soon as I understand it.


Kind regards to this amazing community.


