Hello everyone!
I have written some very simple code to generate proteins from DNA/RNA. It builds the 6 reading frames and translates each of them against the DNA/RNA codon table.
The DNA codon table I use (from the Wikipedia article about codons):
# 'M' - START, '_' - STOP
DNA_Codons = {
"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
"TGT": "C", "TGC": "C",
"GAT": "D", "GAC": "D",
"GAA": "E", "GAG": "E",
"TTT": "F", "TTC": "F",
"GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
"CAT": "H", "CAC": "H",
"ATA": "I", "ATT": "I", "ATC": "I",
"AAA": "K", "AAG": "K",
"TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
"ATG": "M",
"AAT": "N", "AAC": "N",
"CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
"CAA": "Q", "CAG": "Q",
"CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
"TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S", "AGT": "S", "AGC": "S",
"ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
"GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
"TGG": "W",
"TAT": "Y", "TAC": "Y",
"TAA": "_", "TAG": "_", "TGA": "_"
}
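The six-frame approach described above can be sketched roughly as follows. This is a minimal illustration, not my exact code: the names `CODONS`, `reverse_complement`, `translate` and `six_frames` are hypothetical, and the table is rebuilt compactly in the usual "TCAG" order so the block runs on its own (it should give the same mapping as `DNA_Codons` above, with '_' for STOP).

```python
from itertools import product

# Standard genetic code in "TCAG" order; '_' marks STOP.
# Assumed to be equivalent to the DNA_Codons table shown above.
AMINO_ACIDS = "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, offset=0):
    """Translate seq codon by codon from offset, ignoring a trailing partial codon."""
    return "".join(CODONS[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations of frames 1-3 (forward) and 4-6 (reverse complement)."""
    seq = seq.upper().replace("U", "T")  # treat RNA input as DNA
    rc = reverse_complement(seq)
    return [translate(strand, offset) for strand in (seq, rc) for offset in range(3)]


if __name__ == "__main__":
    for number, protein in enumerate(six_frames("CTGGGGACC"), start=1):
        print(number, protein)
```

For the first nine bases of the sequence quoted below, CTGGGGACC, the three forward frames come out as LGT, WG and GD, which agrees with the start of the first three frames I list further down. Note that this sketch translates every codon in a frame and does not look for a start codon at all.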
I have tested this on many sequences from NCBI, and my simple reading frames -> translation code works just fine for the most part. I did find a few odd sequences on NCBI, and one of them is this:
www.ncbi.nlm.nih.gov/nuccore/JF909299.1
>JF909299.1 Homo sapiens insulin (INS) mRNA, partial cds
CTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTC
TACCTAGTGTGCGGGGAACGAGGCTTCTTCTACACACCCAAGACCCGCCGGGAGGCAGAGGACCTGCAGG
TGGGGCAGGTGGAGCTGGGCGGGGGCCCTGGTGCAGGCAGCCTGCAGCCCTTGGCCCTGGAGGGGTCCCT
GCAGAAGCGTGGCATTGTGGAACAATGCTGTACCAGCATCTGCTCCCTCTACCAGCTGGAGAACTACTGC
AACTA
The expected translated sequence (as per NCBI) is this:
/translation="WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
If we generate the 6 reading frames for JF909299.1 (based on the standard codon table), we get this:
- LGT_PSRSLCEPTPVRLTPGGSSLPSVRGTRLLLHTQDPPGGRGPAGGAGGAGRGPWCRQPAALGPGGVPAEAWHCGTMLYQHLLPLPAGELLQL
- WGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN
- GDLTQPQPL_TNTCAAHTWWKLST_CAGNEASSTHPRPAGRQRTCRWGRWSWAGALVQAACSPWPWRGPCRSVALWNNAVPASAPSTSWRTTAT
- _LQ_FSSW_REQMLVQHCSTMPRFCRDPSRAKGCRLPAPGPPPSSTCPTCRSSASRRVLGV_KKPRSPHTR_RASTRCEPHRCWFTKAAAGSGPQ
- SCSSSPAGRGSRCWYSIVPQCHASAGTPPGPRAAGCLHQGPRPAPPAPPAGPLPPGGSWVCRRSLVPRTLGRELPPGVSRTGVGSQRLRLGQVP
- VAVVLQLVEGADAGTALFHNATLLQGPLQGQGLQAACTRAPAQLHLPHLQVLCLPAGLGCVEEASFPAH_VESFHQV_AAQVLVHKGCGWVRSP
As we can see, the second reading frame is the expected translation as per NCBI, but that means that TGG -> W was used as the start codon. But TGG/UGG (W) is not a start codon, is it?
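The comparison I am describing can be sketched as a small check that reports which forward frame reproduces the expected translation and which codon that frame begins with. This is only an illustration under stated assumptions: `matching_frames`, `translate` and `CODONS` are hypothetical names, the table is rebuilt compactly so the block is self-contained, and only the first line of the sequence and the start of the NCBI translation are used.

```python
from itertools import product

# Standard genetic code in "TCAG" order; '_' marks STOP (assumed equivalent to DNA_Codons).
CODONS = {
    "".join(c): aa
    for c, aa in zip(
        product("TCAG", repeat=3),
        "FFLLSSSSYY__CC_WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    )
}


def translate(seq, offset):
    """Translate seq from offset, ignoring a trailing partial codon."""
    return "".join(CODONS[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3))


def matching_frames(seq, expected_prefix):
    """Return (frame_number, first_codon) for each forward frame whose
    translation starts with expected_prefix."""
    hits = []
    for offset in range(3):
        if translate(seq, offset).startswith(expected_prefix):
            hits.append((offset + 1, seq[offset:offset + 3]))
    return hits


# First line of JF909299.1 and the start of the NCBI translation, as quoted above.
first_line = "CTGGGGACCTGACCCAGCCGCAGCCTTTGTGAACCAACACCTGTGCGGCTCACACCTGGTGGAAGCTCTC"
print(matching_frames(first_line, "WGPDPAAAFVNQHLCGSHL"))  # [(2, 'TGG')]
```

The check only asks which frame matches; it deliberately does not require the matching frame to begin with ATG. The title line quoted above does describe the record as a "partial cds", which may be relevant here, but I have not confirmed that.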
I have searched the net and did not find any information about a case where TGG/UGG could be a start codon. This is supposed to be a standard Homo sapiens insulin (INS) mRNA. Another example is this online tool example link, which produces exactly the same result as my code (found proteins are marked in red).
Can anyone please help me understand this particular case? Why does the NCBI database tell us that TGG/UGG can be a start codon instead of the standard ATG/AUG? This basically breaks my code, and I am looking to update it to support this logic as soon as I understand it.
Kind regards to this amazing community.