I am analyzing the GTF file at ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz and trying to understand gene structure. Based on what I observed, I summarized the following relationships:

  1. UTR is part of exon
  2. CDS is part of exon
  3. start codon is part of CDS, hence part of exon, too
  4. stop codon is part of neither UTR nor CDS, but it's still part of exon.

Therefore, given a transcript id, if I sum the lengths of the features of each type, the following relationship should hold:

L_{exon} = L_{CDS} + L_{UTR} + L_{stop_codon}

I am only considering features whose source is protein_coding. When I check this relationship against all 90273 transcript ids in the GTF file, it holds for 99.85% of transcripts. For the remaining 0.15%, or 113 transcripts, it doesn't hold: the left side is off by 1.
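For reference, the check described above can be sketched roughly as follows. This is a minimal sketch, not the exact script I ran: the function names `feature_lengths` and `violations` are made up for illustration, and it assumes the standard 9-column tab-separated GTF layout with 1-based inclusive coordinates, so each feature contributes `end - start + 1` bases.

```python
import re
from collections import defaultdict

# Assumption: the attribute column contains transcript_id "<id>";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def feature_lengths(gtf_lines, source="protein_coding"):
    """Sum feature lengths per transcript: {transcript_id: {feature: bases}}."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[1] != source:
            continue  # keep only rows whose source column matches
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # rows without a transcript_id (e.g. gene-level rows)
        start, end = int(fields[3]), int(fields[4])
        totals[match.group(1)][fields[2]] += end - start + 1
    return totals

def violations(totals):
    """Return {transcript_id: L_exon - (L_CDS + L_UTR + L_stop_codon)} where nonzero."""
    bad = {}
    for tid, t in totals.items():
        diff = t["exon"] - (t["CDS"] + t["UTR"] + t["stop_codon"])
        if diff != 0:
            bad[tid] = diff
    return bad

# Hypothetical usage on the downloaded file:
#   import gzip
#   with gzip.open("Homo_sapiens.GRCh37.75.gtf.gz", "rt") as fh:
#       print(violations(feature_lengths(fh)))
```

A negative value in the result means the exon total is smaller than the sum on the right side of the formula.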

When I looked closely at several of the 113 anomalous cases, the relationship failed for the same reason in all 4 cases I examined. They all have split stop codons, meaning part of the stop codon is in one exon (e.g. 2 bases) and the rest is in another exon (e.g. 1 base). Strangely, the first 2 bases don't count as part of the CDS but the 1-base part does, which doesn't quite make sense to me. Could this be an error in the GTF file?
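To check whether the remaining anomalous transcripts follow the same pattern, something like the sketch below could flag transcripts whose stop codon is listed on more than one line and count how many stop codon bases also fall inside a CDS interval. The helper names (`collect_intervals`, `overlap_bases`, `split_stop_report`) are hypothetical, and it assumes 1-based inclusive coordinates and that intervals of the same feature type do not overlap each other within a transcript. It does not filter on the source column.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

def collect_intervals(gtf_lines, features=("CDS", "stop_codon")):
    """Collect (start, end) intervals per transcript and feature type."""
    intervals = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] not in features:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            span = (int(fields[3]), int(fields[4]))
            intervals[match.group(1)][fields[2]].append(span)
    return intervals

def overlap_bases(a, b):
    """Count positions shared between two lists of inclusive intervals."""
    return sum(
        max(0, min(e1, e2) - max(s1, s2) + 1)
        for s1, e1 in a
        for s2, e2 in b
    )

def split_stop_report(intervals):
    """Map each transcript with a multi-line stop codon to its CDS overlap in bases."""
    report = {}
    for tid, feats in intervals.items():
        stops = feats.get("stop_codon", [])
        if len(stops) > 1:  # stop codon spread over several GTF lines
            report[tid] = overlap_bases(stops, feats.get("CDS", []))
    return report
```

If every off-by-1 transcript shows up in this report with an overlap of 1, that would suggest the mismatch comes from one stop codon base being counted under both CDS and stop_codon.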

Below, I pasted a concrete example with the problematic region highlighted.

[Image: GTF entries for the example transcript, with the problematic region highlighted]

The entries are sorted by the start column. The summed lengths per feature type are:

CDS            1993 
UTR            2363 
exon           4358 
start_codon    3    
stop_codon     3

Applying the above formula, the left side is 4358, the right side is 1993 + 2363 + 3 = 4359, and they DON'T match.
