I'm collaborating on a project in which the transcriptome of Arabidopsis thaliana was assembled with StringTie. Since one aim of the project is to look into long non-coding RNA (lncRNA) transcripts, only a few of the transcripts were kept.
I need the correspondence between transcript_id and gene_id in order to perform some downstream analyses. I was given the GTF file (for the lncRNAs) and parsed it to retrieve the transcript_id to gene_id correspondence with some
When I started to look into the correspondence between transcript_id and gene_id, I found three types of ids:
1) transcript_id as it appears in the genome annotation of A. thaliana, e.g.,
AT1G04003.1, mapped against the respective gene_id, i.e.,
AT1G04003. If I understood correctly this means that the assembled transcript predicted by StringTie exists in the current annotation of A. thaliana.
2) novel transcript that does not appear in the current genome annotation of A. thaliana, e.g., with the transcript_id
TCONS_00000010, mapped against the novel gene, i.e., with gene_id
XLOC_000005. If I understood correctly, this means that the assembled transcript predicted by StringTie does not exist in the current annotation of A. thaliana (and, by extension, neither does the gene).
3) transcript_id as it appears in the genome annotation of A. thaliana, e.g.,
AT1G04163.1, mapped against the gene_id
MSTRG.236 given by the StringTie software instead of the expected
AT1G04163 gene. Actually, there is a description field in the GTF for these cases, named ref_gene_id, that holds the A. thaliana gene identifier.
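The three cases above can be told apart programmatically. Below is a minimal sketch, assuming the usual GTF attribute syntax (`key "value";` pairs in the ninth column). The helper names `parse_attributes` and `classify` are hypothetical, the prefix tests are just a guess based on the ids shown in this post, and the coordinates in the example records are made up for illustration.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def parse_attributes(field):
    """Turn the attributes column of a GTF record into a dict."""
    return dict(ATTR_RE.findall(field))


def classify(attrs):
    """Label a record with one of the three cases (prefix-based guess)."""
    gene_id = attrs.get("gene_id", "")
    if gene_id.startswith("MSTRG.") and "ref_gene_id" in attrs:
        return "case 3: annotated transcript under an MSTRG gene_id"
    if gene_id.startswith("XLOC_"):
        return "case 2: novel transcript and novel gene"
    if re.fullmatch(r"AT[1-5CM]G\d{5}", gene_id):
        return "case 1: annotated transcript and annotated gene"
    return "other"


# Illustrative records only: the ids come from the post, the rest is invented.
example_lines = [
    'Chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\t'
    'gene_id "AT1G04003"; transcript_id "AT1G04003.1";',
    'Chr1\tStringTie\ttranscript\t1500\t2400\t.\t-\t.\t'
    'gene_id "XLOC_000005"; transcript_id "TCONS_00000010";',
    'Chr1\tStringTie\ttranscript\t3000\t4200\t.\t+\t.\t'
    'gene_id "MSTRG.236"; transcript_id "AT1G04163.1"; ref_gene_id "AT1G04163";',
]

for line in example_lines:
    attrs = parse_attributes(line.split("\t")[8])
    print(attrs["transcript_id"], attrs["gene_id"], classify(attrs), sep="\t")
```

Counting how many records fall into each label gives a quick idea of how common the third case is in the file.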
My problem is understanding the 3rd case, where StringTie keeps the identifier
MSTRG... in gene_id instead of the value of ref_gene_id. From this post on the StringTie GitHub repo I think that I can substitute the gene_id
MSTRG... with the ref_gene_id, but since I don't quite understand these notations I'm not sure. Could this mean that the locus contains a new isoform (hence the new gene_id
MSTRG...), but some of its transcripts match known transcripts in the A. thaliana annotation, which is why they have a transcript_id as it appears in the genome annotation of A. thaliana, e.g.,
AT1G04163.1 (which corresponds to ref_gene_id
AT1G04163), while the gene as a whole is treated as new/novel (due to the novel isoform) and is therefore assigned
MSTRG... by StringTie?
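If the substitution turns out to be appropriate, the mechanics could look like the sketch below. It builds a transcript_id to gene_id table and uses ref_gene_id in place of gene_id whenever a record carries one. The function name `transcript_to_gene` and the `prefer_ref` switch are hypothetical, the example records are invented around the ids from this post, and whether replacing MSTRG ids is biologically sound is exactly what I'm unsure about.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def transcript_to_gene(gtf_lines, prefer_ref=True):
    """Map each transcript_id to a gene id.

    With prefer_ref=True, ref_gene_id (when present on any record of the
    transcript) replaces gene_id; otherwise gene_id is kept as written.
    """
    mapping = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GTF record
        attrs = dict(ATTR_RE.findall(cols[8]))
        transcript_id = attrs.get("transcript_id")
        if transcript_id is None:
            continue
        if prefer_ref and "ref_gene_id" in attrs:
            mapping[transcript_id] = attrs["ref_gene_id"]
        else:
            # Keep the first gene id seen unless a ref_gene_id overrides it.
            mapping.setdefault(transcript_id, attrs.get("gene_id"))
    return mapping


# Illustrative records only: the ids come from the post, the rest is invented.
example_lines = [
    "# example GTF",
    'Chr1\tStringTie\ttranscript\t3000\t4200\t.\t+\t.\t'
    'gene_id "MSTRG.236"; transcript_id "AT1G04163.1"; ref_gene_id "AT1G04163";',
    'Chr1\tStringTie\texon\t3000\t3500\t.\t+\t.\t'
    'gene_id "MSTRG.236"; transcript_id "AT1G04163.1"; exon_number "1";',
    'Chr1\tStringTie\ttranscript\t1500\t2400\t.\t-\t.\t'
    'gene_id "XLOC_000005"; transcript_id "TCONS_00000010";',
]

for transcript_id, gene in transcript_to_gene(example_lines).items():
    print(transcript_id, gene, sep="\t")
```

Calling the function with `prefer_ref=False` keeps the MSTRG ids, which makes it easy to compare both tables in the downstream analyses.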
Thank you in advance for any help or suggestion,