Posted 2 hours ago by antonioggsousa

Hi,

I'm collaborating on a project in which the transcriptome of Arabidopsis thaliana was assembled with StringTie. Since one aim of the project is to look into long non-coding RNA (lncRNA) transcripts, only a few of the assembled transcripts were kept.

I need to retrieve the correspondence between transcript_id and gene_id in order to perform some downstream analyses. I was given the GTF file (for the lncRNAs) and parsed it with some R code to extract the transcript_id to gene_id correspondence.
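For illustration, here is a minimal Python sketch of this kind of parsing (it is not the R code actually used). It assumes a standard tab-separated GTF whose ninth column holds `key "value";` pairs, and it reads only `transcript` feature lines. The function names are hypothetical, and the coordinates in the example record are invented placeholders.

```python
import re

# Matches GTF attribute pairs such as: gene_id "MSTRG.236";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Return the key/value pairs of the 9th GTF column as a dict."""
    return dict(ATTR_RE.findall(field))


def transcript_to_gene(gtf_lines):
    """Map transcript_id -> (gene_id, ref_gene_id); ref_gene_id is None if absent."""
    mapping = {}
    for line in gtf_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only 'transcript' rows are needed for the id correspondence.
        if len(cols) < 9 or cols[2] != "transcript":
            continue
        attrs = parse_attributes(cols[8])
        transcript_id = attrs.get("transcript_id")
        if transcript_id is not None:
            mapping[transcript_id] = (
                attrs.get("gene_id"),
                attrs.get("ref_gene_id"),
            )
    return mapping


# Hypothetical record: the ids come from the post, the coordinates are made up.
example_lines = [
    "# example header\n",
    "Chr1\tStringTie\ttranscript\t100\t900\t1000\t+\t.\t"
    'gene_id "MSTRG.236"; transcript_id "AT1G04163.1"; ref_gene_id "AT1G04163";\n',
]
example_mapping = transcript_to_gene(example_lines)
```

With a real file, the same function could be called as `transcript_to_gene(open(path))`, where `path` points to the lncRNA GTF.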

When I started to look into the correspondence between transcript_id and gene_id, I found three types of ids:

  • 1) transcript_id as it appears in the genome annotation of A. thaliana, e.g., AT1G04003.1, mapped to the respective gene_id, i.e., AT1G04003. If I understood correctly, this means that the assembled transcript predicted by StringTie exists in the current annotation of A. thaliana.

  • 2) novel transcript that does not appear in the current genome annotation of A. thaliana, e.g., with the transcript_id TCONS_00000010, mapped to a novel gene, i.e., with gene_id XLOC_000005. If I understood correctly, this means that the assembled transcript predicted by StringTie does not exist in the current annotation of A. thaliana (and, by extension, neither does the gene).

  • 3) transcript_id as it appears in the genome annotation of A. thaliana, e.g., AT1G04163.1, mapped to the gene_id MSTRG.236 assigned by StringTie instead of the expected gene AT1G04163. For these cases the GTF contains an additional attribute, ref_gene_id, that holds the A. thaliana gene identifier AT1G04163.
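The three cases above can be told apart mechanically. The sketch below is one way to do it in Python. It assumes that annotated ids follow the usual Arabidopsis AGI pattern (for example AT1G04003 and AT1G04003.1) and that novel ids use the TCONS_, XLOC_ and MSTRG. prefixes seen in this file. The function name and the returned labels are hypothetical.

```python
import re

# AGI-style transcript id, e.g. AT1G04003.1 (gene part captured in group 1).
AGI_TRANSCRIPT_RE = re.compile(r"^(AT[1-5CM]G\d{5})\.\d+$")


def classify_pair(transcript_id, gene_id):
    """Label a (transcript_id, gene_id) pair as case 1, 2 or 3 from the list."""
    agi_match = AGI_TRANSCRIPT_RE.match(transcript_id)
    if agi_match and gene_id == agi_match.group(1):
        # Case 1: annotated transcript under its annotated gene.
        return "annotated"
    if transcript_id.startswith("TCONS_") and gene_id.startswith("XLOC_"):
        # Case 2: novel transcript under a novel gene.
        return "novel"
    if agi_match and gene_id.startswith("MSTRG."):
        # Case 3: annotated transcript under a StringTie-assigned gene id.
        return "annotated_transcript_stringtie_gene"
    return "other"


# The three example pairs quoted in the list.
example_labels = [
    classify_pair("AT1G04003.1", "AT1G04003"),
    classify_pair("TCONS_00000010", "XLOC_000005"),
    classify_pair("AT1G04163.1", "MSTRG.236"),
]
```

Counting the labels over the whole file would show how common the third case is, and the "other" label catches any combination not covered by the list.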

My problem is understanding the third case, where StringTie keeps the MSTRG... identifier in gene_id instead of the ref_gene_id value. From this post on the StringTie GitHub repo, I think I can substitute the MSTRG... gene_id with the ref_gene_id, but since I don't fully understand these notations, I'm not sure. Could it mean that the locus contains a novel isoform (hence the new MSTRG... gene_id), while some of its transcripts match known transcripts in the A. thaliana genome annotation, which is why they carry an annotated transcript_id such as AT1G04163.1 (corresponding to ref_gene_id AT1G04163), but the gene as a whole is treated as new/novel and is therefore assigned an MSTRG... gene_id by StringTie?
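To make the proposed substitution concrete, here is a hedged Python sketch of its mechanics only. Whether the substitution is biologically appropriate is exactly the open question. The sketch replaces gene_id with ref_gene_id only when every reference-matched transcript under that gene_id points to the same ref_gene_id, and otherwise leaves the original gene_id alone. That uniqueness rule is an assumption made for safety, not something taken from the StringTie documentation, and the function name is hypothetical.

```python
from collections import defaultdict


def resolve_gene_ids(records):
    """Map transcript_id -> gene id, preferring an unambiguous ref_gene_id.

    records: iterable of (transcript_id, gene_id, ref_gene_id or None).
    """
    records = list(records)

    # Collect every ref_gene_id seen under each gene_id.
    refs_per_gene = defaultdict(set)
    for _, gene_id, ref_gene_id in records:
        if ref_gene_id:
            refs_per_gene[gene_id].add(ref_gene_id)

    resolved = {}
    for transcript_id, gene_id, ref_gene_id in records:
        if ref_gene_id and len(refs_per_gene[gene_id]) == 1:
            # Unambiguous: the gene_id corresponds to one annotated gene.
            resolved[transcript_id] = ref_gene_id
        else:
            # No reference gene, or several under one gene_id: keep as is.
            resolved[transcript_id] = gene_id
    return resolved


# The three example pairs from the post.
example_records = [
    ("AT1G04003.1", "AT1G04003", None),
    ("TCONS_00000010", "XLOC_000005", None),
    ("AT1G04163.1", "MSTRG.236", "AT1G04163"),
]
example_resolved = resolve_gene_ids(example_records)
```

Listing the gene_ids for which `refs_per_gene` holds more than one entry would also show whether a single MSTRG... id ever groups transcripts from several annotated genes, which would argue against a blanket substitution.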

I read the StringTie paper and also checked the manual, but I was not able to find a clear answer to this question.

Thank you in advance for any help or suggestion,

António


