Asked 2 hours ago by nattzy94

I am assembling transcripts into a GTF file from a BAM file that I generated by aligning my RNA-seq reads with STAR. The assembly was done using StringTie with the Ensembl annotation file for GRCh38.

My problem is that the resulting GTF file does not contain all the attributes that are in the reference annotation. In particular, it is missing the transcript biotype, which is what I am interested in.

For instance, the reference annotation has the following fields for an exon of one transcript:

 1       havana  exon    12975   13052   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000450305"; transcript_version "2"; exon_number "4"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-201"; transcript_source "havana"; transcript_biotype "transcribed_unprocessed_pseudogene"; exon_id "ENSE00001799933"; exon_version "2"; tag "basic"; transcript_support_level "NA";

However, a record in my assembled GTF file looks like this:

1       StringTie       exon    12613   12721   1000    +       .       gene_id "MSTRG.1"; transcript_id "ENST00000456328"; exon_number "2"; gene_name "DDX11L1"; ref_gene_id "ENSG00000223972";

I've also tried searching the entire assembled file for "transcript_biotype", but nothing comes up.

From this previous post, I saw that a potential fix might be to convert the GTF to BED12 and then annotate the BED12 using the Ensembl annotation file. However, I'm not sure exactly which bedtools function to use.

It would be great if anyone could point me to a different solution.
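
One possibility that avoids the BED12 conversion, offered only as a rough sketch: since the assembled record above still carries an Ensembl transcript_id, the biotype could be looked up in the reference GTF by that ID and appended to each assembled line. This assumes that both files are tab-separated GTF with quoted attribute values, and that StringTie kept the reference transcript_id for transcripts matching the annotation. Novel transcripts with MSTRG-style IDs have no reference entry and are left unchanged. The function names below are made up for illustration and are not part of StringTie or any other tool.

```python
import re

# Matches GTF attribute pairs such as: transcript_id "ENST00000450305";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Return the GTF attribute column as a dict (last value wins for repeated keys)."""
    return dict(ATTR_RE.findall(attr_field))


def load_biotypes(reference_lines):
    """Map transcript_id -> transcript_biotype from reference GTF lines."""
    biotypes = {}
    for line in reference_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        if "transcript_id" in attrs and "transcript_biotype" in attrs:
            biotypes[attrs["transcript_id"]] = attrs["transcript_biotype"]
    return biotypes


def annotate(assembled_lines, biotypes):
    """Yield assembled GTF lines, appending transcript_biotype where the ID is known."""
    for line in assembled_lines:
        stripped = line.rstrip("\n")
        fields = stripped.split("\t")
        if stripped.startswith("#") or len(fields) < 9:
            yield stripped  # pass comments and malformed lines through untouched
            continue
        attrs = parse_attributes(fields[8])
        biotype = biotypes.get(attrs.get("transcript_id"))
        if biotype is not None and "transcript_biotype" not in attrs:
            fields[8] = fields[8].rstrip() + f' transcript_biotype "{biotype}";'
        yield "\t".join(fields)


# Example usage (both file names are hypothetical placeholders):
# with open("reference.gtf") as ref:
#     biotypes = load_biotypes(ref)
# with open("assembled.gtf") as asm, open("assembled.biotype.gtf", "w") as out:
#     for record in annotate(asm, biotypes):
#         out.write(record + "\n")
```

Keying on transcript_id rather than on coordinates means the exon boundaries do not have to match the reference exactly. The trade-off is that only transcripts with a reference ID get a biotype.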

