I have RNA-seq data from 160 samples (90 tumor and 50 normal samples). We used a ribo-depletion approach (Ribo-Zero Gold rRNA Removal Kit). Our main goal is to detect novel lncRNAs from the RNA-seq data. Previous research papers report that lncRNAs are well detected with the ribo-depletion approach.

Initially, I ran FastQC on all the samples and then did the alignment. The mapping rate is very good, about 85-90%. The data is paired-end and strand-specific. After alignment I did QC and observed approximately:

70% of reads in intronic regions,
15-20% of reads in exonic regions,
and the rest (about 10-15%) in intergenic regions.

I visualized the BAM files in IGV and I see most of the reads in the intronic regions of genes.

I have also checked multiple posts, tutorials and some papers, which mention the following:

There can be DNA contamination (reads spread evenly across both introns and intergenic regions) or immature RNA (more intronic reads, fewer intergenic reads).
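To make that rule of thumb concrete, here is a minimal sketch of how I understand the comparison: normalise the read counts by the total length of each region type and compare the intronic density with the intergenic density. The function names, the tolerance value and the example numbers are all my own placeholders (not from any tool or from my data), so treat it as an illustration only.

```python
def density_per_kb(read_count, total_length_bp):
    """Reads per kilobase of a region type (e.g. all introns combined)."""
    if total_length_bp <= 0:
        raise ValueError("total_length_bp must be positive")
    return read_count / (total_length_bp / 1000.0)


def intron_to_intergenic_ratio(intronic_reads, intronic_bp,
                               intergenic_reads, intergenic_bp):
    """Ratio of intronic to intergenic read density."""
    intron_density = density_per_kb(intronic_reads, intronic_bp)
    intergenic_density = density_per_kb(intergenic_reads, intergenic_bp)
    if intergenic_density == 0:
        return float("inf")
    return intron_density / intergenic_density


def interpret(ratio, tolerance=0.5):
    """Rough reading of the ratio; `tolerance` is an arbitrary placeholder."""
    if abs(ratio - 1.0) <= tolerance:
        return "similar density: more consistent with DNA contamination"
    if ratio > 1.0:
        return "intron-enriched: more consistent with immature RNA"
    return "intergenic-enriched: neither pattern"


# Made-up numbers purely for illustration (not my real counts).
ratio = intron_to_intergenic_ratio(7000, 1_000_000, 1000, 1_000_000)
print(ratio, "->", interpret(ratio))
```

The real counts and region lengths would have to come from the annotation and a read-distribution QC tool; the cut-off between "similar" and "enriched" is a judgment call, not an established threshold.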

The intronic reads can come either from independent transcripts or from immature transcripts that have not been spliced. Immature transcripts could include either full-length pre-mRNA molecules or nascent transcripts where the RNA polymerase had not yet reached the 3′ end of the gene (according to this paper).

I have very little experience with lncRNAs and Ribo-Zero. So, can anyone tell me whether it is reasonable to use this data (with its high proportion of immature transcripts) for novel lncRNA detection? Any other ideas on how to proceed to achieve my goal?


