I am new to RNA-seq and am trying to analyse the transcriptomes of two tree species: one with a reference genome and one without. I have a couple of questions that I keep coming back to, and I hope you can help me with them.

  1. Can you explain to me why a reference genome plus a GTF/GFF3 annotation is preferred for differential gene expression analysis? Why don't we extract the transcripts from the GTF/GFF3, build a list of the annotated transcripts, and use it the way we would use a de novo transcriptome assembly? As I understand it, mapping reads to a list of transcripts should be computationally easier (and less resource-intensive?) for a read mapper than mapping them to a genome.

  2. In addition to removing rRNA, I usually map reads to a reference genome to eliminate potential contamination from plant-associated microbes. I can't do that for the species without a genome, for which I will have to do a de novo transcriptome assembly. For now, I am thinking of using all reads for the de novo assembly and the differential gene expression analysis, and then BLASTing the transcripts that come out as significant to screen for contamination. Do you think this would be okay? Or should I try to use Blobtools to remove contamination before the de novo assembly? This could be overkill, but I don't really know how many microbial reads make it through. 🙁
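
To make question 1 concrete, here is roughly what I mean by "extracting transcripts" from the annotation. This is only a minimal Python sketch: the function names, the toy genome, and the toy GTF lines are all made up for illustration, and it assumes a GTF whose exon records carry a `transcript_id` attribute. In practice I assume a dedicated tool would be used instead.

```python
import re
from collections import defaultdict

# Translation table for complementing DNA bases (hypothetical helper).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_transcripts(gtf_lines, genome):
    """Return {transcript_id: spliced sequence} built from GTF exon records.

    `genome` is a dict mapping chromosome name to its sequence string.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute sequence
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        exons[match.group(1)].append(
            (fields[0], int(fields[3]), int(fields[4]), fields[6])
        )

    transcripts = {}
    for transcript_id, parts in exons.items():
        parts.sort(key=lambda exon: exon[1])  # order exons by genomic start
        chrom, strand = parts[0][0], parts[0][3]
        seq = "".join(genome[chrom][start - 1:end] for _, start, end, _ in parts)
        # Minus-strand transcripts are reported as the reverse complement.
        transcripts[transcript_id] = (
            reverse_complement(seq) if strand == "-" else seq
        )
    return transcripts


# Toy data, invented purely for this example.
toy_genome = {"chr1": "AAACCCGGGTTTACGT"}
toy_gtf = [
    'chr1\ttoy\texon\t1\t3\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t7\t9\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\ttoy\texon\t10\t13\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
]
toy_transcripts = extract_transcripts(toy_gtf, toy_genome)
```

The resulting set of transcript sequences is what I would then hand to a read mapper in place of the genome, in the same way I would use the contigs of a de novo assembly.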

Thank you very much in advance for your help! All comments and suggestions are welcome!

