RNA-seq de novo assembly alignment quality assessment



I have an RNA-seq dataset from a non-model organism and need to do de novo assembly. After running Trinity, I checked the quality of the assembled transcriptome by aligning the reads back to it with both bowtie and bowtie2. The problem is that the percentage of pairs that "aligned concordantly exactly 1 time" is very low.
For instance, from bowtie2:

40191191 reads; of these:
  40191191 (100.00%) were paired; of these:
    1107263 (2.75%) aligned concordantly 0 times
    811254 (2.02%) aligned concordantly exactly 1 time
    38272674 (95.23%) aligned concordantly >1 times
    1107263 pairs aligned concordantly 0 times; of these:
      3136 (0.28%) aligned discordantly 1 time
    1104127 pairs aligned 0 times concordantly or discordantly; of these:
      2208254 mates make up the pairs; of these:
        585954 (26.53%) aligned 0 times
        74874 (3.39%) aligned exactly 1 time
        1547426 (70.07%) aligned >1 times
99.27% overall alignment rate
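
To make these numbers easier to compare across runs, here is a minimal Python sketch that pulls the pair-level counts out of the summary and recomputes the percentages. It is not part of bowtie2: the function name `parse_bowtie2_summary` and the regular expressions are assumptions based only on the summary format shown above.

```python
import re

# The bowtie2 summary quoted above, pasted verbatim.
SUMMARY = """\
40191191 reads; of these:
  40191191 (100.00%) were paired; of these:
    1107263 (2.75%) aligned concordantly 0 times
    811254 (2.02%) aligned concordantly exactly 1 time
    38272674 (95.23%) aligned concordantly >1 times
    1107263 pairs aligned concordantly 0 times; of these:
      3136 (0.28%) aligned discordantly 1 time
    1104127 pairs aligned 0 times concordantly or discordantly; of these:
      2208254 mates make up the pairs; of these:
        585954 (26.53%) aligned 0 times
        74874 (3.39%) aligned exactly 1 time
        1547426 (70.07%) aligned >1 times
99.27% overall alignment rate
"""

# Assumed patterns: one per summary line of interest.
PATTERNS = {
    "pairs": r"(\d+) \([\d.]+%\) were paired",
    "conc_0": r"(\d+) \([\d.]+%\) aligned concordantly 0 times",
    "conc_1": r"(\d+) \([\d.]+%\) aligned concordantly exactly 1 time",
    "conc_multi": r"(\d+) \([\d.]+%\) aligned concordantly >1 times",
    "mates_unaligned": r"(\d+) \([\d.]+%\) aligned 0 times",
}


def parse_bowtie2_summary(text):
    """Return the counts named in PATTERNS as a dict of ints."""
    stats = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match is None:
            raise ValueError(f"could not find '{name}' in the summary")
        stats[name] = int(match.group(1))
    return stats


stats = parse_bowtie2_summary(SUMMARY)
pairs = stats["pairs"]
unique_pct = 100 * stats["conc_1"] / pairs
multi_pct = 100 * stats["conc_multi"] / pairs
# Each pair has two mates; only mates that aligned nowhere count as unaligned.
overall_pct = 100 * (2 * pairs - stats["mates_unaligned"]) / (2 * pairs)

print(f"uniquely concordant: {unique_pct:.2f}%")
print(f"multi-mapped concordant: {multi_pct:.2f}%")
print(f"overall alignment rate: {overall_pct:.2f}%")
```

The recomputed values match the reported 2.02%, 95.23% and 99.27%. Almost every pair aligns, but nearly all of them align to more than one contig, which is consistent with the assembly containing many near-identical sequences rather than with poor read support.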

I observed the same trend when I used BUSCO to check the quality of the assembly.


I used BBDuk and Trimmomatic to remove adapters and low-quality reads. On the cleaned reads, FastQC confirmed that the adapters were removed and that per-base quality was fine, but the "Sequence Duplication Levels" module failed. I assume this is very common with RNA-seq data (?).

Can someone suggest what could be improved in this case? Can I continue with this assembly? My ultimate plan is to run Corset to remove redundancy, use RSEM to quantify expression levels, and then perform DE analysis.
Thanks in advance for the help!






I assume you are interested in gene-level DGE.

What kind of genome does the organism have? You may have homeologs here, which may have resulted in many duplicated sequences; your BUSCO result points in that direction. That's not bad per se, but you need to choose your quantification tools carefully. If you want to infer DGE, go the gene-level route (annotation/clustering + tximport). Also, do not use bowtie2 for quantification. It may be tempting, but Salmon/Kallisto can statistically partition read counts across similar transcripts and have a lower computational cost.
I'd be happy if you could post your way forward once you have decided!
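
To illustrate what "statistically partitioning" multi-mapped reads means, and why gene-level summaries are the safer target, here is a toy Python sketch. It is not Salmon's or Kallisto's actual implementation: it is a bare-bones EM that ignores effective transcript lengths, fragment models and alignment scores, and all names and read counts (`em_partition`, `sum_to_genes`, `tx_A`, `gene_1`, ...) are hypothetical.

```python
from collections import defaultdict


def em_partition(read_compat, n_iter=200):
    """Assign reads to transcripts with a bare-bones EM.

    read_compat: list of tuples; each tuple holds the transcript ids one
    read is compatible with. Returns {transcript: expected read count}.
    Simplification: effective lengths and alignment scores, which real
    tools model, are ignored here.
    """
    transcripts = sorted({t for compat in read_compat for t in compat})
    n_reads = len(read_compat)
    # Start from equal relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = defaultdict(float)
        # E-step: split each read across its compatible transcripts in
        # proportion to the current abundance estimates.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        theta = {t: counts[t] / n_reads for t in transcripts}
    return {t: theta[t] * n_reads for t in transcripts}


def sum_to_genes(tx_counts, tx2gene):
    """Collapse transcript-level counts to gene level by summing."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)


# Hypothetical toy data: tx_A and tx_B are near-identical copies of one
# gene, tx_C is unrelated. 60 reads cannot tell tx_A and tx_B apart.
reads = (
    [("tx_A",)] * 30
    + [("tx_B",)] * 10
    + [("tx_A", "tx_B")] * 60
    + [("tx_C",)] * 20
)
tx2gene = {"tx_A": "gene_1", "tx_B": "gene_1", "tx_C": "gene_2"}

tx_counts = em_partition(reads)
gene_counts = sum_to_genes(tx_counts, tx2gene)
print({t: round(c, 2) for t, c in tx_counts.items()})
print({g: round(c, 2) for g, c in gene_counts.items()})
```

In this toy case the 60 ambiguous reads end up split roughly 3:1 between `tx_A` and `tx_B`, following the ratio of their unique reads, so the transcript-level estimates come out at about 75 and 25. That split depends on the model, but the gene-level total for `gene_1` is 100 no matter how the ambiguous reads are divided. This is the core reason to cluster or annotate first and then summarise to genes; tximport itself does more than a plain sum (for example, it can carry length information along).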

Best of luck.

Edit: elaborated a bit more on quasi-mappers.

Edit2: sequence duplication levels are nothing to go by in RNA-Seq. I'd completely ignore them. Use Transrate to evaluate your assembly. For Transrate, you can download the Oyster River Protocol and use the packaged version (Transrate-ORP).

Edit3: If you get a low Transrate score (< 0.15), try other assemblers and compare the results.
