Assessing read support for coding sequences predicted by TransDecoder


TransDecoder can be used to predict coding sequences (CDS), and optionally translate them into amino acid sequences, from nucleotide FASTA files. TransDecoder can predict more than one CDS per sequence, because it scans all six reading frames and also reports incomplete CDS.

I have a de novo assembled transcriptome from which I have predicted a set of CDS, and I am interested in assessing the read support for these sequences.

My question is: would it be a mistake to map the entire set of reads to the CDS file to assess read support? An intuition I cannot fully articulate tells me that using the entire read set could produce spurious read support, because reads that were not used in constructing the parent transcript can end up mapping to a derived CDS by chance. I also suspect (without evidence so far) that the multimapping rate would be quite high, because of the one-to-many relationship between each transcript and the CDS derived from it.

If I should not use the entire read set, then this leads to my second question: how do I subset the set of reads that were actually used to construct the transcripts? Can I first map all reads against the assembly, select only those reads that did map against the assembly, and then try and map these against the CDS file(s)?
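To make the subsetting idea concrete, here is a minimal sketch of the "keep only reads that mapped to the assembly" step. It assumes the reads have already been aligned to the assembly and written out as plain-text SAM; the function name `mapped_read_names` and the toy read and transcript names are hypothetical. The only thing it relies on is the SAM flag bit 0x4, which marks an unmapped record.

```python
from typing import Iterable, Set

UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "segment unmapped"


def mapped_read_names(sam_lines: Iterable[str]) -> Set[str]:
    """Return the names of reads that have at least one mapped record."""
    names = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (headers start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag = fields[0], int(fields[1])
        if not flag & UNMAPPED_FLAG:
            names.add(qname)
    return names


# Toy SAM records; read and transcript names are made up for illustration.
example_sam = [
    "@SQ\tSN:transcript_1\tLN:1000",
    "read_1\t0\ttranscript_1\t10\t60\t50M\t*\t0\t0\t*\t*",
    "read_2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "read_3\t16\ttranscript_1\t200\t60\t50M\t*\t0\t0\t*\t*",
]
print(sorted(mapped_read_names(example_sam)))  # ['read_1', 'read_3']
```

The resulting set of names could then be used to pull the matching records out of the original FASTQ files before mapping against the CDS. For paired-end data, a read pair would be kept here if either mate mapped, which may or may not be the behaviour you want.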

The ultimate objective of this is to try and somehow "rank" the various CDS for each transcript based on read support. (Comments on this are also welcome.)
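As a rough illustration of what I mean by ranking, here is a small sketch that groups CDS by parent transcript and sorts them by a per-CDS support value. It assumes the CDS IDs follow the pattern `<transcript_id>.p<N>`; the function name `rank_cds_by_support`, the IDs, and the counts are all made up for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# ASSUMPTION: CDS IDs look like "<transcript_id>.p<N>", e.g. "tx1.p2".
CDS_SUFFIX = re.compile(r"\.p\d+$")


def rank_cds_by_support(
    read_counts: Dict[str, float],
) -> Dict[str, List[Tuple[str, float]]]:
    """Group CDS by parent transcript and sort each group by support."""
    per_transcript = defaultdict(list)
    for cds_id, count in read_counts.items():
        parent = CDS_SUFFIX.sub("", cds_id)  # strip the ".pN" suffix
        per_transcript[parent].append((cds_id, count))
    # Highest support first; ties are broken alphabetically by CDS ID.
    return {
        parent: sorted(entries, key=lambda e: (-e[1], e[0]))
        for parent, entries in per_transcript.items()
    }


# Hypothetical per-CDS read counts.
counts = {"tx1.p1": 120.0, "tx1.p2": 15.0, "tx1.p3": 340.0, "tx2.p1": 8.0}
ranking = rank_cds_by_support(counts)
print(ranking["tx1"])  # [('tx1.p3', 340.0), ('tx1.p1', 120.0), ('tx1.p2', 15.0)]
```

Raw read counts favour longer CDS, so a length-normalized value is probably a fairer input to this kind of ranking than the plain counts used above.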

Your input would be much appreciated.

Edit: I should mention that this post follows up on these two posts.


cds prediction


transdecoder


de novo assembly


Hi,

Did you come to a conclusion yet? I'm currently wondering the same thing. I ran a de novo transcriptome assembly, but noticed that some contigs are basically concatenated genes (valid CDS with BLAST hits, separated by 5' and 3' UTRs). TransDecoder was able to disentangle those CDS. Wouldn't it be more precise to quantify against the CDS instead of the assembled transcripts?

When I used Salmon with the CDS instead of the assembled transcripts as the reference, read support dropped dramatically (from 99% to 70%). Subsequent analyses with DESeq2 did not show any difference in data quality (the dispersion plots for assembled transcripts and CDS were almost identical).

My conclusion is that chimeric transcripts are chimeric for a reason: they seem to be assembled together because of overlapping read support. I could be entirely wrong though, and I'd be glad if someone has a qualified answer! It's an interesting topic.

Hope this helps, sorry if it doesn't!

Lukas

