I've done denovo transcriptome assembly using Stringtie with paired end RNA-seq reads coming from a mid/large size experiment (60+ samples) conducted on rabbit: I've followed developers' protocol, so alignment+assembly+merging. Now, from the GTF, I thought about two possible ways:

  1. Quantification via Stringtie using the denovo merged GTF file: then using authors' script, convert samples quantification into a single gene-level matrix to use with DESeq2.
  2. Using the merged GTF, building a transcriptome using gffread and then do alignment/mapping against the transcriptome: import counts and analyze with DESeq2

I want to ask if both seems reasonable and, if not, which is better.
I've got some trouble when trying to implement them:

With the first approach: the same gene was present multiple times while doing DESeq2::results

> n_occur <- data.frame(table(rownames(countData)))
> dim(n_occur)  
[1] 39722     2
>n_occur[n_occur$Freq > 1,]
[1] Var1 Freq
<0 rows> (or 0-length row.names)

So no duplicates seems to be present.

dds <- DESeqDataSetFromMatrix(countData, colData, design=~0 + group)
res <- results(dds, contrast = c("group", "D31", "NT0"), alpha = 0.05, 
               pAdjustMethod = "BH")
# lfc shrinkage to have best results. 
res_lfcsh <- lfcShrink(dds, contrast = c('group', 'D31', 'NTO'), res = res, type = "ashr")

Visualizing results

"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
"ENSOCUG00000034460",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA
"ENSOCUG00000034460.1",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA
"ENSOCUG00000034460.2",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA

These "genes", like ENSOCUG00000034460.1, are duplications and are not present in original gene count matrix obtained via prepDE. Where does they come from? They appear many times (200-300 and more).

With the second: is it possible to create a linkedTxome to use with Salmon with the denovo GTF and the transcriptome generated via gffread? Is it necessary to build a TxDb to analyze Salmon data?

Thanks and sorry for the multiple questions.



Source link