I have run my BAM files through Rsubread (ie using featurecounts) and then through DESeq2 for read normalisation, however, when I choose to count by features (ie by exon not gene) I get multiple rows returned for the same transcript ID:

So, taking the ensembl gene IDs from my normalised counts table as the keys for mapID function in annotationdbi to return the gene symbol, txid, txbiotype, etc:

normalized_counts_new2$SYMBOL <- mapIds(EnsDb.Hsapiens.v86, keys = normalized_counts_new2$GeneID, column = "SYMBOL", keytype = "GENEID", multiVals = "first")
normalized_counts_new2$TXID <- mapIds(EnsDb.Hsapiens.v86, keys = normalized_counts_new2$GeneID, column = "TXID", keytype = "GENEID", multiVals = "first")

The results of the first gene in my table shows that this function has called all the rows the same ensembl transcript IDs, but for this gene we would expect two transcripts (DDX11L1 ensembl page). I guess to be expected using multivals = "first" but I can't set this to list or characterlist without getting an error for mismatch between number of rows between replacement and data
mapID function results

So, then I tried using a select function in place of multiple mapIDs. This gives me the result I would expect, but I can't directly write this output to multiple new columns in the counts table. Of course I thought about joining the normalised counts table and the newly generated annotation table but the problem is that the number of rows in the counts table is 1182163 and in the select function results is 198002, so I wouldn't know which annotations join to which rows:

normalized_counts_new3 <- normalized_counts_new
normalized_counts_new3_annotations <- AnnotationDbi::select(EnsDb.Hsapiens.v86, keys=normalized_counts_new3$GeneID, columns = c("SYMBOL", "GENEBIOTYPE", "GENENAME", "TXID", "TXBIOTYPE"), keytype = "GENEID", multiVals = "list")

select function results

SO, my Qs:

  • Why am I getting multiple rows for the same transcript generated during subread?

  • Am I supposed to consolidate the counts on different rows with duplicate transcriptIDs...? seems odd if so

  • How can I annotate these data properly? The select function gives the expected different transcripts but can't be joined to the counts table as the number of rows is 6 fold less

Source link