Hope everyone is doing great!
I have a question about TMM normalization (pardon me if it seems like a simple question; I am still new to this kind of analysis).
I want to analyze a publicly available RNA-seq dataset (GSE114686). However, the deposited data are not raw counts: they have already been pre-processed with edgeR, and I find the description of that pre-processing a bit confusing.
In the details of the pre-processing, the researchers state the following:
"Sequenced reads were trimmed for adaptor sequence, and masked for low-quality sequence, then mapped to hg19 genome using STAR in bcbio nextgen package with parameters: analysis: RNAseq, aligner: star, trim_reads: read_through, adapters: [truseq, polya], quality_format: standard, strandedness: firststrand.
Based on the feature count table generated from STAR alignment, we did differential expression analysis using edgeR. A filtering criterion of mean Count Per Million (CPM) > 1 of each time point in the vehicle-only controls was used, resulting in 13,227 transcripts, in the ProcessedData.csv.
The count table was generated from edgeR analysis after TMM normalization
Supplementary_files_format_and_content: [ProcessedData.csv] TMM-normalized feature counts from STAR alignment"
My questions are:
1) Based on the details above, are the provided data really normalized? Or was the normalization used only for filtering, with raw counts then being provided (I don't think the latter is true)? I ask because when I plotted the log CPM per sample without applying TMM normalization myself, the boxplots looked a bit odd (ibb.co/rG4F65S). I was expecting to see similar expression distributions across samples, assuming this dataset is already on the TMM scale.
2) If my concern is unfounded and these boxplots do represent the normalized expression distributions, can I then just proceed with limma or edgeR without applying the cpm function from edgeR? Or are there precautions to take when analyzing data that are not raw counts?
Thank you very much in advance!
library(edgeR)      # DGEList(), cpm()
library(tidyverse)  # as_tibble(), pivot_longer(), ggplot()

dg_list <- DGEList(counts = sora_df[, 2:ncol(sora_df)], genes = sora_df[, 1])
geneIDs <- sora_df[, 1]
cpm_log2 <- cpm(dg_list, log = TRUE)  # calcNormFactors() is not called, so the norm factors stay at 1
cpm_log2 <- as_tibble(cpm_log2, rownames = "geneIDs")
df.pivot <- pivot_longer(cpm_log2,                  # dataframe to be pivoted
                         cols = D1A1:S4C3,          # column names to be stored as a SINGLE variable
                         names_to = "samples",      # name of that new variable (column)
                         values_to = "expression")  # name of new variable (column) storing all the values (data)

# plotting effect of cleaning
ggplot(df.pivot) +
  aes(x = samples, y = expression, fill = samples) +
  geom_boxplot(show.legend = FALSE) +  # trim = FALSE removed: it is a geom_violin() argument
  stat_summary(fun = "median", geom = "point", shape = 95, size = 10,
               color = "black", show.legend = FALSE) +
  labs(y = "log2 expression", x = "sample",
       title = "Log2 Counts per Million (CPM)",
       subtitle = "Filtered, TMM normalized??") +
  theme_bw() +
  coord_flip()
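
A quick sanity check that might help with question 1, as a minimal sketch only: raw feature counts are normally whole numbers, whereas TMM-normalized values will usually contain fractions. The two matrices below are made-up toy data standing in for ProcessedData.csv, and is_whole_number_matrix is a hypothetical helper name, not an edgeR function. It uses base R only.

```r
# Toy matrices standing in for the real table (values are invented for illustration).
raw_like <- matrix(c(10, 0, 250, 3, 12, 400), nrow = 3,
                   dimnames = list(paste0("gene", 1:3), c("S1", "S2")))
norm_like <- matrix(c(10.4, 0.0, 251.7, 2.9, 11.6, 389.2), nrow = 3,
                    dimnames = dimnames(raw_like))

# Hypothetical helper: TRUE if every entry is (numerically) a whole number.
is_whole_number_matrix <- function(mat, tol = 1e-8) {
  all(abs(mat - round(mat)) < tol)
}

is_whole_number_matrix(raw_like)   # whole numbers only, consistent with raw counts
is_whole_number_matrix(norm_like)  # fractional values, consistent with normalized data

# Per-sample totals are also worth a look: very different totals across
# samples would be more typical of raw library sizes than of CPM-scale values.
colSums(raw_like)
colSums(norm_like)
```

On the real data, the same helper could be applied to as.matrix(sora_df[, 2:ncol(sora_df)]). A TRUE result would suggest raw counts; FALSE would be consistent with the values already being normalized, although it would not by itself prove that TMM was the method used.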