How do you generate TMM normalized counts using edgeR?


Hi guys,

First I just want to say, I know this has been asked numerous times and in a number of places. However my confusion has increased progressively!

If someone could please set the record straight on how to generate TMM normalized counts using edgeR, I would be incredibly grateful!

This post has two answers that appear to disagree with one another: output TMM normalized counts with edgeR

The main reason I want to generate a TMM matrix from my count matrix is to compare gene expression levels within samples (to see the levels of multiple cell-type markers within a single sample) in a bar plot, and then generate the same bar plot for each of the other samples to compare between samples.

I think using TMM normalized counts will allow me to compare between samples and within samples according to here: hbctraining.github.io/Training-modules/planning_successful_rnaseq/lessons/sample_level_QC.html

I plan to use something like this to generate the bar plots: Network/Pathway Analysis from Mass Spec data

Any help would be appreciated.

Thank you in advance 🙂



Sorry this has been confusing. This has also been a regular source of frustration for the edgeR authors, because we have been saying the same thing for a decade:

If you want to export normalized expression values out of edgeR, just use cpm or rpkm.

The root of the confusion is that there is no such thing as a "TMM normalized count" because TMM normalizes the library sizes rather than the counts. And I have always resisted pressure to use the term "normalized count" in the edgeR documentation because a normalized value can no longer be a count. I prefer to use more descriptive and specific terms like cpm or rpkm. I know that other software tools refer to "normalized counts" but I find that unhelpful. Normalized for what?

TMM normalizes the library sizes to produce effective library sizes. cpm values are counts normalized by the effective library sizes. rpkm values are counts normalized by effective library sizes and by gene/feature length.
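As a rough illustration of the arithmetic described above, here is a minimal base R sketch. The counts, the normalization factors and the gene lengths are all made-up numbers chosen for easy arithmetic (in practice the factors would come from edgeR); only the relationship between the quantities is the point.

```r
# Toy count matrix: 4 genes x 2 samples (hypothetical values)
counts <- matrix(c(10, 20, 30, 40,
                   100, 200, 300, 400),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("A", "B")))

lib_size     <- colSums(counts)        # raw library sizes: 100 and 1000
norm_factors <- c(A = 1.25, B = 0.8)   # hypothetical TMM factors (assumed, not computed)

# Effective library size = library size x normalization factor
eff_lib_size <- lib_size * norm_factors

# cpm: counts divided by effective library size, per million
cpm_manual <- t(t(counts) / eff_lib_size) * 1e6

# rpkm: cpm additionally divided by gene length in kilobases
gene_length <- c(1000, 2000, 500, 4000)   # hypothetical lengths in bp
rpkm_manual <- cpm_manual / (gene_length / 1000)
```

Note that the counts themselves are never altered; only the denominator changes.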

A second source of confusion is that people seem to assume that edgeR must be storing "normalized counts" internally somehow, but it does not. Most edgeR DE pipelines never modify the original counts in any way. Normalization for library size is instead implicit as part of the model-fitting. edgeR does not use cpm or rpkm values internally in its DE pipelines, rather they are only for export or for graphical purposes.
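A small sketch of this point, using simulated counts (the matrix, seed and sample names are purely illustrative assumptions): after TMM normalization the counts stored in the DGEList are still the raw counts, and the only thing that has changed is the norm.factors column of the sample table.

```r
library(edgeR)

set.seed(1)  # simulated data, for illustration only
counts <- matrix(rnbinom(400, mu = 50, size = 5), nrow = 100,
                 dimnames = list(paste0("gene", 1:100), paste0("s", 1:4)))

y <- DGEList(counts = counts)
y <- calcNormFactors(y)   # TMM is the default method

# The stored counts are unchanged by normalization
counts_unchanged <- all(y$counts == counts)

# TMM only filled in the normalization factors in the sample table
y$samples[, c("lib.size", "norm.factors")]
```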

A third source of confusion is that the original edgeR pipeline (now called the "classic" pipeline) did compute pseudo.counts internally, which are equivalent to the original counts but with equalized effective library sizes. We did not intend or recommend that users would export these as normalized values but some have done so.

Example of posts by the edgeR authors:

The default normalization in edgeR can be broken down to two steps:


1) normalization by library size. This is simply the correction for read depth. While it may be good enough when there are no widespread changes in library composition (i.e. the samples are very similar and only very few genes are differential), it often is not. For an example, see my answer here (TMM-Normalization) using GTEx data, where I compare pancreas and lung transcriptomes, for which one would expect notably different gene expression profiles. As you'll see, plain per-million scaling results in biased normalized counts, while TMM manages to properly center the bulk of genes at y=0 in the MA-plot.


2) the introduction of normalization factors that correct the library size-scaled values for the compositional component. This is what the Trimmed Mean of M-values (TMM) does. For technical details see the original paper by Robinson & Oshlack in Genome Biology from 2010.


Points 1) and 2) are then combined into the effective library size (the library size multiplied by the normalization factor). Dividing the raw counts by the effective library size, scaled per million, gives the normalized values that are often referred to as TMM-normalized counts or cpm.

In practice:

library(edgeR)

#/ make the DGEList:
y <- DGEList(...)

#/ calculate TMM normalization factors:
y <- calcNormFactors(y)

#/ get the normalized counts (TMM-adjusted cpm, not on the log scale):
cpms <- cpm(y, log=FALSE)

The cpm function uses the normalization factors internally, provided that calcNormFactors was run on that DGEList.
If it was not, the normalization factors are still at their default of 1 and cpm simply returns the plain per-million scaled values.
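To make the difference concrete, here is a hedged sketch on simulated counts (the data, seed and object names are assumptions for illustration). It computes cpm with and without TMM factors and checks the TMM version against the manual formula, counts divided by lib.size times norm.factors, per million.

```r
library(edgeR)

set.seed(42)  # simulated data, for illustration only
counts <- matrix(rnbinom(800, mu = 100, size = 2), nrow = 200,
                 dimnames = list(paste0("gene", 1:200), paste0("s", 1:4)))

y_raw <- DGEList(counts = counts)   # norm.factors are all 1 at this point
y_tmm <- calcNormFactors(y_raw)     # TMM factors now stored in y_tmm$samples

cpm_plain <- cpm(y_raw, log = FALSE)  # plain per-million scaling
cpm_tmm   <- cpm(y_tmm, log = FALSE)  # scaled by effective library sizes

# Manual check: counts / (lib.size * norm.factors) * 1e6
eff_lib    <- y_tmm$samples$lib.size * y_tmm$samples$norm.factors
cpm_manual <- t(t(counts) / eff_lib) * 1e6
```

For bar plots of a few marker genes, the corresponding rows of cpm_tmm could be passed to a plotting function; cpm(y_tmm, log = TRUE) is an option if a log scale is preferred.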

