I have a normalization question about some RNAseq data that I'd like to apply CPM-TMM normalization to.

Say I have two sequencing batches:

Batch 1: n = 9 biological replicates, with 2 technical replicates/sample + several other samples and respective technical replicates I do not care about

Batch 2: n = 20 biological replicates, with 5 technical replicates/sample + several other samples and respective technical replicates I do not care about

Batch 1 and Batch 2 were sequenced separately, but for my downstream analysis I only care about the n = 9 biological replicates in Batch 1 and the n = 20 biological replicates in Batch 2.

My plan was to pool the raw counts for the n = 9 and n = 20 biological replicates (including their technical replicates, 118 in total) and then compute the TMM scaling factors on that joint data frame alone, since I'm only interested in comparing those 118 samples and none of the other samples in either batch. Then I'd average the CPM-TMM values for each biological replicate across its technical replicates. Is this the right thing to do?
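To make the plan concrete, here is a minimal sketch of what I have in mind, on toy data. It is a simplified re-implementation of TMM in plain Python, not edgeR's code (in practice I would use edgeR's `calcNormFactors(method = "TMM")` and `cpm()`). The function names, the library naming scheme (`batch_sample_techrep`), and the toy counts are all made up for illustration, and the sketch assumes no library is empty.

```python
import math
from collections import defaultdict


def upper_quartile(values):
    """75th percentile by simple index lookup (no interpolation)."""
    ordered = sorted(values)
    return ordered[int(0.75 * (len(ordered) - 1))]


def tmm_factors(counts, m_trim=0.30, a_trim=0.05):
    """Simplified TMM factors for {library: [raw count per gene]}."""
    lib_size = {s: sum(y) for s, y in counts.items()}
    # Reference: library whose scaled upper quartile is closest to the mean.
    uq = {s: upper_quartile(y) / lib_size[s] for s, y in counts.items()}
    mean_uq = sum(uq.values()) / len(uq)
    ref = min(counts, key=lambda s: abs(uq[s] - mean_uq))
    n_r = lib_size[ref]

    log2_f = {}
    for s, y in counts.items():
        n_k = lib_size[s]
        stats = []  # (M, A, variance) for genes with counts in both libraries
        for y_k, y_r in zip(y, counts[ref]):
            if y_k > 0 and y_r > 0:
                m = math.log2((y_k / n_k) / (y_r / n_r))
                a = 0.5 * math.log2((y_k / n_k) * (y_r / n_r))
                v = (n_k - y_k) / (n_k * y_k) + (n_r - y_r) / (n_r * y_r)
                if v > 0:
                    stats.append((m, a, v))
        # Trim the extremes of M (log ratio) and A (mean log abundance).
        n = len(stats)
        keep = set(range(n))
        for col, trim in ((0, m_trim), (1, a_trim)):
            order = sorted(range(n), key=lambda i: stats[i][col])
            cut = int(n * trim)
            keep &= set(order[cut:n - cut])
        # Inverse-variance weighted mean of M over the genes that remain.
        weight_sum = sum(1 / stats[i][2] for i in keep)
        log2_f[s] = (sum(stats[i][0] / stats[i][2] for i in keep) / weight_sum
                     if weight_sum else 0.0)
    # Rescale so the factors multiply to 1 across the chosen sample set.
    centre = sum(log2_f.values()) / len(log2_f)
    return {s: 2 ** (v - centre) for s, v in log2_f.items()}


def tmm_cpm(counts, factors):
    """CPM using the effective library size (library size x TMM factor)."""
    return {s: [1e6 * c / (sum(y) * factors[s]) for c in y]
            for s, y in counts.items()}


def average_technical(cpm, bio_of):
    """Average CPM-TMM per gene over the libraries of each biological sample."""
    groups = defaultdict(list)
    for lib, values in cpm.items():
        groups[bio_of[lib]].append(values)
    return {bio: [sum(col) / len(col) for col in zip(*libs)]
            for bio, libs in groups.items()}


# Toy data: only the libraries of interest, pooled across both batches.
counts = {
    "batch1_s1_t1": [10, 20, 30, 40, 50, 60, 70, 80],
    "batch1_s1_t2": [12, 18, 33, 41, 47, 66, 69, 85],
    "batch2_s1_t1": [20, 45, 55, 85, 95, 130, 150, 900],
    "batch2_s1_t2": [22, 40, 60, 80, 100, 120, 140, 950],
}
bio_of = {lib: lib.rsplit("_", 1)[0] for lib in counts}

factors = tmm_factors(counts)  # computed on the joint set only
bio_cpm = average_technical(tmm_cpm(counts, factors), bio_of)
print(factors)
print(bio_cpm)
```

In the real data, `counts` would hold the 118 libraries and nothing else, so the reference library and the final rescaling would both be chosen from within that set.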

Or should I compute TMM scaling factors for Batch 1 and Batch 2 separately? My understanding is that TMM accounts for inherent biases that depend on the biological condition of each sample, which would otherwise skew comparisons of expressed genes between samples (as in DE, though I'm not doing that), so the factors should be computed only on the samples you want to compare. If that is right, computing TMM scaling factors for Batch 1 and Batch 2 separately before comparing them would not be a good idea.
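For reference, this is my reading of the TMM definition, so please correct me if I have it wrong. The notation is my own shorthand: Y_gk is the raw count of gene g in library k, N_k is the library size, r is the reference library, G* is the set of genes left after trimming on M and A, and n is the number of libraries passed in.

```latex
\begin{align*}
M_{gk}^{r} &= \log_2 \frac{Y_{gk}/N_k}{Y_{gr}/N_r}, \qquad
A_{gk}^{r} = \tfrac{1}{2}\log_2\!\left(\frac{Y_{gk}}{N_k}\cdot\frac{Y_{gr}}{N_r}\right) \\
v_{gk}^{r} &= \frac{N_k - Y_{gk}}{N_k\,Y_{gk}} + \frac{N_r - Y_{gr}}{N_r\,Y_{gr}} \\
\log_2 \mathrm{TMM}_k^{r} &= \frac{\sum_{g \in G^*} M_{gk}^{r} / v_{gk}^{r}}{\sum_{g \in G^*} 1 / v_{gk}^{r}} \\
f_k &= \frac{\mathrm{TMM}_k^{r}}{\left(\prod_{j=1}^{n} \mathrm{TMM}_j^{r}\right)^{1/n}}, \qquad
\mathrm{CPM}_{gk} = \frac{10^{6}\, Y_{gk}}{N_k\, f_k}
\end{align*}
```

Both the choice of r and the rescaling of the f_k run over whichever libraries are passed in, so factors computed on the 118 libraries together need not match factors computed within each batch alone.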

Is this normalization strategy reasonable (i.e., computing TMM scaling factors only for the 118 samples)? I want to make sure I'm understanding the TMM paper correctly.


