How to normalize and transform RNA-seq data of different samples for PCA?


Hi Biostars Community,

I have searched the forum, but couldn't find a perfect answer for this.

I have downloaded some RNA-seq data from GEO (GSE112656). The dataset contains 11 osteoarthritis samples and 11 rheumatoid arthritis samples (22 in total).

I also have 4 samples of RNA-seq data generated in my lab, each with three technical replicates (12 in total):

  1. Drug A - treated
  2. Drug A - control (untreated)
  3. Drug B - treated
  4. Drug B - control (untreated)

I want to run PCA on the GEO data (GSE112656) together with my lab data to see which samples are similar and cluster together and which ones are different.

How should I normalize and transform these datasets before running the PCA?

I was considering following this post:
Which counts to use for RNA-seq heatmap and PCA?

Basically, I am planning to combine all the samples (22 + 12 = 34 samples) into one large data frame, compute log2-transformed TMM-normalized values, and then pass that table to a PCA function to plot PC1 against PC2 (or maybe PC1, PC2, and PC3) to see how the samples cluster and how similar or different they are to one another. I am planning to use TMM because HBC recommends it for comparisons both between samples and within samples. Would this be best practice?
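To make the plan concrete, here is a rough Python sketch of the pipeline I have in mind. It is only an illustration under some assumptions: the counts are random toy data rather than GSE112656 or my lab data, both tables are assumed to already share the same genes in the same row order, and it uses plain library-size CPM as a stand-in for TMM (in R, I understand TMM factors come from edgeR's `calcNormFactors`). The function names `log2_cpm` and `pca_samples` are my own, not from any package.

```python
import numpy as np

def log2_cpm(counts, prior_count=1.0):
    """Simplified log2 counts-per-million (NOT TMM; library-size scaling only).

    counts: genes x samples array of raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)          # total counts per sample
    cpm = counts / lib_sizes * 1e6          # scale each sample to one million reads
    return np.log2(cpm + prior_count)       # prior avoids log2(0)

def pca_samples(log_expr, n_components=2):
    """PCA with samples as observations, computed via SVD."""
    x = log_expr.T                          # samples x genes
    x = x - x.mean(axis=0)                  # center each gene across samples
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates on the PCs
    var_ratio = (s ** 2) / np.sum(s ** 2)             # fraction of variance per PC
    return scores, var_ratio[:n_components]

# Toy stand-ins for the two datasets (hypothetical values, same 200 genes in each).
rng = np.random.default_rng(0)
geo_counts = rng.poisson(50, size=(200, 22))   # 22 GEO samples
lab_counts = rng.poisson(50, size=(200, 12))   # 12 lab samples

combined = np.hstack([geo_counts, lab_counts])  # 200 genes x 34 samples
scores, var_ratio = pca_samples(log2_cpm(combined), n_components=3)
print(scores.shape)   # one row per sample, one column per PC
```

The `scores` columns would then be plotted against each other (PC1 vs. PC2, and so on), colored by dataset and condition.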

Please, if you can suggest something better, I would appreciate it.





