What would be preferred as input, the raw counts or the DEG?

The input to that tutorial is raw counts, which then undergo normalisation. All clustering algorithms that are then applied are based on the Z-transformed (by row/gene) CPM+0.25 values, as per these lines:

z <- cpm(y, normalized.lib.sizes=TRUE) # CPM from edgeR, using normalised library sizes

scaledata <- t(scale(t(z))) # Centers and scales data.

scaledata is then used as the input for clustering.
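To make the row-wise Z-transformation concrete, here is a minimal sketch on a made-up matrix. The values and the gene and sample names are hypothetical stand-ins for the cpm() output, not data from the tutorial.

```r
# Toy CPM-like matrix: 3 genes (rows) x 4 samples (columns); values are made up.
z <- matrix(c(10, 20, 30, 40,
               5,  5, 10, 20,
             100, 50, 25,  0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("geneA", "geneB", "geneC"), paste0("S", 1:4)))

# scale() standardises columns, so transpose, scale, and transpose back
# to give each gene (row) a mean of 0 and a standard deviation of 1.
scaledata <- t(scale(t(z)))

# Sanity checks: per-gene mean and standard deviation after scaling.
gene.means <- rowMeans(scaledata)
gene.sds   <- apply(scaledata, 1, sd)
print(round(gene.means, 10))
print(gene.sds)
```

Because every gene ends up on the same scale, the clustering groups genes by the shape of their profile across samples rather than by absolute expression level.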

If you then want to use the DEGs, just filter the scaledata object so that it contains only the DEGs, and then re-run the clustering. For example:

degs <- c('ATM','ERBB2','ERBB3','BRCC3')

scaledata.filt <- scaledata[degs,]
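A slightly more defensive version of that filtering step, followed by a re-clustering, might look like the sketch below. The toy matrix, the extra genes TP53 and MYC, the choice of k-means, and k = 2 are all assumptions for illustration; substitute the tutorial's own scaledata and whichever clustering method you used.

```r
set.seed(1)  # k-means uses random starts, so fix the seed for reproducibility

# Hypothetical stand-in for the tutorial's scaledata: 6 genes x 8 samples.
genes <- c("ATM", "ERBB2", "ERBB3", "BRCC3", "TP53", "MYC")
z <- matrix(rexp(6 * 8, rate = 0.1), nrow = 6,
            dimnames = list(genes, paste0("S", 1:8)))
scaledata <- t(scale(t(z)))

degs <- c("ATM", "ERBB2", "ERBB3", "BRCC3")

# intersect() guards against DEG names that are absent from the matrix;
# drop = FALSE keeps the result a matrix even if only one gene survives.
keep <- intersect(degs, rownames(scaledata))
scaledata.filt <- scaledata[keep, , drop = FALSE]

# Re-cluster the genes (rows) on the filtered matrix; k = 2 is arbitrary here.
km <- kmeans(scaledata.filt, centers = 2, nstart = 25)
print(km$cluster)
```

Indexing with a gene name that is not in rownames(scaledata) throws a "subscript out of bounds" error, which is why the intersect() step is worth having when the DEG list comes from a separate analysis.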

Furthermore, how could I make a dotplot of the genes and the clusters,
similar to the dotplot in this thread: "How to make k-means clustering
plot for relative expression?"

It may help if you clarify specifically what you are picturing in your head. While those figures may look colourful and 'nice', what they actually say is what matters to most non-sensationalistic journals. Is it:

  • plot of a single gene's expression per cluster?
  • plot of a summarised 'score' per cluster?
  • plot of a summarised score per gene per cluster (k-means center or PAM medoid)?

...what do you want to show?
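If it is the third option, one possible reading is sketched below in base R. Everything here is an assumption rather than the tutorial's method: the random toy data, clustering the samples (columns) rather than the genes, k = 3, and the encoding of the k-means center as dot size (magnitude) and colour (sign).

```r
set.seed(1)

# Hypothetical Z-scored matrix: 4 genes (rows) x 12 samples (columns).
genes <- c("ATM", "ERBB2", "ERBB3", "BRCC3")
raw <- matrix(rnorm(4 * 12), nrow = 4,
              dimnames = list(genes, paste0("S", 1:12)))
scaledata <- t(scale(t(raw)))

# Assumption: clusters are groups of samples, so cluster the columns.
km <- kmeans(t(scaledata), centers = 3, nstart = 25)

# k-means centers: one row per cluster, one column per gene.
centers <- km$centers

# Long format: one row per gene/cluster pair (as.vector() is column-major).
df <- data.frame(
  cluster = factor(rep(rownames(centers), times = ncol(centers))),
  gene    = factor(rep(colnames(centers), each = nrow(centers))),
  score   = as.vector(centers)
)

# Dot size encodes the magnitude of the center; colour encodes its sign.
plot(as.integer(df$cluster), as.integer(df$gene),
     cex = 1 + 3 * abs(df$score), pch = 19,
     col = ifelse(df$score > 0, "firebrick", "steelblue"),
     xaxt = "n", yaxt = "n", xlab = "Cluster", ylab = "Gene")
axis(1, at = seq_along(levels(df$cluster)), labels = levels(df$cluster))
axis(2, at = seq_along(levels(df$gene)), labels = levels(df$gene), las = 1)
```

For PAM, the medoids of the clustering would take the place of km$centers; the reshaping and plotting steps stay the same.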

