Kevin Blighe (Republic of Ireland), 2 hours ago

Deciding whether a sample is an outlier is usually a judgement call, one that comes with the experience of having worked on dozens, possibly hundreds, of datasets.

The numbers on the PCA axes are unfortunately not a good metric to use on their own.


Stat ellipse

You could instead generate a stat ellipse at the 95% confidence level, as I do HERE, where an outlier would be any sample falling outside of its respective group's ellipse.
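As a rough illustration of the ellipse idea, here is a minimal Python sketch that flags samples lying outside a 95% ellipse fitted to one group's PC1/PC2 coordinates. It assumes the group is roughly bivariate normal, so a sample is outside the ellipse when its squared Mahalanobis distance exceeds the chi-square quantile with 2 degrees of freedom. The function name `outside_ellipse` and the toy data are hypothetical, and plotting tools may draw their ellipses under a different distributional assumption, so the boundary may not match a plotted ellipse exactly.

```python
import math

def outside_ellipse(points, level=0.95):
    """Return indices of 2-D points outside the group's confidence ellipse.

    Assumes approximate bivariate normality (an assumption of this sketch).
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries (n - 1 denominator).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    # Chi-square quantile with 2 degrees of freedom has a closed form.
    threshold = -2.0 * math.log(1.0 - level)
    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mx, y - my
        # Squared Mahalanobis distance using the inverse 2x2 covariance.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > threshold:
            flagged.append(i)
    return flagged

# Toy group: a 4 x 4 grid of samples plus one distant sample.
vals = [-1.5, -0.5, 0.5, 1.5]
group = [(x, y) for x in vals for y in vals] + [(10.0, 10.0)]
print(outside_ellipse(group))
```

Run this once per group, since each group gets its own ellipse. Note that the outlier itself inflates the covariance estimate, so very small groups may never produce a flagged sample.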



You could also generate Z-scores from the PC1 values and call an outlier anything with |Z| > 3 or, more stringently, |Z| > 6.
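The Z-score approach can be sketched in a few lines of Python. This is only an illustration: the function name `zscore_outliers` and the toy PC1 values are made up, and the sample standard deviation is used for scaling.

```python
import statistics

def zscore_outliers(pc1, cutoff=3.0):
    """Return indices of samples whose PC1 Z-score exceeds the cutoff in absolute value."""
    mean = statistics.mean(pc1)
    sd = statistics.stdev(pc1)  # sample standard deviation
    return [i for i, v in enumerate(pc1) if abs((v - mean) / sd) > cutoff]

# Toy PC1 values: 19 samples near zero and one far away.
pc1 = [0.1 * i for i in range(-9, 10)] + [50.0]
print(zscore_outliers(pc1, cutoff=3.0))
print(zscore_outliers(pc1, cutoff=6.0))
```

Bear in mind that with n samples the largest possible |Z| is (n - 1) / sqrt(n), so a cut-off of 3 can only be exceeded with at least 11 samples, and a cut-off of 6 needs considerably more.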


Hierarchical clustering

In a dendrogram, an outlier will lie in its own branch, which may extend from the very root of the tree. You can again attempt to quantify these by setting cut-offs based on the distance metric that's used. For example, if a sample branches off into its own leaf / node at a height (Euclidean distance) of 8, then it may be an outlier.
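To make the height cut-off concrete, here is a small, deliberately naive Python sketch. It runs agglomerative clustering on Euclidean distances and records the height at which each sample stops being a lone leaf, then flags samples that only join the tree at or above a cut-off. Complete linkage is an assumption of this sketch, as are the function names and the toy data; the cut-off of 8 is just the example value from above.

```python
import math

def first_merge_heights(points):
    """Complete-linkage clustering; return the height at which each sample first merges."""
    clusters = [[i] for i in range(len(points))]
    heights = [None] * len(points)

    def linkage(a, b):
        # Complete linkage: largest pairwise Euclidean distance between clusters.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        # A singleton that merges now leaves its own leaf at height d.
        for c in (clusters[x], clusters[y]):
            if len(c) == 1:
                heights[c[0]] = d
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return heights

def dendrogram_outliers(points, cutoff):
    """Indices of samples that only join the tree at or above the cutoff height."""
    heights = first_merge_heights(points)
    return [i for i, h in enumerate(heights) if h is not None and h >= cutoff]

samples = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(dendrogram_outliers(samples, cutoff=8))
```

The loop is cubic or worse in the number of samples, which is fine for a few dozen samples but not for large datasets; a proper clustering library would be the better choice there.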

Take a quick look at what I do here: A: extract dendrogram cluster from pheatmap



  • Cook's Distance: this measures how much a fitted regression changes when a single observation is removed, and is also routinely used in statistics.
  • +/- 1.5 * IQR: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged. This is commonly used in statistics and there is much material online about it.
  • Bonferroni test on studentised residuals: if you feel up for it, you can try to implement
    this, but it depends on your input data. I cannot really see it being
    used in your case.
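Of the options in this list, the 1.5 * IQR rule is the simplest to try. A minimal Python sketch, assuming a single numeric vector (for example, the PC1 values); the function name `iqr_outliers` and the toy values are hypothetical, and different quartile definitions will move the fences slightly:

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```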
