Interpretation of Principal component analysis

Hi all,

I understand that in PCA, PC1 and PC2 are the two components that explain the most variance in a dataset.
Is there any threshold for the percentage of variance explained that lets us conclude PC1 is 'good' or 'bad'? How does the percentage of variance relate to clustering?

In a particular case, suppose one dataset shows 2 clusters that are clearly separated, with PC1 explaining ~20% of the variance, while another shows 2 clusters that are not really separated, but with PC1 explaining ~70%. Can we conclude that one is more trustworthy than the other?

Thanks for your help!



PCA is not a clustering technique; its purpose is dimensionality reduction. In many cases, data points end up grouping into clusters after dimensionality reduction, which makes it easier to see that they are related, but that is a secondary benefit of PCA. Similarly, the purpose of autoencoders is not clustering, yet their latent representations are useful for clustering.

The more variance the leading principal components explain, the better PCA serves its intended dimensionality-reduction purpose. So PC1 and PC2 explaining 20% and 15% of the variance is an inferior solution to PC1 and PC2 explaining 70% and 25%, respectively. In the former case (35% in total) you would need more than 2 PCs to represent your original data confidently, while in the latter (95% in total) PC1 and PC2 would most likely be enough. However, it can happen, exactly as you showed, that the solution with superior PCs (on the right) gives less clean clusters than the inferior PCA solution. That comes down to the intrinsic separability of the data points, i.e. whether they are intrinsically clusterable. Not all data will give clearly separated clusters, even when PCA explains most or all of the variance with only 2 components.
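To make the point concrete, here is a minimal sketch with synthetic data (not your data). It assumes NumPy is available, and the helper names `pca_scores` and `separation_on_pc1`, the dataset sizes, and the shift values are all made up for illustration. Dataset A has two well-separated groups buried in many noisy dimensions, so PC1 explains only a small share of the variance. Dataset B has one strongly elongated cloud with barely shifted groups, so PC1 explains most of the variance but the groups overlap.

```python
import numpy as np

def pca_scores(X):
    """Return (PC scores, explained-variance ratios) via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    ratios = s**2 / np.sum(s**2)   # fraction of total variance per component
    return Xc @ Vt.T, ratios

def separation_on_pc1(scores, labels):
    """Absolute difference of group means on PC1, in pooled standard deviations."""
    a = scores[labels == 0, 0]
    b = scores[labels == 1, 0]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

rng = np.random.default_rng(0)      # fixed seed so the sketch is reproducible
n, d = 200, 20                      # hypothetical sample and feature counts
labels = np.repeat([0, 1], n // 2)
shift = np.where(labels == 0, -1.0, 1.0)

# Dataset A: groups clearly separated on one feature, 19 other noisy features.
A = rng.normal(size=(n, d))
A[:, 0] += 2.0 * shift

# Dataset B: one feature with very large spread, groups only slightly shifted.
B = rng.normal(size=(n, d))
B[:, 0] = 6.5 * B[:, 0] + 1.0 * shift

scores_a, ratios_a = pca_scores(A)
scores_b, ratios_b = pca_scores(B)
sep_a = separation_on_pc1(scores_a, labels)
sep_b = separation_on_pc1(scores_b, labels)

print(f"A: PC1 explains {ratios_a[0]:.0%}, group separation on PC1 = {sep_a:.2f} SD")
print(f"B: PC1 explains {ratios_b[0]:.0%}, group separation on PC1 = {sep_b:.2f} SD")
```

In this toy setup the low-variance PC1 separates the groups far better than the high-variance one, which is why the percentage of variance explained by itself says little about how trustworthy the clusters are.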
