I'm not sure the hypergeometric test is the best approach here. The proportion of cells in one cluster depends on the proportions of cells in all other clusters, so a proportionally large increase in just one cell type can show up as apparent depletion of every other cell type. This is often called compositional bias. Furthermore, differences in sequencing depth can also lead to differences in cell proportions, since smaller populations experience higher variance at varying sequencing depths. You really want to model, or at least approximately model, the counts as a multinomial distribution with additional statistical considerations.
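To make the compositional effect concrete, here is a minimal sketch with made-up numbers (the cell type names and counts are hypothetical, not taken from any dataset). Only one population changes in absolute terms, yet every other population's proportion drops:

```python
# Hypothetical absolute cell numbers per cell type in two conditions.
# Only "T cells" changes in absolute terms; all other types are held fixed.
control = {"T cells": 1000, "B cells": 1000, "Monocytes": 1000, "NK cells": 1000}
treated = {**control, "T cells": 5000}


def proportions(counts):
    """Convert absolute counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


p_control = proportions(control)
p_treated = proportions(treated)

# Print the per-type proportions side by side for the two conditions.
for cell_type in control:
    print(
        f"{cell_type:10s} control={p_control[cell_type]:.3f} "
        f"treated={p_treated[cell_type]:.3f}"
    )
```

The B cells, monocytes and NK cells each fall from a quarter to an eighth of the sample even though their absolute numbers are unchanged, so a per-cluster test on proportions would flag all three as depleted.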

Several notable methods have been developed to deal with these problems. In [Bioconductor OSCA](https://bioconductor.org/books/release/OSCA/multi-sample-comparisons.html#differential-abundance) they treat differential abundance similarly to differential gene expression. They model each population using a negative binomial distribution, then correct for library size differences, stabilize the variance, and later consider compositional bias in a post-hoc manner. An alternative approach, [scCODA](https://github.com/theislab/scCODA), models the question more directly using an actual multinomial distribution. They further account for uncertainty in clustering by taking a Bayesian approach, and also account for overdispersion using a Dirichlet-multinomial. There are also methods such as [DA-seq](https://www.biorxiv.org/content/10.1101/711929v3) th ...
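As an illustration of why overdispersion matters for cell counts, the following sketch compares a plain multinomial with a Dirichlet-multinomial using only the Python standard library. It is not scCODA's model or API; the proportions, the concentration value, and the helper names `multinomial` and `dirichlet` are all assumptions chosen for the demonstration.

```python
import random
import statistics

random.seed(0)

mean_props = [0.1, 0.3, 0.6]  # hypothetical expected cell-type proportions
concentration = 20.0          # hypothetical; smaller values give more sample-to-sample variability
n_cells = 200                 # cells captured per sample
n_samples = 1000              # number of simulated samples


def multinomial(n, probs):
    """Draw one vector of counts for n cells with the given type probabilities."""
    counts = [0] * len(probs)
    for idx in random.choices(range(len(probs)), weights=probs, k=n):
        counts[idx] += 1
    return counts


def dirichlet(alphas):
    """Draw one probability vector from a Dirichlet via normalized gamma variates."""
    draws = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


alphas = [concentration * p for p in mean_props]

# Count of the first (rarest) cell type across simulated samples.
plain = [multinomial(n_cells, mean_props)[0] for _ in range(n_samples)]
overdispersed = [
    multinomial(n_cells, dirichlet(alphas))[0] for _ in range(n_samples)
]

var_plain = statistics.variance(plain)
var_dm = statistics.variance(overdispersed)
print(f"multinomial variance:           {var_plain:.1f}")
print(f"Dirichlet-multinomial variance: {var_dm:.1f}")
```

Both simulations share the same mean proportions, but letting the per-sample proportions vary inflates the variance of the counts by a theoretical factor of (n + α₀) / (1 + α₀), where n is the number of cells and α₀ is the sum of the Dirichlet parameters. A test that assumes plain multinomial or hypergeometric sampling would understate that variability between replicates.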


