# Correct way of analyzing cell proportions in single-cell data

Hello

In Seurat there is a function to compute the proportion of cells in each identity class, so you can easily plot it with ggplot2 or something similar. However, most scRNA-seq datasets I have seen (I mostly reanalyze data) have different sample sizes for each condition, so I suspect that just taking the proportions of cells is not adequate and that some normalization is needed. The first thing that comes to mind is dividing the number of cells in each identity by the number of conditions, but that still doesn't make much sense to me, as samples from the same condition can also vary a lot in their cell identities. Here the authors plot the log2 of relative proportions, which I believe is a Z-score, but it still looks a bit odd to me, as they have different numbers of samples in each status.

I couldn't find any Seurat vignette addressing this. Any solutions? Does my concern make sense?

To compare cell proportions between conditions, I've found a Monte Carlo permutation test to be the most sensible and robust approach. The null hypothesis you want to test against is that the difference in cell proportions for each cluster between conditions is just a consequence of randomly sampling some number of cells for sequencing in each condition. To generate the null distribution, you pool the cells from both conditions together and then randomly split them back into two groups, keeping the original sample sizes. You then recalculate the proportional difference between the two groups for each cluster and compare it to the observed proportional difference for that cluster. I tend to use the log2 fold difference in proportions, since it's a more sensible scale. Repeat this process about 10,000 times. The p-value for a cluster is then the number of simulations where the simulated proportional difference was as extreme as or more extreme than the observed one (plus one), divided by the total number of simulations (plus one).
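The procedure above can be sketched in a few lines. This is only an illustration in plain Python, not the code from the R library mentioned below; the function names, the toy cluster labels, and the choice to treat "as or more extreme" as a two-sided comparison on the absolute log2 fold difference are all assumptions made for the example.

```python
import math
import random
from collections import Counter


def log2_fold_difference(count_a, n_a, count_b, n_b):
    """log2(proportion in B / proportion in A) for a single cluster."""
    p_a, p_b = count_a / n_a, count_b / n_b
    if p_a == 0 and p_b == 0:
        return 0.0
    if p_a == 0:
        return math.inf   # cluster absent from A (assumption: treat as +inf)
    if p_b == 0:
        return -math.inf  # cluster absent from B (assumption: treat as -inf)
    return math.log2(p_b / p_a)


def permutation_test(labels_a, labels_b, n_permutations=10_000, seed=0):
    """Per-cluster observed log2 fold difference and permutation p-value.

    labels_a / labels_b hold one cluster label per cell in each condition.
    """
    rng = random.Random(seed)
    n_a, n_b = len(labels_a), len(labels_b)
    pooled = list(labels_a) + list(labels_b)
    totals = Counter(pooled)
    clusters = sorted(totals)

    obs_a, obs_b = Counter(labels_a), Counter(labels_b)
    observed = {c: log2_fold_difference(obs_a[c], n_a, obs_b[c], n_b)
                for c in clusters}

    extreme = dict.fromkeys(clusters, 0)
    for _ in range(n_permutations):
        # Pool the cells, then split them back keeping the original sizes.
        rng.shuffle(pooled)
        perm_a = Counter(pooled[:n_a])
        for c in clusters:
            sim = log2_fold_difference(perm_a[c], n_a,
                                       totals[c] - perm_a[c], n_b)
            if abs(sim) >= abs(observed[c]):
                extreme[c] += 1

    # (count + 1) / (simulations + 1), as described above.
    p_values = {c: (extreme[c] + 1) / (n_permutations + 1) for c in clusters}
    return observed, p_values


# Toy data: 300 cells in condition A, 600 cells in condition B.
cells_a = ["T"] * 150 + ["B"] * 100 + ["Mono"] * 50
cells_b = ["T"] * 300 + ["B"] * 100 + ["Mono"] * 200

observed, p_values = permutation_test(cells_a, cells_b, n_permutations=2_000)
for cluster in observed:
    print(cluster, round(observed[cluster], 3), round(p_values[cluster], 4))
```

In this toy data the T cluster makes up the same proportion of both conditions, so every permutation is at least as extreme as the observed difference of zero and its p-value is 1. With many clusters you would also want to adjust the p-values for multiple testing, which this sketch leaves out.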

Since I found myself having to do this so many times, I made a little R package for myself that takes a Seurat object and runs a permutation test to get p-values (and adjusted p-values). It also generates a plot with the observed proportional difference and a bootstrapped confidence interval for each cluster.

github.com/rpolicastro/scProportionTest 
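For the bootstrapped confidence interval, one common approach is a percentile bootstrap: resample cells with replacement within each condition, recompute the log2 fold difference, and take the middle 95% of the resampled values. The sketch below assumes that approach; I have not checked that it matches how scProportionTest computes its interval, and the function name, the toy data, and the decision to skip resamples where the cluster is absent from one condition are all choices made for illustration.

```python
import math
import random


def bootstrap_log2fd_ci(labels_a, labels_b, cluster,
                        n_boot=2_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for log2(prop in B / prop in A)."""
    rng = random.Random(seed)
    n_a, n_b = len(labels_a), len(labels_b)
    stats = []
    for _ in range(n_boot):
        # Resample cells with replacement within each condition.
        boot_a = rng.choices(labels_a, k=n_a)
        boot_b = rng.choices(labels_b, k=n_b)
        p_a = boot_a.count(cluster) / n_a
        p_b = boot_b.count(cluster) / n_b
        if p_a == 0 or p_b == 0:
            continue  # log2 ratio undefined; skipped here as a simplification
        stats.append(math.log2(p_b / p_a))
    if not stats:
        raise ValueError("cluster missing from every usable resample")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Toy data: Mono cells are 50/300 in condition A and 200/600 in condition B.
cells_a = ["T"] * 150 + ["B"] * 100 + ["Mono"] * 50
cells_b = ["T"] * 300 + ["B"] * 100 + ["Mono"] * 200

low, high = bootstrap_log2fd_ci(cells_a, cells_b, "Mono", n_boot=1_000)
print("Mono log2 fold difference, 95% interval:", round(low, 2), round(high, 2))
```

Note that this interval only reflects cell-level sampling noise within the two pooled conditions, which is the same assumption the permutation test makes.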