# How to estimate total variance within and between RNA-seq datasets?

I have an scRNA-seq dataset, and I want to look at the proportion of variance between "samples", or even between different datasets, batches, and so on.

Some people do batch-effect correction and then show a bar plot of "percent explained variance by batch".

I want to do something similar: find the percent explained variance between different conditions and comparisons.

My protocol so far is similar to finding R² in linear regression:

• Step 1: SStotal = sum of squares over all samples
• Step 2: SSwithin = sum of squares within samples
• Step 3: % variance explained = (SStotal - SSwithin) / SStotal, or something similar.
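
For a single gene, these steps could be sketched as below. This is only a minimal sketch under my own assumptions: the sum of squares over all samples is treated as the total, the explained part is total minus within, and the function name `variance_explained` and the toy values are made up for illustration.

```python
import numpy as np

def variance_explained(x, groups):
    """Fraction of one gene's total sum of squares explained by the grouping.

    x      : expression values of a single gene, one per sample (hypothetical input)
    groups : group label (e.g. batch) for each sample
    """
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)

    # Step 1: sum of squares over all samples, around the overall mean
    ss_total = np.sum((x - x.mean()) ** 2)

    # Step 2: sum of squares within each group, around that group's mean
    ss_within = 0.0
    for g in np.unique(groups):
        xg = x[groups == g]
        ss_within += np.sum((xg - xg.mean()) ** 2)

    # Step 3: share of the total that is not within-group variation
    if ss_total == 0:
        return np.nan  # constant gene: the ratio is undefined
    return (ss_total - ss_within) / ss_total

# Toy data: six samples in two batches (values are invented)
expr = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
batch = ["a", "a", "a", "b", "b", "b"]
frac = variance_explained(expr, batch)
print(f"fraction explained by batch: {frac:.3f}")
```

This handles one gene only, which is exactly where my problem starts.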

The problem is that each sample has 20,000 genes, each with its own variance. So how do I estimate the total variance across all genes for a group of samples?

I know how to do it for one gene: it is simply sum( (mean - xi)² ), where xi is the expression of the gene in sample i. But since there are many genes, each has its own variance. How do I calculate the total sample variance for all genes?

The simplest would be to sum them, but then a few outlier samples with high expression / variance would skew the total. What is the standard way to estimate group variance in batch correction or similar situations?
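
To make the concern concrete, here is a rough sketch of the "just sum them" approach on a genes x samples matrix. I am not claiming this is the standard method. The matrix `X`, the labels, and the helper name `per_gene_sums_of_squares` are all hypothetical toy choices.

```python
import numpy as np

def per_gene_sums_of_squares(X, groups):
    """Return (ss_total, ss_within) per gene for a genes x samples matrix X."""
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)

    # Total sum of squares per gene, around each gene's overall mean
    ss_total = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

    # Within-group sum of squares per gene, accumulated over groups
    ss_within = np.zeros(X.shape[0])
    for g in np.unique(groups):
        Xg = X[:, groups == g]
        ss_within += ((Xg - Xg.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    return ss_total, ss_within

# Toy matrix: 3 genes x 6 samples (values are invented)
X = np.array([
    [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],              # clear batch difference
    [5.0, 5.5, 4.5, 5.0, 4.5, 5.5],              # no batch difference
    [100.0, 900.0, 500.0, 300.0, 700.0, 500.0],  # high variance, no batch difference
])
groups = ["a", "a", "a", "b", "b", "b"]

ss_total, ss_within = per_gene_sums_of_squares(X, groups)

# "Simplest" pooled estimate: sum the per-gene sums of squares first, then take the ratio
pooled = (ss_total.sum() - ss_within.sum()) / ss_total.sum()

# Share of the pooled total contributed by each gene
share = ss_total / ss_total.sum()
print("pooled fraction explained:", pooled)
print("per-gene share of total:", share)
```

In this toy case the third gene contributes almost all of the summed total, so the pooled fraction is close to zero even though the first gene differs clearly between batches. That is the skew I am worried about.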