I have a large collection of public RNA-seq datasets from different experiments and genotypes that share similar conditions/tissues, and I want to build gene co-expression networks from them.
I processed every sample down to raw counts. I then removed the low-depth samples (those with fewer than 3 million reads in total) and the lowly expressed genes (keeping only genes with more than 2 CPM in at least 80% of samples). I ended up with a reasonable matrix of >1200 samples x 15k genes, which shows a nice bell-shaped distribution after a log(x + 1) transformation.
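For reference, the filtering described above can be sketched roughly like this. It is only a minimal NumPy illustration, not my exact pipeline: the function name `filter_counts`, its parameter names, and the genes x samples layout are assumptions made for the example, and the CPMs here are computed from the library sizes of the retained samples before any gene is dropped.

```python
import numpy as np

def filter_counts(counts, min_library_size=3_000_000, min_cpm=2.0, min_fraction=0.8):
    """Filter a genes x samples matrix of raw counts (hypothetical helper).

    Returns the filtered matrix plus boolean masks for kept genes and samples.
    """
    counts = np.asarray(counts, dtype=float)

    # 1) Drop samples whose total read count is below the threshold.
    library_sizes = counts.sum(axis=0)
    keep_samples = library_sizes >= min_library_size
    kept = counts[:, keep_samples]

    # 2) Counts per million, using each retained sample's library size.
    cpm = kept / kept.sum(axis=0) * 1e6

    # 3) Keep genes above min_cpm in at least min_fraction of retained samples.
    fraction_expressed = (cpm > min_cpm).mean(axis=1)
    keep_genes = fraction_expressed >= min_fraction

    return kept[keep_genes], keep_genes, keep_samples


# Toy example with tiny thresholds so the effect is visible.
toy = np.array([
    [10, 20, 30, 1],   # expressed gene
    [0,  0,  0,  0],   # never expressed, should be dropped
    [90, 80, 70, 1],   # expressed gene
])
filtered, gene_mask, sample_mask = filter_counts(toy, min_library_size=50)

# Plus-one log transform of the filtered raw counts, for a quick look
# at the distribution only (not a replacement for proper normalization).
log_counts = np.log1p(filtered)
```

The returned matrix still holds raw counts, so it can be passed to whichever normalization step comes next.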
Now, the next steps are to normalize the counts and then try to deal with batch effects. I read that a variance-stabilizing transformation (VST) is a better way to normalize than log(), and this is where my doubts start:
Should the VST be applied to the raw counts, or to the CPMs?
Should the VST be applied to all the samples that passed filtering together? Or should it be applied per batch, or at least per tissue?
Either way, if I later reject some samples as outliers, I will have to repeat the VST without them, right?
Thanks in advance!