As part of a clinical high-throughput genomics project, we have gathered a large number of RNA-Seq samples from patients with different solid tumors who underwent conventional therapy prior to sequencing. All the data have been uniformly processed in R. The major issue is that we would like to perform differential expression analysis, or apply machine learning techniques, to select the most differentially expressed (DE) or most informative genes relative to some reference sample group, but unfortunately we do not have any normal or control reference samples for the whole cohort.
I thought of the naive idea of using an external source of normal data, such as GTEx. However, my main concern is that batch effect correction (e.g. with ComBat) might still not be applicable, because study (batch) and sample type would be completely confounded: the tumor samples come only from our cohort and the normal samples only from the external study, so neither study contains both sample types.
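To make the confounding concern concrete, here is a minimal sketch in Python. The study names, sample-type labels, and helper functions are hypothetical and only for illustration; they are not our actual annotation or pipeline. It tabulates which sample types occur in each study and flags the case where every study contains a single sample type, which is the situation in which a batch term and a tumor-vs-normal term cannot be separated.

```python
from collections import defaultdict

def types_per_study(samples):
    """Map each study (batch) to the set of sample types it contains."""
    table = defaultdict(set)
    for study, sample_type in samples:
        table[study].add(sample_type)
    return dict(table)

def is_fully_confounded(samples):
    """True if there are at least two sample types overall, yet every
    study contains only one of them (sample type is determined by study)."""
    table = types_per_study(samples)
    all_types = set().union(*table.values())
    return len(all_types) > 1 and all(len(t) == 1 for t in table.values())

# Hypothetical annotation: tumors only in-house, normals only from GTEx.
confounded = [("in_house", "tumor")] * 4 + [("GTEx", "normal")] * 4

# Hypothetical alternative: the in-house study also has a few normals.
partly_balanced = confounded + [("in_house", "normal")] * 2

print(types_per_study(confounded))
print(is_fully_confounded(confounded))
print(is_fully_confounded(partly_balanced))
```

In the first layout any tumor-vs-normal difference is indistinguishable from a study difference, which is exactly my worry about combining our cohort with GTEx.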
Any ideas or suggestions on how this issue might be addressed?