Hi, I am trying to do PCA analysis for my samples for initial quality control. I have 2 different sets of samples - one were sequenced 50bp and other was sequenced 75bp (both of them have disease and control cases). To do the PCA on those samples, I ran DEseq2 on them (which necessarily requires non-normalized counts), followed by vst and plotPCA. But in my PCA plot, I see two clusters - one for the 50bp samples and the other for 75 bp samples. This is not necessarily expected, since there is nothing different between the samples except for sequencing depth. Someone said I should normalize my data. But I think that will be taken care of by DESeq2, and it anyway shouldn't be fed normalized counts. Here is the plot.

Any suggestions?
PCA plot link

Here is the image link in case it doesn't show - imgur.com/CFXHky4

Here is my code. Here P and NP mean Pain and Non-Pain, which is the effect I am studying. Some files are 50bp, others are 75bp, and I have not provided that info to DESeq -

counts_all_fullGTF <- featureCounts(nthreads=3, isGTFAnnotationFile=TRUE, annot.ext='/Volumes/bam/DRG/annotations/Homo_sapiens.GRCh38.95.gtf', files=c('/Volumes/.../47T7L.fastqAligned.sortedByCoord.out.bam','/Volumes/.../47T7R.fastqAligned.sortedByCoord.out.bam'))$counts

sampleTable_all <- data.frame(condition=factor(c('P','NP','P','P','P','P','NP','P','P','NP','P','P','P','P','P','P','NP','P','P','NP','P','P','P','P','P','NP','NP','NP','NP','P','P','P','P','P')))

coldata <- sampleTable_all
deseqdata_fullGTF <- DESeqDataSetFromMatrix(countData=counts_all_fullGTF, colData=coldata, design=~condition)
dds_fullGTF <- DESeq(deseqdata_fullGTF)
vsd_fullGTF <- vst(dds_fullGTF)
library(ggrepel)
plotPCA(vsd_fullGTF, ntop=5000, "condition") + geom_label_repel(aes(label=substr(name, start = 1, stop = 6), colour = "white", fontface = "bold"))



Source link