I was recently going through the concepts required to remove batch effects from gene expression data. In particular, I was going through this tutorial by Jeff Leek and trying out the codes mentioned there. From what I have understood from the analysis, SVA is similar to PCA and tries to find two components that try to explain any unknown variations in the data (or batch effects) in an unsupervised fashion. However, I wanted to clarify one plot from the tutorial. First, they derive the sva components using the following code.
mod = model.matrix(~cancer,data=pheno)
mod0 = model.matrix(~1, data=pheno)
sva1 = sva(edata,mod,mod0,n.sv=2)
Then they see whether any of the sva components correlate with the batch effects
summary(lm(sva1$sv ~ pheno$batch))
They get the following results
## Response Y1 :
##
## Call:
## lm(formula = Y1 ~ pheno$batch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26953 -0.11076 0.00787 0.10399 0.19069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.018470 0.038694 -0.477 0.635
## pheno$batch 0.006051 0.011253 0.538 0.593
##
## Residual standard error: 0.1345 on 55 degrees of freedom
## Multiple R-squared: 0.00523, Adjusted R-squared: -0.01286
## F-statistic: 0.2891 on 1 and 55 DF, p-value: 0.5929
##
##
## Response Y2 :
##
## Call:
## lm(formula = Y2 ~ pheno$batch)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.23973 -0.07467 -0.02157 0.08116 0.25629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.121112 0.034157 3.546 0.000808 ***
## pheno$batch -0.039675 0.009933 -3.994 0.000194 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1187 on 55 degrees of freedom
## Multiple R-squared: 0.2248, Adjusted R-squared: 0.2107
## F-statistic: 15.95 on 1 and 55 DF, p-value: 0.0001945
It is quite clear that the second component is interesting because of the fit having a significant p value. Now they draw a boxplot with the second component as the y axis and the batch effects as the x axis.
boxplot(sva1$sv[,2] ~ pheno$batch)
This is where I am stuck. I am unable to interpret this plot.