gravatar for banerjeeshayantan

2 hours ago by

I was recently going through the concepts required to remove batch effects from gene expression data. In particular, I was going through this tutorial by Jeff Leek and trying out the codes mentioned there. From what I have understood from the analysis, SVA is similar to PCA and tries to find two components that try to explain any unknown variations in the data (or batch effects) in an unsupervised fashion. However, I wanted to clarify one plot from the tutorial. First, they derive the sva components using the following code.

mod = model.matrix(~cancer,data=pheno)
mod0 = model.matrix(~1, data=pheno)
sva1 = sva(edata,mod,mod0,n.sv=2)

Then they see whether any of the sva components correlate with the batch effects

summary(lm(sva1$sv ~ pheno$batch))

They get the following results

## Response Y1 :
## 
## Call:
## lm(formula = Y1 ~ pheno$batch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26953 -0.11076  0.00787  0.10399  0.19069 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.018470   0.038694  -0.477    0.635
## pheno$batch  0.006051   0.011253   0.538    0.593
## 
## Residual standard error: 0.1345 on 55 degrees of freedom
## Multiple R-squared:  0.00523,    Adjusted R-squared:  -0.01286 
## F-statistic: 0.2891 on 1 and 55 DF,  p-value: 0.5929
## 
## 
## Response Y2 :
## 
## Call:
## lm(formula = Y2 ~ pheno$batch)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.23973 -0.07467 -0.02157  0.08116  0.25629 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.121112   0.034157   3.546 0.000808 ***
## pheno$batch -0.039675   0.009933  -3.994 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1187 on 55 degrees of freedom
## Multiple R-squared:  0.2248, Adjusted R-squared:  0.2107 
## F-statistic: 15.95 on 1 and 55 DF,  p-value: 0.0001945

It is quite clear that the second component is interesting because of the fit having a significant p value. Now they draw a boxplot with the second component as the y axis and the batch effects as the x axis.

boxplot(sva1$sv[,2] ~ pheno$batch)

This is where I am stuck. I am unable to interpret this plot. enter image description here



Source link