Edit March 5th, 2020: there is no definitive answer to the OP's question. My original answer (below) is merely intended to offer ideas on how to approach it.



Hi Jack,

To achieve this, you will have to model your expression data using the PAM50 classifier genes and then let the model decide which genes best predict the breast cancer sub-types.

Keep in mind that PAM50 is already a relatively small classifier that can segregate the different sub-types.

Multinomial logistic regression is a natural choice because your aim is to use gene expression to predict a single categorical outcome with more than two levels (the sub-types). I would also recommend trying out lasso-penalised regression, which is what I go over in this answer (below).

You'll require:

  • A gene expression dataset filtered to the PAM50 genes (if you don't know which genes these are, see: Where To Download Pam50 Gene Set?)
  • The breast cancer sub-type of each sample
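
A minimal sketch of assembling that data frame, assuming a numeric expression matrix exprs (genes × samples), a character vector pam50Genes, and a vector subtypes aligned to the samples (all of these names are hypothetical placeholders):

# subset the expression matrix to the PAM50 genes and transpose it so
# that samples become rows, then prepend the sub-type labels
pam50Expr <- t(exprs[rownames(exprs) %in% pam50Genes, ])
modellingLasso <- data.frame(
  SubType=subtypes,
  pam50Expr)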


require(glmnet)

# Arrange your data like this (samples as rows, SubType as the first column):
modellingLasso
           SubType   MYC  EGFR KIF2C CDC20
Sample 1  LuminalA 11.39 10.62  9.75 10.34
Sample 2  LuminalB 10.16  8.63  8.68  9.08
Sample 3  LuminalA  9.29 10.24  9.89 10.11
Sample 4      Her2 11.53  9.22  9.35  9.13
Sample 5      Her2  8.35 10.62 10.25 10.01
Sample 6      Her2 11.71 10.43  8.87  9.44
...

The SubType variable should be encoded as a factor:

modellingLasso$SubType
...
Levels: Her2 LuminalA LuminalB ...
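
If it is not already a factor, one line converts it:

# encode the sub-type as a factor so glmnet treats it as a categorical outcome
modellingLasso$SubType <- factor(modellingLasso$SubType)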

Now we can build the model:


# fit the lasso (alpha=1) across a decreasing sequence of λ values
lassoModel <- glmnet(
  x=data.matrix(modellingLasso[,-1]),
  y=modellingLasso$SubType,
  standardize=TRUE,
  alpha=1.0,
  family="multinomial")

# plot coefficient profiles vs. shrinkage parameter, λ (lambda)
plot(lassoModel, xvar="lambda")

[Plot a: lasso coefficient profiles vs. log(λ)]


# cross-validation to tune λ; parallel=TRUE requires a registered
# parallel backend, e.g. via the doParallel package
require(doParallel)
registerDoParallel(4)

cv.lassoModel <- cv.glmnet(
  x=data.matrix(modellingLasso[,-1]),
  y=modellingLasso$SubType,
  standardize=TRUE,
  alpha=1.0,
  nfolds=10,
  family="multinomial",
  parallel=TRUE)

# plot cross-validated deviance vs. shrinkage parameter, λ (lambda)
plot(cv.lassoModel)

[Plot b: cross-validated deviance vs. log(λ)]


Now we can identify the best predictors via two choices of λ (lambda): lambda.min, the value of λ that minimises the cross-validated error, and lambda.1se, the largest λ within one standard error of that minimum.

idealLambda <- cv.lassoModel$lambda.min
idealLambda1se <- cv.lassoModel$lambda.1se
print(idealLambda); print(idealLambda1se)

# derive coefficients for each gene at each chosen λ
# (lambda.min and lambda.1se lie on the fitted λ path, so exact=TRUE,
# which in recent glmnet versions requires re-supplying x and y, is
# unnecessary here)
co <- coef(cv.lassoModel, s=idealLambda)
co

co.se <- coef(cv.lassoModel, s=idealLambda1se)
co.se

# identify non-zero predictors for each sub-type (for a multinomial fit,
# coef() returns a list of sparse coefficient matrices, one per class)
rownames(co.se$LuminalA)[which(co.se$LuminalA != 0)]
rownames(co.se$LuminalB)[which(co.se$LuminalB != 0)]
rownames(co.se$Her2)[which(co.se$Her2 != 0)]
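
As a quick sanity check, we can also classify samples directly from the cross-validated model. Note that this sketch predicts back onto the training data, so the apparent accuracy will be optimistic:

# predict the most probable sub-type for each sample at λ(1se)
predicted <- predict(
  cv.lassoModel,
  newx=data.matrix(modellingLasso[,-1]),
  s="lambda.1se",
  type="class")

# confusion matrix of predicted vs. observed sub-types
table(predicted, modellingLasso$SubType)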

For downstream applications, we could build a new model using only our best predictors. Note that glm() with a binomial family needs a binary outcome, so here we model one sub-type versus the rest (the genes are illustrative; substitute whichever predictors the lasso retained):

finalLasso <- glm(I(SubType == "LuminalA") ~ MYC + ERBB2 + CCND1,
  data=modellingLasso,
  family=binomial(link="logit"))
summary(finalLasso)

We can also perform ROC analysis:

# ROC analysis
require(pROC)

# use the same binary outcome (LuminalA vs. rest) as the final model
roc <- roc(modellingLasso$SubType == "LuminalA", fitted(finalLasso), smooth=FALSE)
plot.roc(
  roc,
  grid=TRUE,
  grid.lwd=2,
  col="royalblue",
  main="Lasso model")
text(0.3, 0.45,
  col="royalblue",
  paste("AUC (L95, AUC, U95):",
    paste(round(ci.auc(roc)[1:3], 3), sep=" ", collapse=","), sep=" "),
  cex=0.7)

roc



If the lasso penalty proves too stringent (for example, it shrinks too many genes to zero), you can relax the stringency by lowering α (alpha), which blends in a ridge penalty:

# elastic-net regression (here α=0.5)
cv.ElasticModel <- cv.glmnet(..., alpha=0.5, ...)

# ridge regression (α=0)
cv.RidgeModel <- cv.glmnet(..., alpha=0.00, ...)
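
Filled in with the same data and settings as above, the elastic-net call would look like this (a sketch; α can be tuned anywhere between 0 and 1):

cv.ElasticModel <- cv.glmnet(
  x=data.matrix(modellingLasso[,-1]),
  y=modellingLasso$SubType,
  standardize=TRUE,
  alpha=0.5,
  nfolds=10,
  family="multinomial")
plot(cv.ElasticModel)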

Also see Professor Hastie's own in-depth tutorial: Glmnet Vignette


