Hello,

I am trying to test different parameters of the gbm function in R in order to make predictions with my data. I have a huge table of 79,866 rows and 1,586 columns, where the columns are counts for motifs in the DNA and the rows indicate different regions/positions in the DNA and the organism to which the counts belong. There are only 3 organisms, but the counts are separated by position (peakid).

The data looks like this:
chrII:11889760_11890077 worm 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
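For reference, this is roughly how the column layout can be checked (I'm assuming here that the first column is the peak ID, the second is the organism label, and the rest are integer counts):

str(motifs.table[, 1:5])     # peakid, organism, then motif count columns
head(motifs.table$organism)  # the response used in the formula below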

Since I have memory problems that I don't know how to solve yet (because of the size of the table), I am using a subset of the data:

library(rsample)  # initial_split(), training()
library(gbm)

motifs.table.sub <- motifs.table[1:1000, 1:1000]

set.seed(123)
motifs_split.sub <- initial_split(motifs.table.sub, prop = .7)
motifs_train.sub <- training(motifs_split.sub)
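
A quick sanity check of the split (with prop = .7 on 1,000 rows, the training set should hold roughly 700 rows):

nrow(motifs_train.sub)   # should be ~700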

I create a table with the different parameter combinations to test:

hyper_grid <- expand.grid( 
  shrinkage = c(.01, .1, .3), 
  interaction.depth = c(1, 3, 5),
  n.minobsinnode = c(5, 10, 15), 
  bag.fraction = c(.65, .8, 1),
  optimal_trees = 0,
  min_RMSE = 0)
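
This grid covers 3 x 3 x 3 x 3 = 81 parameter combinations, one per row:

nrow(hyper_grid)   # 81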

Then I randomize the training data:

random_index.sub <- sample(1:nrow(motifs_train.sub), nrow(motifs_train.sub))
random_motifs_train.sub <- motifs_train.sub[random_index.sub, ]

Then I test the different parameter combinations with 1,000 trees:

for(i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = organism ~ .,
    distribution = "gaussian", # default
    data = random_motifs_train.sub,
    n.trees = 1000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = 0.70,
    n.cores = NULL,  # let gbm detect the number of cores
    verbose = TRUE)
  print(head(gbm.tune$valid.error))
}
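
For completeness, the tutorial's loop also records the best iteration and its RMSE back into hyper_grid; inside the loop, after the fit, it does roughly this:

  # record the results of this parameter combination
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
  hyper_grid$min_RMSE[i]      <- sqrt(min(gbm.tune$valid.error))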

The problem is that the model never improves:

Iter TrainDeviance ValidDeviance StepSize Improve
   1           nan           nan   0.0100     nan
   2           nan           nan   0.0100     nan
   3           nan           nan   0.0100     nan
   4           nan           nan   0.0100     nan
   5           nan           nan   0.0100     nan
   6           nan           nan   0.0100     nan
   7           nan           nan   0.0100     nan
   8           nan           nan   0.0100     nan
   9           nan           nan   0.0100     nan
  10           nan           nan   0.0100     nan

And values like valid.error are never calculated; they remain NA. I tried changing the size of the data subset, and the same happens with all the parameter combinations I am testing. My data table is huge and has a lot of zeros. I thought of removing motifs with low counts, but I don't think that would help, since only 127 motifs out of 1,586 have fewer than 10 counts.
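
The filtering I had in mind would be something like this (just a sketch; it assumes the first two columns are the peak ID and the organism label, and the rest are the motif counts):

# total count per motif across all peaks (count columns only)
motif.counts <- colSums(motifs.table[, -(1:2)])
# keep only motifs with at least 10 counts overall
motifs.table.filt <- cbind(motifs.table[, 1:2],
                           motifs.table[, -(1:2)][, motif.counts >= 10])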
Any ideas about what I am doing wrong?

Thanks!

PS: I am following this tutorial: uc-r.github.io/gbm_regression
Edit: apparently TrainDeviance is nan if train.fraction is not < 1, but that is not my case: stackoverflow.com/questions/23530165/gradient-boosting-using-gbm-in-r-with-distribution-bernoulli


