Hello,
I am trying to test different parameters of the gbm function in R in order to make predictions with my data. I have a huge table of 79,866 rows and 1,586 columns, where the columns are counts of DNA motifs and the rows indicate different regions/positions in the DNA plus the organism the counts belong to. There are only 3 organisms, but the counts are separated by position (peakid).
The data looks like this:
chrII:11889760_11890077 worm 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...
Since I have memory problems that I don't know how to solve yet (because of the size of the table), I am working on a subset of the data:
motifs.table.sub <- motifs.table[1:1000, 1:1000]
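(As a quick sanity check on the memory side, I compare how big the objects actually are; object.size() is base R, so this should work on any plain data frame:)
format(object.size(motifs.table), units = "Mb")     # full table
format(object.size(motifs.table.sub), units = "Mb") # 1000 x 1000 subset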
library(rsample)  # initial_split() and training() come from the rsample package
set.seed(123)
motifs_split.sub <- initial_split(motifs.table.sub, prop = .7)
motifs_train.sub <- training(motifs_split.sub)
I create a table with the different parameter combinations to test:
hyper_grid <- expand.grid(
  shrinkage = c(.01, .1, .3),
  interaction.depth = c(1, 3, 5),
  n.minobsinnode = c(5, 10, 15),
  bag.fraction = c(.65, .8, 1),
  optimal_trees = 0,
  min_RMSE = 0
)
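That gives 3^4 = 81 parameter combinations to loop over:
nrow(hyper_grid)
## [1] 81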
Then I randomize the training data:
random_index.sub <- sample(1:nrow(motifs_train.sub), nrow(motifs_train.sub))
random_motifs_train.sub <- motifs_train.sub[random_index.sub, ]
Then I test the different parameter combinations with 1000 trees:
library(gbm)  # gbm() comes from the gbm package

for(i in 1:nrow(hyper_grid)) {
  set.seed(123)
  gbm.tune <- gbm(
    formula = organism ~ .,
    distribution = "gaussian",  # default
    data = random_motifs_train.sub,
    n.trees = 1000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = 0.70,
    n.cores = NULL,  # NULL lets gbm choose the number of cores
    verbose = TRUE)
  print(head(gbm.tune$valid.error))
}
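(In the tutorial, the loop body also fills in the optimal_trees and min_RMSE columns of hyper_grid, with lines roughly like these placed inside the loop; I left them out here because the NaN problem appears before that step:)
hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))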
The problem is that the model never improves:
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0100 nan
2 nan nan 0.0100 nan
3 nan nan 0.0100 nan
4 nan nan 0.0100 nan
5 nan nan 0.0100 nan
6 nan nan 0.0100 nan
7 nan nan 0.0100 nan
8 nan nan 0.0100 nan
9 nan nan 0.0100 nan
10 nan nan 0.0100 nan
Values like valid.error are never calculated either; they remain NA. I tried changing the size of the data subset and the same happens with every parameter combination I test. My data table is huge and contains a lot of zeros, so I thought of removing motifs with low counts, but I don't think that would help, since only 127 of the 1,586 motifs have fewer than 10 counts.
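(For reference, this is how I count the low-count motifs; a sketch assuming the motif counts start after the peakid and organism columns:)
low.motifs <- colSums(motifs.table[, -(1:2)]) < 10  # motifs with total count below 10
sum(low.motifs)
## [1] 127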
Any ideas about what I am doing wrong?
Thanks!
PS: I am following this tutorial: uc-r.github.io/gbm_regression
Edit: apparently TrainDeviance is nan when train.fraction is not < 1, but that is not my case (I set train.fraction = 0.70): stackoverflow.com/questions/23530165/gradient-boosting-using-gbm-in-r-with-distribution-bernoulli