Dear Community,

In our current research we are trying to fit generalized additive models to a 
large dataset, using the mgcv package in R.

Our dataset contains about 22 million records with fewer than 20 risk factors 
per observation, so in our case n >> p. The dataset covers the period 2006 to 
2011, and we analyse both the complete dataset and datasets in which a single 
year is left out. We do the latter to assess the robustness of the results. We 
understand that k-fold cross-validation may seem more appropriate, but our 
approach is closer to what is done in practice (how will one additional year of 
information affect the estimates?).
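
To be concrete, the leave-one-year-out fits are simply refits of the same model 
on subsets of the data. Schematically (the column name year and the wrapper 
fnEstimateModel_bam stand in for our actual code, so treat the details as 
illustrative):

    for (yr in 2006:2011) {
        dat_sub <- dat[dat$year != yr, ]          # drop one calendar year
        fit_sub <- fnEstimateModel_bam(dat_sub)   # refit the same bam() specification
    }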

We use the function bam, as advocated in Wood et al. (2017), with the following 
options: bam(..., discrete=TRUE, chunk.size=10000, gc.level=1). We run these 
analyses on a computer cluster (see 
https://userinfo.surfsara.nl/systems/lisa/description for details), and each 
job is allocated to a node within the cluster. A node has at least 16 cores and 
64 GB of memory.
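
For reference, a minimal sketch of the kind of call we make (the response, 
smooth terms and family below are placeholders, not our actual specification; 
only the discrete, chunk.size and gc.level settings match what we actually 
use):

    library(mgcv)

    fit <- bam(y ~ s(x1) + s(x2) + factor(year),  # placeholder formula
               data       = dat,                  # ~22 million rows
               family     = poisson(),            # placeholder family
               discrete   = TRUE,                 # discretised covariates, Wood et al. (2017)
               chunk.size = 10000,
               gc.level   = 1)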

We had expected 64 GB of memory to be sufficient for these analyses, especially 
since the bam function is designed specifically for large datasets. However, 
when applying it to the different datasets described above, with different 
regression specifications (different risk factors included in the linear 
predictor), we sometimes obtain errors of the following form.

Error in XWyd(G$Xd, w, z, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop, ar.stop,  :
  'Calloc' could not allocate memory (22624897 of 8 bytes)
Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWyd
Execution halted
Warning message:
system call failed: Cannot allocate memory

Error in Xbd(G$Xd, coef, G$kd, G$ks, G$ts, G$dt, G$v, G$qc, G$drop) :
  'Calloc' could not allocate memory (18590685 of 8 bytes)
Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> Xbd
Execution halted
Warning message:
system call failed: Cannot allocate memory

Error: cannot allocate vector of size 1.7 Gb
Timing stopped at: 2 0.556 4.831

Error in system.time(oo <- .C(C_XWXd0, XWX = as.double(rep(0, (pt + nt)^2)),  :
  'Calloc' could not allocate memory (55315650 of 24 bytes)
Calls: fnEstimateModel_bam -> bam -> bgam.fitd -> XWXd -> system.time -> .C
Timing stopped at: 1.056 1.396 2.459
Execution halted
Warning message:
system call failed: Cannot allocate memory

The errors seem to arise at different stages of the optimization process. We 
have checked whether the errors disappear when different settings are used 
(different chunk.size, different gc.level), but this does not resolve the 
problem. Moreover, the errors occur on different datasets depending on the 
settings, and even with identical settings an error that occurred on dataset X 
in one run does not necessarily occur on dataset X in another run. When using 
the discrete=TRUE option, the optimization can be parallelized, but we have 
chosen not to employ this feature, to ensure memory does not have to be shared 
between parallel processes.
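
For completeness, the parallel variant we have not used would look roughly as 
follows (nthreads is the bam argument that enables threading when 
discrete=TRUE; formula and data are again placeholders):

    fit <- bam(y ~ s(x1) + s(x2), data = dat,
               discrete   = TRUE,
               nthreads   = 16,    # number of threads; 16 matches the cores on a node
               chunk.size = 10000,
               gc.level   = 1)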

Naturally, I cannot share our dataset with you, which makes the problem 
difficult to analyse. However, based on your collective knowledge, could you 
point us to where the problem may lie? Is it something within the C code used 
by the package (as the last error seems to indicate), or is it related to the 
computer cluster?

Any help or insights are much appreciated.

Kind regards,

Frank
