Hello,

I am using R to train logistic regression models and save them to disk. I
then reload these saved objects and use predict.glm on them to make
predictions for single-row data frames that are generated in real time from
requests arriving at an HTTP server. The following code demonstrates the
sort of R calls I have in mind:

> cases <- 2000000
> data <- data.frame(x1=runif(cases), x2=runif(cases),
+                    y=sample(0:1, cases, replace=TRUE))
> lr1 <- glm(y~x1*x2,family=binomial,data=data)
> new_data <- data.frame(x1=0,x2=0)
> out <- predict(lr1,type="response",newdata=new_data)

The first thing I notice is that the models I am storing are very large
because I am using large data sets: by default, a fitted glm object seems to
keep the residuals, fitted values, and so on.

> object.size(lr1)
1056071320 bytes
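
As a rough check (this is just my own exploration, not something I have
profiled carefully), a fitted glm object is essentially a list, so the size
of its individual components can be inspected directly:

# Sketch: inspect how much each component of the fitted model contributes.
sizes <- sapply(lr1, function(component) as.numeric(object.size(component)))
head(sort(sizes, decreasing = TRUE))

I would expect the per-observation components (residuals, fitted values, the
qr decomposition, the stored data frame) to dominate, rather than the
coefficients themselves.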

My application does not need access to all of this information -- all I
really need is model$coefficients in order to make my predictions -- so I am
wondering whether there is some way to prevent this information from being
stored in the glm objects when they are created (or to remove it after the
models have been trained). I have discovered the model=FALSE, x=FALSE,
y=FALSE switches to glm(), and these seem to help somewhat, but perhaps
there is some other way of recording only the coefficients of the model and
other minimal details?

> lr2 <- glm(y~x1*x2,family=binomial,data=data,model=FALSE,x=FALSE,y=FALSE)
> object.size(lr2)
888065968 bytes
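
One approach I am considering (a sketch only -- the strip_glm name is mine,
and I have not verified exactly which components predict() needs in every
situation) is to NULL out the per-observation pieces after fitting, since
predict.glm with newdata appears to need little more than the terms,
coefficients, family and factor/contrast information:

# Sketch: strip bulky per-observation components from a fitted glm.
# My assumption: point predictions via predict(..., newdata=...) should
# still work, but se.fit, residual-based diagnostics, summary(), etc.
# will likely break.
strip_glm <- function(fit) {
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$weights <- NULL
  fit$prior.weights <- NULL
  fit$linear.predictors <- NULL
  fit$y <- NULL
  fit$model <- NULL
  fit$data <- NULL          # glm() keeps the full training data frame here
  fit$qr$qr <- NULL         # the n-by-p part of the QR decomposition
  fit
}

lr3 <- strip_glm(lr2)
object.size(lr3)
predict(lr3, type="response", newdata=new_data)  # sanity check

Is this sort of stripping safe, or is there a better-supported way to
achieve the same effect?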

Secondly, on data-sets of the scale I am using, predict.glm seems to be
taking a very long time to make its predictions.

> print(system.time(predict(lr1,type="response",newdata=new_data)))
   user  system elapsed
  0.136   0.040   0.175
> print(system.time(predict(lr2,type="response",newdata=new_data)))
   user  system elapsed
  0.109   0.013   0.121

This may be a swap-time issue, in which case it could potentially be solved
by addressing my first question above. However, given that I am essentially
asking R to compute

1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + b3*x1*x2)))

I can't see any reason why this request should take longer than a hundredth
or even a thousandth of a second. Obviously R is providing much more
functionality than I need in this particular instance, so my overall
question is: what is the best way to reduce the size of the data I have to
store with my GLM models, and to increase the speed at which I can use R to
generate predictions of this sort (i.e. for novel x1, x2 pairs)?
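
For comparison, here is the sort of back-of-the-envelope calculation I have
in mind (a sketch using the coefficient names that y ~ x1*x2 produces;
plogis() is just 1/(1 + exp(-x))):

# Compute the predicted probability directly from the stored coefficients.
b <- coef(lr1)    # named: (Intercept), x1, x2, x1:x2
x1 <- 0; x2 <- 0
eta <- b["(Intercept)"] + b["x1"]*x1 + b["x2"]*x2 + b["x1:x2"]*x1*x2
plogis(eta)       # should agree with predict(lr1, newdata=new_data, type="response")

which is why I would expect each prediction to take a tiny fraction of the
times reported above.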

I could obviously write a custom function/class that stores only the model
coefficients and computes predictions from them using the equation above,
but before I go down this route I wanted to get some advice from the R
community about whether there might be a better way to address this problem,
and/or whether I have missed something obvious. I would also prefer to avoid
writing custom code, since that means sacrificing the generality and power
of R, which could well be useful in my application down the line.
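
If custom code does turn out to be the way to go, the sort of thing I have
in mind is a minimal object that keeps only the terms and coefficients, so
that the formula machinery (factors, interactions, etc.) is retained without
any per-observation data. This is only a sketch under my own assumptions --
the function and object names are mine, and I have not tested it carefully:

# Sketch: a minimal "fit" that stores just enough to rebuild the design
# matrix for new data and apply the inverse link function.
make_minimal_fit <- function(fit) {
  list(terms        = delete.response(terms(fit)),
       coefficients = coef(fit),
       xlevels      = fit$xlevels,
       contrasts    = fit$contrasts,
       linkinv      = fit$family$linkinv)
}

predict_minimal <- function(mf, newdata) {
  frame <- model.frame(mf$terms, newdata, xlev = mf$xlevels)
  X <- model.matrix(mf$terms, frame, contrasts.arg = mf$contrasts)
  drop(mf$linkinv(X %*% mf$coefficients))
}

mf <- make_minimal_fit(lr1)
predict_minimal(mf, new_data)   # compare with predict(lr1, newdata=new_data, type="response")

But if R already provides something along these lines, I would much rather
use that.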

Many thanks in advance for your assistance,

Dan.

