On Jun 30, 2009, at 12:54 PM, Ted Harding wrote:


On 30-Jun-09 17:41:20, Marc Schwartz wrote:
On Jun 30, 2009, at 10:44 AM, Ted Harding wrote:


On 30-Jun-09 14:52:20, Marc Schwartz wrote:
On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:

I would like to know how R computes the probability of an event
in a logistic model (P(y=1)) from the score s, linear combination
of x and beta.

I noticed that there are differences (small, less than 1e-16) between
the fitted values automatically computed by the glm procedure in R and
the values "manually" computed by me applying the inverse-link formula
p = e^s/(1+e^s); moreover, I noticed that the minimum of the fitted
values from glm is 2.220446e-16, and there are many observations with
this probability (whereas the minimum value obtained by the "manual"
estimation is 2.872636e-152).

It would be helpful to see at least a subset of the output from your
model and your manual computations, so that we can visually compare
the results and see where the differences may arise.

The model object returned from using glm() will contain both the
linear predictors on the link scale (model$linear.predictors) and
the fitted values (model$fitted.values). The latter can be accessed
using the fitted() extractor function.

To use an example, let's create a simple LR model using the infert
data set as referenced in ?glm.

model1 <- glm(case ~ spontaneous + induced, data = infert,
            family = binomial())

model1
Call:  glm(formula = case ~ spontaneous + induced,
           family = binomial(), data = infert)

Coefficients:
(Intercept)  spontaneous      induced
   -1.7079       1.1972       0.4181

Degrees of Freedom: 247 Total (i.e. Null);  245 Residual
Null Deviance:            316.2
Residual Deviance: 279.6      AIC: 285.6

# Get the coefficients
coef(model1)
(Intercept) spontaneous     induced
-1.7078601   1.1972050   0.4181294

# get the linear predictor values
# log odds scale for binomial glm
head(model1$linear.predictors)
          1           2           3           4           5           6 
 1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375

You can also get the above by using the coefficients and the model
matrix for comparison:
# the full set of 248
# coef(model1) %*% t(model.matrix(model1))
head(as.vector(coef(model1) %*% t(model.matrix(model1))))
[1]  1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564  0.32560375

# get fitted response values (predicted probs)
head(fitted(model1))
       1         2         3         4         5         6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893

We can also get the fitted values from the linear predictor values
by using:

LP <- model1$linear.predictors
head(exp(LP) / (1 + exp(LP)))
      1         2         3         4         5         6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893

You can also get the above by using the predict.glm() function with
type = "response". The default type of "link" will get you the linear
predictor values as above. predict.glm() would typically be used to
generate predictions from the model on new data.

head(predict(model1, type = "response"))
       1         2         3         4         5         6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
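
For example, to score hypothetical new covariate values (just a sketch;
'newobs' is made up here and is not part of the infert data):

# hypothetical new observations
newobs <- data.frame(spontaneous = c(0, 1, 2), induced = c(0, 1, 0))
predict(model1, newdata = newobs, type = "link")      # linear predictor (log odds)
predict(model1, newdata = newobs, type = "response")  # predicted probabilities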

In glm.fit(), which is the workhorse function in glm(), the fitted
values returned in the model object are actually computed by using
the inverse link function for the family passed to glm():

binomial()$linkinv
function (eta)
.Call("logit_linkinv", eta, PACKAGE = "stats")
<environment: 0x171c8b10>

Thus:
head(binomial()$linkinv(model1$linear.predictors))
       1         2         3         4         5         6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
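
Equivalently, since the inverse logit is just the standard logistic
distribution function, plogis() should give the same values, at least to
within floating point precision:

head(plogis(model1$linear.predictors))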

Those are the same values as we saw above using the other methods, so
all is consistent across the various approaches.

Perhaps the above provides some insight into how R does some of these
things, and also points out that, as is frequently the case with R,
there is more than one way to get the same result.

HTH,
Marc Schwartz

An interesting and informative reply, Marc; but I think it misses
the point of Fabrizio's query. I think Fabrizio's point is the
following:

set.seed(54321)
X <- sort(rnorm(100))
a0 <- 1.0 ; b0 <- 0.5
Y <- 1*(runif(100) < a0 + b0*X)
GLM <- glm(Y~X,family=binomial)
C <- GLM$coef
C
# (Intercept)           X
#    2.726966    2.798429
a1 <- C[1] ; b1 <- C[2]
Fit0 <- GLM$fit
S <- a1 + b1*X
Fit1 <- exp(S)/(1+exp(S))
max(abs(Fit1 - Fit0))
# [1] 1.110223e-16

This discrepancy is of course, in magnitude, a typical "rounding
error". But the fact that it occurred indicates that when glm()
computed the fitted values it did not do so by using the fitted
coefficients GLM$coef, then creating the fitted score (linear
predictor) S (as above), then applying to this the "inverse link"
exp(S)/(1+exp(S)), since doing that would replicate the above
calculation and should yield identical results.

In fact, a bit more probing of GLM shows that there is already
a discrepancy at the "score" level:

S0 <- GLM$linear.predictors
max(abs(S0-S))
# [1] 8.881784e-16

so S0 has not been calculated by applying GLM$coef to X. Also, if
we apply the inverse link to S0, then:

max(abs(Fit0 - exp(S0)/(1+exp(S0))))
# [1] 1.110223e-16

which is the same discrepancy as between Fit1 and Fit0.

max(abs(Fit1 - exp(S0)/(1+exp(S0))))
# [1] 1.110223e-16

the same again!!

Hence, if I have understood him aright, Fabrizio's question.

Hoping this helps,
Ted.


Ted,

What OS and version of R are you using?

I am on OSX 10.5.7 and using 32 bit:

  R version 2.9.1 Patched (2009-06-30 r48877)

The reason that I ask is that using your data and model, I get:

max(abs(Fit1 - Fit0))
[1] 0

max(abs(S0-S))
[1] 0

max(abs(Fit0 - exp(S0)/(1+exp(S0))))
[1] 0


Truth be told, I was initially leaning in the direction of a rounding
error, but first wanted to be sure that there was not some underlying
methodological error that Fabrizio was encountering. Hence the focus
of my reply was to present a level of consistency in getting the
results using multiple methods.

When I saw your post, I started thinking that my example, which did
not have a numeric (e.g., non-integer) covariate, might have been too
simple and missed introducing an additional source of rounding error,
but then I could not replicate your findings either...

Thanks,
Marc

I'm using Debian Linux:
Linux deb 2.6.18-6-686 #1 SMP Tue May 5 00:40:20 UTC 2009 i686 GNU/Linux
(32-bit as far as I know) and R version:
 version.string R version 2.9.0 (2009-04-17)

After posting, I began to suspect that the compiled code which does
the actual computations may be working to higher precision internally,
but rounding to R's:

 $double.eps
 [1] 2.220446e-16

 $double.neg.eps
 [1] 1.110223e-16

before attaching the results to the return-list for glm() (compare
the numerical discrepancies which I found above).

However, that is only surmise!
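
For what it is worth, one way to probe that surmise (just a sketch, not
a definitive diagnosis) is to feed the family's inverse link some
extreme linear predictor values and compare the results against a
"manual" inverse logit and against .Machine$double.eps:

eta <- c(-10, -20, -50, -400)
cbind(linkinv = binomial()$linkinv(eta),
      manual  = exp(eta) / (1 + exp(eta)),
      eps     = .Machine$double.eps)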
Ted.



Well... barring a correction from someone with more low-level insight, I would tend to suspect that the difference in our findings may be attributed to some interaction of hardware, OS, compiler and BLAS, perhaps weighted more toward the latter two. When I build from source, I do use the Apple vecLib BLAS (the default), as opposed to the R BLAS.

To distill this down to the key issue, all else being equal (within .Machine$double.eps), I suspect that this leads us back to rounding as being the source of Fabrizio's observations.
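
As an aside, for comparisons of this kind, all.equal() with its default
tolerance of roughly 1.5e-8 (sqrt(.Machine$double.eps)) would regard the
two sets of fitted values as equal; max(abs(...)) as used above is the
stricter, exact test:

all.equal(as.vector(Fit1), as.vector(Fit0))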

Marc

