On 30-Jun-09 14:52:20, Marc Schwartz wrote:
On Jun 30, 2009, at 4:54 AM, Renzi Fabrizio wrote:
I would like to know how R computes the probability of an event
in a logistic model (P(y=1)) from the score s, linear combination
of x and beta.
I noticed that there are differences (small, less than e-16)
between
the fitting values automatically computed in the glm procedure
by R,
and the values "manually" computed by me applying the reverse
formula p=e^s/(1+e^s); moreover I noticed that the minimum value
of the fitting values in my estimation is 2.220446e-16, and there
are many observation with this probability (instead the minimum
value obtained by "manually" estimation is 2.872636e-152).
It would be helpful to see at least a subset of the output from
your
model and your manual computations so that we can at least visually
compare the results to see where the differences may be.
The model object returned from using glm() will contain both the
linear predictors on the link scale (model$linear.predictors) and
the fitted values (model$fitted.values). The latter can be accessed
using the fitted() extractor function.
To use an example, let's create a simple LR model using the infert
data set as referenced in ?glm.
model1 <- glm(case ~ spontaneous + induced, data = infert,
family = binomial())
model1
Call: glm(formula = case ~ spontaneous + induced,
family = binomial(), data = infert)
Coefficients:
(Intercept) spontaneous induced
-1.7079 1.1972 0.4181
Degrees of Freedom: 247 Total (i.e. Null); 245 Residual
Null Deviance: 316.2
Residual Deviance: 279.6 AIC: 285.6
# Get the coefficients
coef(model1)
(Intercept) spontaneous induced
-1.7078601 1.1972050 0.4181294
# get the linear predictor values
# log odds scale for binomial glm
head(model1$linear.predictors)
1 2 3 4 5
6
1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564
0.32560375
You can also get the above by using the coefficients and the model
matrix for comparison:
# the full set of 248
# coef(model1) %*% t(model.matrix(model1))
head(as.vector(coef(model1) %*% t(model.matrix(model1))))
[1] 1.10467939 -1.28973068 -0.87160128 -0.87160128 -0.09252564
0.32560375
# get fitted response values (predicted probs)
head(fitted(model1))
1 2 3 4 5 6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
We can also get the fitted values from the linear predictor values
by using:
LP <- model1$linear.predictors
head(exp(LP) / (1 + exp(LP)))
1 2 3 4 5 6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
You can also get the above by using the predict.glm() function with
type == "response". The default type of "link" will get you the
linear
predictor values as above. predict.glm() would typically be used to
generate predictions using the model on new data.
head(predict(model1, type = "response"))
1 2 3 4 5 6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
In glm.fit(), which is the workhorse function in glm(), the fitted
values returned in the model object are actually computed by using
the inverse link function for the family passed to glm():
binomial()$linkinv
function (eta)
.Call("logit_linkinv", eta, PACKAGE = "stats")
<environment: 0x171c8b10>
Thus:
head(binomial()$linkinv(model1$linear.predictors))
1 2 3 4 5 6
0.7511359 0.2158984 0.2949212 0.2949212 0.4768851 0.5806893
So those are the same values as we saw above using the other
methods.
So, all is consistent across the various methods.
Perhaps the above provides some insights for you into how R does
some
of these things and also to point out, as is frequently the case
with
R, there is more than one way to get the same result.
HTH,
Marc Schwartz
An interesting and informative reply, Marc; but I think it misses
the point of Fabrizio's query. I think Fabrizio's point is the
following:
set.seed(54321)
X <- sort(rnorm(100))
a0 <- 1.0 ; b0 <- 0.5
Y <- 1*(runif(100) < a0 + b0*X)
GLM <- glm(Y~X,family=binomial)
C <- GLM$coef
C
# (Intercept) X
# 2.726966 2.798429
a1 <- C[1] ; b1 <- C[2]
Fit0 <- GLM$fit
S <- a1 + b1*X
Fit1 <- exp(S)/(1+exp(S))
max(abs(Fit1 - Fit0))
# [1] 1.110223e-16
This discrepancy is of course, in magnitude, a typical "rounding
error". But the fact that it occurred indicates that when glm()
computed the fitted values it did not do so by using the fitted
coefficients GLM$coef, then creating the fitted score (linear
predictor) S (as above), then applyhing to this the "inverse link"
exp(S)/(1+exp(S)), since doing that would replicate the above
calculation and should yield identical results.
In fact, a bit more probing to GLM shows that there is already
a discrepancy at the "score" level:
S0 <- GLM$linear.predictors
max(abs(S0-S))
# [1] 8.881784e-16
so S0 has not been calculated by applying GLM$coef to X. Also, if
we apply the inverse link to S0, then:
max(abs(Fit0 - exp(S0)/(1+exp(S0))))
# [1] 1.110223e-16
which is the same discrepancy as between Fit1 and Fit0.
max(abs(Fit1 - exp(S0)/(1+exp(S0))))
# [1] 1.110223e-16
the same again!!
Hence, if I have understood him aright, Fabrizio's question.
Hoping this helps,
Ted.