Re: [R] predict.glm -> which class does it predict?

Marc Schwartz Fri, 10 Jul 2009 09:32:20 -0700

On Jul 10, 2009, at 9:46 AM, Peter Schüffler wrote:

Hi,
I have a question about logistic regression in R.
Suppose I have a small list of proteins P1, P2, P3 that predict a two-class target T, say cancer/noncancer. Let's further say I know that I can build a simple logistic regression model in R
model <- glm(T ~ ., data=d.f(Y), family=binomial) (Y is the dataset of the proteins).
This works fine. T is a factor with levels cancer, noncancer. Proteins are numeric.
Now, I want to use predict.glm to predict new data.
predict(model, newdata=testsamples, type="response") (testsamples is a small set of new samples).
The result is a vector of the probabilities for each sample in testsamples. But the probability of WHAT? Of belonging to the first level in T? Of belonging to the second level in T?
Is the following expression
factor(predict(model, newdata=testsamples, type="response") >= 0.5)
TRUE when the new sample is classified as cancer, or when it's classified as noncancer? And why not the other way around?
Thank you,

Peter


As per the Details section of ?glm:

A typical predictor has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. ***For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)*** or as a two-column matrix with the columns giving the numbers of successes and failures. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with any duplicates removed.

So, given your description above, you are predicting "noncancer"... that is, you are predicting the probability of the second level of the factor ("success"), given the covariates.
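
A quick way to convince yourself of this is to fit the same model twice, once with the factor response and once with an explicit 0/1 response in which 1 marks the second level. This is only a sketch on simulated data; the names d, status and p1 are made up for illustration and are not from the original post.

```r
# Simulated data: higher p1 makes "cancer" more likely.
set.seed(1)
n <- 200
p1 <- rnorm(n)
status <- factor(ifelse(runif(n) < plogis(2 * p1), "cancer", "noncancer"))
levels(status)  # alphabetical order, so "cancer" is the first level

d <- data.frame(status, p1)

# Factor response: the first level is "failure", the second is "success".
fit <- glm(status ~ p1, data = d, family = binomial)

# The same model with an explicit 0/1 response, 1 = second level.
fit01 <- glm(as.integer(status == levels(status)[2]) ~ p1,
             data = d, family = binomial)

# The fitted probabilities agree, so type = "response" is P(second level).
same_fit <- all.equal(unname(fitted(fit)), unname(fitted(fit01)))
same_fit

# The slope on p1 should come out negative here, because the model
# describes the probability of "noncancer", not of "cancer".
coef(fit)["p1"]
```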


If you want to predict "cancer", alter the factor levels thusly:

  T <- factor(T, levels = c("noncancer", "cancer"))

By default, R sorts the factor levels alphabetically, so "cancer" would be first.
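
To see the default ordering and the effect of reordering, here is a small sketch with a made-up vector (the variable names are hypothetical). relevel() is an alternative to rebuilding the factor; it moves the named reference level to the front.

```r
status <- c("noncancer", "cancer", "cancer", "noncancer")

# Default: levels are sorted, so "cancer" comes first.
f_default <- factor(status)
levels(f_default)

# Explicit ordering: "noncancer" first, so "cancer" becomes the
# second level, i.e. the "success" that a binomial glm models.
f_event <- factor(status, levels = c("noncancer", "cancer"))
levels(f_event)

# Same result starting from the existing factor.
f_event2 <- relevel(f_default, ref = "noncancer")
levels(f_event2)

# The 0/1 coding this implies: 1 wherever the value is not the first level.
as.integer(f_event != levels(f_event)[1])
```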

Think of it in terms of using a 0,1 integer code for absence, presence, where you are predicting the probability of a '1', or the presence of the event or characteristic of interest.
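
On the ">= 0.5" part of the question: rather than leaving the result as TRUE/FALSE, it may be clearer to map the probabilities back to the level names. The following is a sketch on simulated data, assuming a single numeric predictor; train, new, p1 and status are hypothetical names, and the 0.5 cutoff is simply the one from the question.

```r
set.seed(2)
train <- data.frame(p1 = rnorm(100))
# Higher p1 makes "cancer" more likely; "cancer" is set as the second level.
train$status <- factor(
  ifelse(runif(100) < plogis(1.5 * train$p1), "cancer", "noncancer"),
  levels = c("noncancer", "cancer")
)

fit <- glm(status ~ p1, data = train, family = binomial)

# Two new samples, one far to each side of the training data.
new <- data.frame(p1 = c(-3, 3))

# Probability of the second level of status, here "cancer".
prob <- predict(fit, newdata = new, type = "response")

# TRUE in (prob >= 0.5) therefore means "classified as the second level".
lev <- levels(train$status)
pred <- factor(ifelse(prob >= 0.5, lev[2], lev[1]), levels = lev)
data.frame(new, prob, pred)
```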


BTW, using 'T' as the name of the response vector is not a good habit:

> T
[1] TRUE

'T' is shorthand for the built-in R constant TRUE. R is generally smart enough to know the difference, but it is better to avoid getting into trouble by not using it.
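
To illustrate the kind of trouble meant here: T is an ordinary variable that happens to hold TRUE, so it can be masked by an assignment, whereas TRUE itself is a reserved word. A short sketch (the helper variable names are made up):

```r
# Masking T with another value changes what code relying on T sees.
T <- 0
masked_is_true <- isTRUE(T)        # no longer TRUE
branch <- if (T) "yes" else "no"   # takes the "no" branch, since T is 0

# Removing the masking variable makes the built-in binding visible again.
rm(T)
restored_is_true <- isTRUE(T)

# TRUE cannot be reassigned; the attempt raises an error.
assign_true <- tryCatch({ TRUE <- 0; "assigned" },
                        error = function(e) "error")

c(masked = masked_is_true, restored = restored_is_true)
branch
assign_true
```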


HTH,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
