On Fri, Aug 10, 2012 at 7:39 AM, Henrik Singmann
<henrik.singm...@psychologie.uni-freiburg.de> wrote:
Dear Michael and list,
R in general tries hard to prohibit this behavior (i.e., including an
interaction but not the main effect). When removing a main effect and
leaving the interaction, the number of parameters is not reduced by
one (as
would be expected) but stays the same, at least when using
model.matrix:
This is because for factor or character variables, R automatically
dummy codes them (or uses whatever contrasts were set; dummies are the
default, but others can be set globally, which would then propagate
to the specific model.matrix() call).
d <- data.frame(A = rep(c("a1", "a2"), each = 50),
                B = c("b1", "b2"),
                value = rnorm(10))
ncol(model.matrix(~ A*B, data = d))
# [1] 4
ncol(model.matrix(~ A*B - A, data = d))
# [1] 4
I actually don't understand how R parameterizes the model in the
second case, but I am pretty sure someone here does and will be able
to explain.
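A sketch of what R actually does in that second case (my addition, inspecting the design matrices directly): the A:B term is re-parameterized to absorb the missing main effect of A, which is why the column count does not drop:

```r
# Two crossed two-level factors, as in the example above
d <- data.frame(A = rep(c("a1", "a2"), each = 50),
                B = c("b1", "b2"),
                value = rnorm(100))

# Full factorial: (Intercept), Aa2, Bb2, Aa2:Bb2
colnames(model.matrix(~ A * B, data = d))

# With A removed there is no standalone Aa2 column, but the A:B term
# now carries two columns instead of one, so ncol stays at 4
colnames(model.matrix(~ A * B - A, data = d))
```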
One way to think about this is the regression versus ANOVA style of
parameterization. Consider a few simple models fit to the mtcars data.
First, mpg ~ 1 + factor(am): here we have a traditional regression.
am is a two-level factor, so it gets one dummy code alongside the
intercept. The model matrix looks something like:
i a
1 0
1 0
1 0
1 1
1 1
1 1
What does that mean for the expected values? For a = 0,
yhat = b0 * 1 + b1 * 0 = b0, the intercept.
For a = 1,
yhat = b0 * 1 + b1 * 1 = b0 + b1.
What is the difference between these two? Merely b1. Hence under
this parameterization, b0 is the expected value for the group a = 0,
and b1 is the difference in expected values between the two groups a =
0 and a = 1.
coef(lm(mpg ~ 1 + factor(am), data = mtcars))
(Intercept) factor(am)1
17.147368 7.244939
We can find the expected values for each group. The first is just b0,
the intercept term. The second is the intercept plus b1, the single
other coefficient, which we can get by summing all the coefficients,
like so:
sum(coef(lm(mpg ~ 1 + factor(am), data = mtcars)))
[1] 24.39231
Now, if we explicitly force out the intercept, R uses a different
coding scheme. The intercept normally captures some reference or
baseline group, which every other group is compared to, but if we
force it out, the default design matrix is something like:
1 0
1 0
1 0
0 1
0 1
0 1
This looks very similar to what we had before, only now the two
columns are complementary. Let's think again about what our expected
values are. For a = 0,
yhat = b0 * 1 + b1 * 0 = b0
and for a = 1,
yhat = b0 * 0 + b1 * 1 = b1.
Now b0 encodes the expected value of the group a = 0 and b1 encodes
the expected value of the group a = 1. By default these coefficients
are tested against 0. There is no longer an explicit test of whether
the two groups differ from each other, because instead of encoding
differences, we have encoded the overall group expectations. In lm():
coef(lm(mpg ~ 0 + factor(am), data = mtcars))
factor(am)0 factor(am)1
17.14737 24.39231
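These two coefficients are nothing more than the per-group means of mpg, which a quick check (my addition) confirms:

```r
# Mean mpg within each transmission group; matches the two coefficients
# of lm(mpg ~ 0 + factor(am), data = mtcars)
tapply(mtcars$mpg, mtcars$am, mean)
#        0        1
# 17.14737 24.39231
```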
Both those numbers should be familiar from before. In this case, both
models include two parameters; the only difference is whether they
encode two group expectations, or one expectation and an expected
difference. These are equivalent models, and their likelihoods etc.
will be identical. Note, however, that if you suppress the intercept,
R uses a different formula for the R squared, so those will not look
identical (although if you extracted predicted values from both
models, yhat, you would find them identical).
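To make the equivalence concrete, here is a quick check (my addition) that the two parameterizations give identical fitted values even though the reported R squared differs:

```r
m1 <- lm(mpg ~ 1 + factor(am), data = mtcars)  # intercept + difference
m2 <- lm(mpg ~ 0 + factor(am), data = mtcars)  # two group means

# Identical predictions...
all.equal(unname(fitted(m1)), unname(fitted(m2)))
# [1] TRUE

# ...but summary() computes R squared against a different baseline
# when the intercept is suppressed
c(summary(m1)$r.squared, summary(m2)$r.squared)
```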
This concept generalizes to the interaction of categorical variables
without their main effects. Normally, the interaction terms are
expected differences from the reference group, but if the "main
effect" is dropped, then instead, the interactions become the expected
values of the various cells (if you think of a 2 x 2 table of group
membership).
For example, here it is as usual:
coef(lm(mpg ~ 1 + factor(vs) * factor(am), data = mtcars))
(Intercept) factor(vs)1 factor(am)1
15.050000 5.692857 4.700000
factor(vs)1:factor(am)1
2.928571
15.050000 + 5.692857
[1] 20.74286
15.050000 + 4.7
[1] 19.75
15.050000 + 5.692857 + 4.7 + 2.928571
[1] 28.37143
The intercept is the expected value for the group vs = 0 & am = 0;
then you get the difference between expectations for am = 0 & vs = 0
and am = 0 & vs = 1; then likewise for am; and finally, the additional
bit of being am = 1 and vs = 1. I show the by-hand calculations to
get the expected group values. Right now, we get significance tests
of whether these differences are statistically significantly different
from zero.
If we drop the main effects and intercept, we get:
coef(lm(mpg ~ 0 + factor(vs) : factor(am), data = mtcars))
factor(vs)0:factor(am)0 factor(vs)1:factor(am)0
factor(vs)0:factor(am)1
15.05000 20.74286 19.75000
factor(vs)1:factor(am)1
28.37143
which you can see directly gets all the group expectations.
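A direct check (my addition) that these four coefficients are exactly the cell means of the 2 x 2 table:

```r
# Cell means of mpg for each vs/am combination
means <- with(mtcars, tapply(mpg, list(vs, am), mean))
means
# Reading the matrix column-major gives 15.05000, 20.74286, 19.75000,
# 28.37143 -- the four coefficients above, in the same order
as.vector(means)
```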
Cheers,
Josh
I have asked a question about how to get around this "limitation" on
Stack Overflow, with helpful answers by Ben Bolker and Joshua Wiley:
http://stackoverflow.com/q/11335923/289572
(This functionality is now used in the function mixed() in my new
package afex for obtaining "type 3" p-values for mixed models.)
With few exceptions, I'm in the habit of not giving people type 3
p-values. People like them, but generally do not actually understand
them. I would also like to point out that I am in psychology, so yes,
I know psychology likes them. Further, I am a doctoral student, so it
is not as if I am so established that people bow to my every whim; it
is sometimes work to explain why I am not providing them, and I have
had discussions about this in various lab meetings and with advisors,
but my point is that it is possible. I would urge you to do the same
and take heart that there are others working for change---you are not
the only one, even if it feels like it at your university.
Cheers,
Henrik
On 10.08.2012 15:48, R. Michael Weylandt wrote:
On Fri, Aug 10, 2012 at 7:36 AM, ONKELINX, Thierry
<thierry.onkel...@inbo.be> wrote:
Dear Johan,
Why should it be complicated? You have a very simple model, thus a
very simple formula. Isn't that great?
Your formula matches the model, though Trust ~ Culture + Structure *
Speed_of_Integration is another option. The model fit is the same;
the only difference is the parameterization of the model.
Not quite, I don't think: A*B is shorthand for A + B + A:B, where A:B
is the second-order (interaction) term. The OP only had the
second-order term without the main effects.
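A quick demonstration (my addition, with hypothetical factor levels) that A*B is indeed expanded to the three terms before the design matrix is built:

```r
# A small 2 x 2 example
d <- data.frame(A = factor(c("a1", "a1", "a2", "a2")),
                B = factor(c("b1", "b2", "b1", "b2")))

# The shorthand and the explicit formula yield the same columns:
# (Intercept), Aa2, Bb2, Aa2:Bb2
identical(colnames(model.matrix(~ A * B, data = d)),
          colnames(model.matrix(~ A + B + A:B, data = d)))
# [1] TRUE
```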
OP: You almost certainly want A*B -- it's strange (though certainly
not impossible) to have good models where you only have interactions
but not main effects.
Cheers,
Michael
Best regards,
Thierry
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for
Nature
and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality
Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium
+ 32 2 525 02 51
+ 32 54 43 61 85
thierry.onkel...@inbo.be
www.inbo.be
To call in the statistician after the experiment is done may be
no more
than asking him to perform a post-mortem examination: he may be
able to say
what the experiment died of.
~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data.
~ Roger Brinner
The combination of some data and an aching desire for an answer
does not
ensure that a reasonable answer can be extracted from a given
body of data.
~ John Tukey
-----Original message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
On behalf of: Johan Haasnoot
Sent: Friday, August 10, 2012 9:15
To: r-help@r-project.org
Subject: [R] Simple question about formulae in R!?
Good morning reader,
I have encountered a, probably, simple issue with respect to the
*formula* of a *regression model* I want to use in my research. I'm
researching alliances as part of my study in Business Economics (focus:
Strategy) at the Vrije Universiteit in Amsterdam. In the research
model I use a moderating variable; I'm looking for confirmation or
help on the formulation of the model.
The research model consists of 2 explanatory variables, a moderating
variable, and 1 response variable. The first explanatory variable is
Culture, measured on a nominal scale, and the second is Structure of
the alliance, also measured on a nominal scale. The moderating
variable in the relation towards Trust is Speed of Integration,
measured on an interval scale.
The response variable is Trust, measured on an ordinal scale (highly
likely a 5-point Likert scale). Given the research model and the
measurement scales, I intend to use an ordered probit model, often
used in marketing models, to carry out the regression modelling. I
can't seem to find confirmation on how to write the formula. I have
been reading and studying R for a couple of weeks now, and read a lot
of books from the R series in the past, but I can't get a grasp on
this seemingly simple formula. I think I understand how to model
multinomial regression (using the R package MNP), how to execute a
Principal Components Analysis and an Exploratory Factor Analysis
(obviously I'm using a questionnaire to collect my data), but the
formula itself seems to be too simple.
I expect to use the interaction symbol: "Trust ~ Culture + Structure :
Speed_of_Integration" for the formula, but is it really this simple?
Can anybody confirm this, or help and advise me on this issue?
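As an illustration of how such a model could be written (a sketch added here, not part of the original message; the data are simulated stand-ins for the variables Johan describes, and the * operator includes the main effects alongside the moderation term, as suggested upthread):

```r
library(MASS)  # polr() fits ordered logistic/probit regression

set.seed(1)
n <- 200
dat <- data.frame(
  Culture              = factor(sample(c("c1", "c2", "c3"), n, replace = TRUE)),
  Structure            = factor(sample(c("s1", "s2"), n, replace = TRUE)),
  Speed_of_Integration = rnorm(n)
)
# Simulate a 5-point Likert-style response from a latent score so that
# polr() has an ordered factor to work with
latent <- as.numeric(dat$Structure) * dat$Speed_of_Integration + rnorm(n)
dat$Trust <- cut(latent, breaks = quantile(latent, seq(0, 1, 0.2)),
                 include.lowest = TRUE, ordered_result = TRUE)

# Structure * Speed_of_Integration expands to both main effects plus
# their interaction (the moderation term)
fit <- polr(Trust ~ Culture + Structure * Speed_of_Integration,
            data = dat, method = "probit")
coef(fit)
```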
Kind regards,
--
Kind regards,
Johan Haasnoot
De Haan & Martojo
Kerklaan 5
3645 ES Vinkeveen
Telefoon: 0297-266354
Mobiel: 06-81827665
Email: johan.haasn...@dh-m.nl
Website: www.dehaanenmartojo.nl
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Dipl. Psych. Henrik Singmann
PhD Student
Albert-Ludwigs-Universität Freiburg, Germany
http://www.psychologie.uni-freiburg.de/Members/singmann
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/