Hi Johan,
 
I happen to agree with most of what you say - at least in principle...
Let me begin by accentuating the positive (as Bing may have said):

* R help files could be much improved - agreed.  No question.  But let's
look at your example.  ?anova points out (in parentheses, admittedly)
that anova is *generic*, so what it does for any particular object
depends on its class.  A call to 

> methods(anova)

(or just looking at the index of the HTML help window) will show a lot
of methods, depending on what you have on the search path, but will
almost certainly include methods for "lm", "glm", "mlm", ...  The
relevant one we all have in mind here is anova.lm.  Here's where the
nitty gritty of the help information really happens.  It says

"Specifying a single object gives a sequential analysis of variance
table for that fit. That is, the reductions in the residual sum of
squares as each term of the formula is added in turn are given in as the
rows of a table, plus the residual sum of squares."

To me that seems accurate, precise and succinct.  It's not obvious to
newbies how to go about finding this, I know, but learning how to do
that is an important part of learning R.  Perhaps what we really need is
more "meta"-help information - help on learning how to use help.  Well,
go to it...it's open-source software built on the contributions of its
users.
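
A minimal sketch of what "sequential" means in that help text, using the
built-in mtcars data (the choice of data set and of the variables wt and
hp is an assumption made purely for illustration):

```r
# List the anova methods currently available on the search path.
print(methods(anova))

# Same model, terms entered in two different orders.
fit_a <- lm(mpg ~ wt + hp, data = mtcars)
fit_b <- lm(mpg ~ hp + wt, data = mtcars)

tab_a <- anova(fit_a)   # rows: wt, then hp given wt, then residuals
tab_b <- anova(fit_b)   # rows: hp, then wt given hp, then residuals
print(tab_a)
print(tab_b)

# The first row of tab_a is the reduction in residual sum of squares
# obtained by adding wt to the intercept-only model.
null_fit <- lm(mpg ~ 1, data = mtcars)
wt_only  <- lm(mpg ~ wt, data = mtcars)
ss_wt_first <- deviance(null_fit) - deviance(wt_only)
print(ss_wt_first)
```

The residual line is the same in both tables; only the split of the
explained sum of squares between the terms depends on the order.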

* anova (more precisely, anova.lm) should have options for "sequential"
(the default and present case) or "conditional" analyses.  Well,
perhaps.  My view is that in R you present a set of fairly simple tools
for people to do a job, and that people should learn the job along with
the tools, rather than tinkering with the tools to fit in with some
alternative model of the job.  And there are other tools if you want
them.  For example

- dropterm(..., test = "F") from the MASS package will (surprisingly
enough!) give you almost the conditional analysis that people so crave.
It politely refuses to provide the non-marginal terms, however, that the
authors of the supporting book find potentially so misleading.

- Anova(...) in the 'car' package provides pretty well the whole
enchilada, and it too is supported by an excellent book, by John Fox, so
the potential for damage is minimised.
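
A rough sketch of the two tools just mentioned, again on the built-in
mtcars data (the model formula is an assumption for illustration, and the
call to car::Anova is only attempted if 'car' happens to be installed):

```r
library(MASS)   # recommended package, shipped with standard R distributions

fit <- lm(mpg ~ wt * hp, data = mtcars)

# dropterm() respects marginality: with wt:hp in the model, only the
# interaction is offered for deletion, not the wt or hp terms themselves.
dt <- dropterm(fit, test = "F")
print(dt)

# drop1() from base R offers the same terms for this model.
print(drop1(fit, test = "F"))

# car::Anova() is called only if the 'car' package is available.
if (requireNamespace("car", quietly = TRUE)) {
  print(car::Anova(fit, type = 2))
}
```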

I'm not suggesting the tools in R are either perfect or immutable.  Nor
are they complete.  But in changing them I think we need to be guided by
criteria other than making it easy for people migrating to R
from other systems.  That's important, of course, but in my view it's
better to handle that issue by special support documentation, such as
http://oit.utk.edu/scc/RforSAS&SPSSusers.pdf, for example. In R we
should be concerned with providing simple, elegant and effective tools
for people with some understanding of what they are doing, or who are
prepared to use R along with some other support, such as a book.  And
there are now plenty of books either available or soon to become so, at
all kinds of levels: Wiley, Chapman and Hall and Springer, to name just
three, are all now seriously into providing books that use R as the
supporting software, and there is even a lot of free material available
as well.

I could go on and on about this, but that's certainly more than enough
for now.  Here endeth the philosophy.

Bill Venables.



________________________________

        From: Johan Jackson [mailto:[EMAIL PROTECTED] 
        Sent: Saturday, 9 February 2008 7:26 AM
        To: Venables, Bill (CMIS, Cleveland)
        Cc: [EMAIL PROTECTED]; r-help@r-project.org
        Subject: Re: [R] a kinder view of Type III SS
        
        
        I feel out of my league responding to a discussion among such an
august group of statisticians. But I think I can maybe provide some
insight from someone who migrated from SPSS into R and learned R on my
own.
        
        I must say that I found it quite confusing to understand why my
ANOVA results in R were completely different from those given by SPSS.
In retrospect it is obvious, as it must seem to everyone who has R
experience, but for people not versed in R, or where to look for help,
these types of issues can be extremely frustrating. I figured out the
issue eventually, but more through dumb luck and persistence than
through any help from R.
        
        From the point of a newbie in R (and let's face it, R is taking
over the statistical landscape, which is great, but it also means that
from here on out increasing numbers of primary researchers are going to
be migrating into R), I think R could make things easier on new users.
This is the best example I can think of. When you type in ?anova,  the
only thing remotely relevant to this confusing issue is: "When given a
single argument it produces a table which tests whether the model terms
are significant." Not at all helpful given that we do, after all, for
better or for worse, live in a world in which "whether the model terms
are significant" can mean three different things, depending on whether
type I, II, or III sums of squares were used!
        
        Is it possible to forward a suggestion? Can the anova function
include an option for tests to be "sequential" or "conditional", with
default being "sequential"? I suspect the powers that be will not like
this. So be it. At the very least, in the help page for anova, could you
include just a brief description that the sums of squares produced are
"sequential, corresponding to what SAS and other statistical programs
call "type I" sums of squares. If you are interested in conditional, or
"type III", sums of squares, packages can be installed that allow for
this." This might really help out earnest and eager yet confused R
newbies. Barring this, perhaps the final option is to remove the ability
for anova to take in a single argument at all - you must include two
models which are to be compared.
        
        Best,
        
        JJ
        
        
        
        
        
        
        
        
        On Feb 7, 2008 6:35 PM, <[EMAIL PROTECTED]> wrote:
        

                Frank Harrell has already added some comments, with which I agree.

                As one of the people who did become rather heated in the
                discussion, let me add a few points in a (fairly) calm and
                considered way.
                
                1. The primary objection I had to all of this is that it
                encourages people to think of analysis of variance in such a
                simplistic way, i.e. in terms of 'sums of squares'.  This leads
                to silly questions like "Well, if you don't like Type III sums
                of squares, what type do you like?" as if the concept of
                multiple "types" of sum of squares had any meaning.  There is
                only one type and it represents a squared distance in the
                sample space.  The real question is how to interpret the vector
                in sample space of which any particular sum of squares is a
                squared length.  For that you need to be very clear about both
                the null hypothesis implicitly being tested, and the outer
                hypothesis being assumed.  This issue can become very subtle
                when interactions are involved, as you point out.
                
                2. Most of the heat came from resentment that SAS should
                insinuate its way into the statistical community in such a
                Microsoft-like way, i.e. trying to make both its black-box
                software and defective terminology some kind of industry
                standard.
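
One way to make point 1 concrete is the small sketch below. It assumes
the built-in mtcars data and two nested models chosen only for
illustration: the null hypothesis is the smaller model, the outer
hypothesis is the larger one, and the sum of squares is the squared
length of the difference between the two fitted vectors.

```r
full    <- lm(mpg ~ wt + hp, data = mtcars)   # outer hypothesis
reduced <- lm(mpg ~ wt, data = mtcars)        # null hypothesis: no hp term

# Difference between the two fitted vectors: a vector in sample space.
d <- fitted(full) - fitted(reduced)
ss_as_length <- sum(d^2)

# The same number appears as the "Sum of Sq" entry of the model comparison.
cmp <- anova(reduced, full)
print(cmp)
print(ss_as_length)
```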
                
                
                Bill Venables
                CSIRO Laboratories
                PO Box 120, Cleveland, 4163
                AUSTRALIA
                Office Phone (email preferred): +61 7 3826 7251
                Fax (if absolutely necessary):  +61 7 3826 7304
                Mobile:                         +61 4 8819 4402
                Home Phone:                     +61 7 3286 7700
                mailto:[EMAIL PROTECTED]
                http://www.cmis.csiro.au/bill.venables/
                

                -----Original Message-----
                From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Bernard Leemon
                Sent: Friday, 8 February 2008 7:42 AM
                To: r-help@r-project.org
                Subject: [R] a kinder view of Type III SS
                
                A young colleague (Matthew Keller) who is an ardent fan of R is
                teaching me much about R and discussions surrounding its use.
                He recently showed me some of the sometimes heated discussions
                about Type I and Type III sums of squares that have taken place
                over the years on this listserve.  I'm presumptuous enough to
                believe I might add a little clarity.  I write this from the
                perspective of someone old enough to have been grateful that
                the stat programmers (sometimes me coding in Fortran) thought
                to provide me with model tests I had not asked for when I
                carried heavy boxes of punched cards across campus to the card
                reader window only to be told to come back a day or two later
                for my output.  I'm also modern enough to know that
                anova(model1, model2), where model2 is a proper subset of
                model1, is all that I need and allows me to ask any question of
                my data that I want to ask rather than being constrained to
                those questions that the SAS or SPSS programmer thought I might
                want to ask.  I could end there, and we would probably all
                agree with what I have said to this point, but I want to push
                the issue a bit and say: it seems that Type III Sums of Squares
                are being unfairly maligned among the R cognoscenti.  And the
                practical ramification of this is that it creates a good deal
                of confusion among those migrating from SAS/SPSS land into R -
                not that this should ever be a reason to introduce a flawed
                technique into R, but my argument is that type III sums of
                squares are not a flawed technique.
                
                In my reading of the prior discussions on this list, my
                conclusion is that the Type I/Type III issue is a red herring
                that has generated unnecessary heat.  Base R readily provides
                both types.  summary(lm(y ~ x + w + z)) provides estimates and
                tests consistent with Type III sums of squares (it doesn't
                provide the SS directly but they are easily derived from the
                output) and anova(lm(y ~ x + w + z)) provides tests consistent
                with Type I sums of squares.  The names Type I and III are
                dreadful "gifts" from SAS and others.  I'd prefer "conditional
                tests" for those provided by summary() because what is
                estimated and tested are x|w,z, w|x,z and z|x,w [read these as
                "x conditional on w and z being in the model"] and "sequential"
                for those provided by anova(), being x, w|x, and z|x,w.  None
                of these tests is more or less valid or useful than any of the
                others.  It depends on which questions researchers want to ask
                of their data.
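
The distinction between the "conditional" and "sequential" tests can be
sketched as follows. The built-in mtcars data and the three predictors
are stand-ins for x, w and z, chosen only for illustration, and the model
contains no interaction.

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# "Conditional" tests: each t-test in summary() is for one term given
# all the others in the model.
t_vals <- coef(summary(fit))[c("wt", "hp", "qsec"), "t value"]

# The corresponding sums of squares are easily derived: SS = t^2 * sigma^2.
ss_from_summary <- t_vals^2 * summary(fit)$sigma^2

# drop1() reports the same sums of squares and F-tests directly.
d1 <- drop1(fit, test = "F")
print(d1)

# "Sequential" tests: wt, then hp | wt, then qsec | wt, hp.
seq_tab <- anova(fit)
print(seq_tab)

# Any single question can also be asked as an explicit model comparison,
# here hp | wt, qsec.
no_hp <- update(fit, . ~ . - hp)
cmp <- anova(no_hp, fit)
print(cmp)
```

Only the last row of the sequential table (qsec | wt, hp) is guaranteed
to agree with the conditional test for the same term.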
                
                Things get more interesting when z represents the interaction
                between x and w, such that z = x * w = xw.  Fundamentally
                everything is the same in terms of the above tests.  However,
                one must be careful to understand what the coefficient and test
                for x|w,xw and w|x,xw mean.  That is, x|w,xw tests the
                relationship between x and y when and only when w = 0.  A very,
                very common mistake, due to an overgeneralization of
                traditional anova models, is to refer to x|w,xw as the "main
                effect."  In my list of ten statistical commandments I include:
                "Thou shalt never utter the phrase main effect" because it
                causes so much unnecessary confusion.  In this case, x|w,xw is
                the SIMPLE effect of x when w = 0.  This means among other
                things that if instead we use w' = w - k so as to change the 0
                point on the w' scale, we will get a different estimate and
                test for x|w',xw'.  Many correctly argue that the main effect
                is largely meaningless in the presence of an interaction
                because it implies there is no common average effect.  However,
                that does not invalidate x|w,xw because it is NOT a "main"
                (sense "principal" or "chief") effect but only a "simple"
                effect for a particular level of w.  A useful strategy for
                testing a variety of simple effects is to subtract different
                constants k from w so as to change the 0 value to focus the
                test on particular simple effects.
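
The effect of moving the zero point of w can be sketched on simulated
data. Everything here (the seed, the sample size, the true coefficients
and the constant k = 5) is an assumption made purely for illustration.

```r
set.seed(1)   # simulated data, purely illustrative
n <- 200
x <- rnorm(n)
w <- rnorm(n, mean = 5)
y <- 1 + 2 * x + 0.5 * w + 1.5 * x * w + rnorm(n)

fit0 <- lm(y ~ x * w)          # coefficient of x: simple effect at w = 0

k <- 5                         # move the zero point of w to w = 5
w_shift <- w - k
fit_k <- lm(y ~ x * w_shift)   # coefficient of x: simple effect at w = 5

b <- coef(fit0)
# Simple effect of x at w = k, computed from the original fit.
simple_at_k <- unname(b["x"] + k * b["x:w"])

print(coef(summary(fit0))["x", ])
print(coef(summary(fit_k))["x", ])
print(simple_at_k)
```

The interaction coefficient and the overall fit are unchanged by the
shift; only the coefficient of x, and the test attached to it, change.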
                
                
                If x and w are both contrast codes (-1 or 1) for the two
                factors of a 2 x 2 design, then x|w,xw is the simple effect of
                x when w = 0.  While w never equals 0, in a balanced design w
                does equal 0 on average.  In that one very special case, the
                simple effect of x when w = 0 equals the average of all the
                simple effects and in that one special case one might call it
                the "main effect."  However, in all other situations it is only
                the simple effect when w = 0.  If we discard the term "main
                effect", then a lot of unnecessary confusion goes away.  Again,
                if one is interested in the simple effect of x for a particular
                level of w, then one might want to use, instead of a contrast
                code, a dummy code where the value of 0 is assigned to the
                level of w of interest and 1 to the other level.
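
A small sketch of the contrast-code versus dummy-code point, on a
simulated balanced 2 x 2 design (the seed, cell size and true
coefficients are assumptions for illustration only):

```r
set.seed(2)   # simulated balanced 2 x 2 design, purely illustrative
d <- expand.grid(x = c(-1, 1), w = c(-1, 1), rep = 1:10)
d$y <- 3 + 1.0 * d$x + 0.5 * d$w + 0.8 * d$x * d$w + rnorm(nrow(d))

# Contrast codes (-1 / 1) for both factors.
fit_contrast <- lm(y ~ x * w, data = d)

# Slope of y on x within each level of w (the two simple effects).
slope_lo <- unname(coef(lm(y ~ x, data = subset(d, w == -1)))["x"])
slope_hi <- unname(coef(lm(y ~ x, data = subset(d, w ==  1)))["x"])

# Dummy code: 0 for the level of interest (here w = -1), 1 for the other.
d$w_dummy <- ifelse(d$w == -1, 0, 1)
fit_dummy <- lm(y ~ x * w_dummy, data = d)

print(c(contrast_coded = unname(coef(fit_contrast)["x"]),
        average_simple = mean(c(slope_lo, slope_hi)),
        dummy_coded    = unname(coef(fit_dummy)["x"]),
        simple_at_lo   = slope_lo))
```

With contrast codes the coefficient of x matches the average of the two
simple effects; with the dummy code it matches the simple effect at the
level coded 0.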
                
                When factors have multiple levels, it is best to have
                orthogonal contrast codes to provide 1-df tests of questions of
                interest.  Products of those codes are easily interpreted as
                the simple difference for one contrast when the other contrast
                is fixed at some level.  Multiple degree of freedom omnibus
                tests are troublesome but are only of interest if we are
                fixated on concepts like 'main effect.'
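
As a rough illustration of 1-df tests from orthogonal contrast codes, the
sketch below attaches two hypothetical contrasts to the three-level cyl
factor in the built-in mtcars data. The particular contrasts, and the
names c1 and c2, are assumptions chosen only for the example.

```r
# Two orthogonal contrasts for a three-level factor (hypothetical choice).
codes <- cbind(c1 = c(-2, 1, 1),   # level 1 vs the average of levels 2 and 3
               c2 = c(0, -1, 1))   # level 2 vs level 3
stopifnot(sum(codes[, "c1"] * codes[, "c2"]) == 0)   # orthogonality check

cyl <- factor(mtcars$cyl)          # levels "4", "6", "8"
contrasts(cyl) <- codes
fit <- lm(mpg ~ cyl, data = mtcars)

# One coefficient, and one 1-df t-test, per contrast.
print(coef(summary(fit)))

# The c2 coefficient is half the difference between the 8- and 6-cylinder
# cell means, because those levels are coded +1 and -1.
cell_means <- tapply(mtcars$mpg, cyl, mean)
half_diff <- unname((cell_means["8"] - cell_means["6"]) / 2)
print(half_diff)
```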
                
                gary mcclelland (aka bernie leemon)
                colorado
                
                
                ______________________________________________
                R-help@r-project.org mailing list
                https://stat.ethz.ch/mailman/listinfo/r-help
                PLEASE do read the posting guide
                http://www.R-project.org/posting-guide.html
                and provide commented, minimal, self-contained,
reproducible code.
                