On 07-Oct-08 22:23:22, Bert Gunter wrote:
> But it **is** indexed in both of V&R's MASS and S Programming.
> I have no idea whether the info there will be helpful to you,
> of course. I would find (and have found) it so.
> -- Bert Gunter

The discussion of factors in V&R is certainly quite comprehensive,
but it is not for beginners!

A more elementary and very readable published text is Peter Dalgaard's
"Introductory Statistics with R".

An even more introductory, but still adequate, account can be found
in various places of Julian Faraway's "Practical Regression and Anova
using R" which is on-line on CRAN under Documentation/Contributed.
However, you will need to piece together the bigger picture from
passages found in various places. There is no index, but a search
for "factor" in the PDF file throws up:

  pages 11; 69-70;
  Chapter 15 (160-167) -- especially section 15.2;
  Chapter 16 (168-203) -- though this deals mainly with factorial
    experimental designs.

A reference with more detail at the technical level from the R
viewpoint (but still well spelt out) is John Maindonald's "Using R
for Data Analysis and Graphics - Introduction, Examples and
Commentary", especially section 2.4. This is also on-line in the
same section of CRAN.

That being said, on the grounds that an introductory outline may
also be useful to others, here is a summary.

Factors are variables which, essentially, introduce a "contingency
table" structure into the data (and they can co-exist with variables
which have quantitative interpretation).

A factor is a variable with categorical values -- an item is an "A",
or a "B", or a "C", ... -- used in a particular way. It may or may
not make sense to consider A, B, C, ... as ordered: A < B < C < ... say.

For example, a variable called Sex may have values "M" (for Male)
or "F" (for Female). Whether one can consider that M < F is something
I will not discuss (though others may have a view).

Or Social Class may have categories
  A (highest) > B > C > D > E (lowest).

Or, say, an ecological classification of terrain may use "Grassland",
"Forest", "Swamp" with no implication of any ordering: they are all
on the same footing.

The category labels of factors are called "Levels".

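As a small illustration of the three examples just given, here is a minimal R sketch. The variable names (terrain, social, sex) and the data values are made up for the purpose; only factor(), levels() and nlevels() are standard R.

```r
# Unordered factor: terrain types, all on the same footing.
terrain <- factor(c("Grassland", "Forest", "Swamp", "Forest"))
levels(terrain)    # by default the levels are the sorted unique values
nlevels(terrain)   # 3

# Ordered factor: Social Class, listing the levels from lowest to highest.
social <- factor(c("B", "E", "A", "C"),
                 levels = c("E", "D", "C", "B", "A"),
                 ordered = TRUE)
social[1] > social[2]   # comparisons are meaningful: is B above E?

# Numeric codes are still only labels: 1 and 2 recoded as "M" and "F".
sex <- factor(c(1, 2, 2, 1), levels = c(1, 2), labels = c("M", "F"))
table(sex)
```

Note that an unordered factor such as terrain will refuse comparisons like "<" (R returns NA with a warning), which is exactly the "no implication of any ordering" behaviour.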
As seen in the data, these labels may be alphabetic, numeric, or
both -- e.g. M or F for Sex, which people also often code as 1 or 2
(but with no implication that 1 < 2); Terrain may be G, F or S or
1, 2, 3; Social Class may be subdivided into A1, A2, B1, B2, ...
(with implied ordering A1 > A2 > B1 > B2 > ... ).

In regression analysis, the usefulness of factors is that they allow
comparison between the outcomes for different levels of the factors.
In simple cases the result may be as simple as the difference between
the mean of cases with level A and the mean of cases with level B of
a single factor.

This is where the plot starts to thicken.

For example, if Terrain were coded 1, 2, 3 you would not want to
treat these as quantitative values (even if they represented ordered
levels). Instead, a factor with k levels is presented to the
regression in terms of k "dummy variables". If the regression model
has an intercept, then one level (the "base level") of the factor
will be absorbed into the Intercept.

So, for instance, data on weight (kg) might look like

  Sex   Weight
   M     69.5
   F     60.2
   F     65.7
   M     72.5
  ....

This would be transformed into

  Sex.M  Sex.F  Weight
    1      0     69.5
    0      1     60.2
    0      1     65.7
    1      0     72.5

where, now, the 0s and 1s will have their *quantitative*
interpretation. So the regression model

  Weight ~ Sex

now becomes the quantitative regression

  Weight = a + b.M*Sex.M + b.F*Sex.F + error

using the values 0 and 1 of Sex.M and Sex.F quantitatively.

However, since Sex.F + Sex.M = 1 throughout, one is redundant in
the presence of the intercept (whose "dummy" equivalent has value 1
throughout). Hence the results of this regression will usually be
presented as Intercept together with the coefficient of (say) Sex.F.

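To see this in R itself, here is a minimal sketch using the four rows of data above. The data frame name d is my own choice, and note that R labels the dummy column "SexF" rather than "Sex.F". It also assumes R's default treatment contrasts are in force, with M set as the base level.

```r
# The four observations from the table above; M is made the base level.
d <- data.frame(Sex    = factor(c("M", "F", "F", "M"), levels = c("M", "F")),
                Weight = c(69.5, 60.2, 65.7, 72.5))

# The design matrix R builds for Weight ~ Sex: an intercept column of 1s
# plus a single 0/1 dummy for the non-base level F.
X <- model.matrix(Weight ~ Sex, data = d)
X

# Fit the regression: the intercept is the mean weight of the base level (M),
# and the SexF coefficient is (mean of F) - (mean of M).
fit <- lm(Weight ~ Sex, data = d)
coef(fit)
```

With these four rows the intercept comes out as the Male mean (71) and the SexF coefficient as the Female-minus-Male difference (-8.05).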
However, if you left out the Intercept, giving the model formula

  Weight ~ Sex - 1

then the above data matrix with both dummy variables Sex.M and Sex.F
would be used in full in the regression, which would fit the equation

  Weight = b.M*Sex.M + b.F*Sex.F + error

without redundancy (and in this case the coefficients would be the
mean of the weights of Males [b.M] and the mean of the weights of
Females [b.F]).

If there are two factors in the regression, say Sex (M/F) and Diet
(M = meat-eater, V = vegetarian), then the possibilities are richer.
One might then have, for the regression model

  Weight ~ Sex + Diet

  Sex.M  Sex.F  Diet.M  Diet.V  Weight
    1      0      0       1      69.5
    0      1      0       1      60.2
    0      1      0       1      65.7
    1      0      0       1      72.5
    1      0      1       0      74.5
    0      1      1       0      65.2
    0      1      1       0      70.7
    1      0      1       0      77.5

which would fit the equation

  Weight = a + b.S.F*Sex.F + b.D.V*Diet.V + error

with the same absorption of a base-level of each factor into the
Intercept a (since now we have 2 redundancies: for each factor, the
two dummy variables add up to 1). The coefficient of Sex.F will
represent a difference between Males and Females, the coefficient of
Diet.V will represent a difference between meat-eaters and
vegetarians.

Because of the redundancies, an equivalent representation of the data
used in the calculations is

  Sex.F  Diet.V  Weight
    0      1      69.5
    1      1      60.2
    1      1      65.7
    0      1      72.5
    0      0      74.5
    1      0      65.2
    1      0      70.7
    0      0      77.5

But now we have the opportunity to ask: Is the difference between
meat-eater and vegetarian Males the same as the difference between
meat-eater and vegetarian Females?

Now we need the Interaction -- the difference, between Males and
Females, of the two differences between the two diets: one difference
evaluated for Males, the other for Females.

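As a brief aside, the no-intercept fit and the two-factor additive fit described above can be tried out directly. This is only a sketch on the eight made-up rows from the table. The data frame name d is my own, R's column names ("SexF", "DietV") differ slightly from the Sex.F/Diet.V notation used here, and default treatment contrasts are assumed.

```r
# The eight observations from the two-factor table above.
d <- data.frame(
  Sex    = factor(c("M", "F", "F", "M", "M", "F", "F", "M"), levels = c("M", "F")),
  Diet   = factor(c("V", "V", "V", "V", "M", "M", "M", "M"), levels = c("M", "V")),
  Weight = c(69.5, 60.2, 65.7, 72.5, 74.5, 65.2, 70.7, 77.5)
)

# No intercept: both Sex dummies are used, and the coefficients are
# simply the group means of Weight for M and for F.
fit0 <- lm(Weight ~ Sex - 1, data = d)
coef(fit0)
tapply(d$Weight, d$Sex, mean)   # the same two numbers

# Two factors, additive model: intercept plus one dummy per factor.
fit2 <- lm(Weight ~ Sex + Diet, data = d)
colnames(model.matrix(fit2))
coef(fit2)
```

None of this yet addresses the interaction question just raised, which needs one more term in the model.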
This leads to the regression model

  Weight ~ Sex * Diet

equivalent to

  Weight ~ Sex + Diet + Sex:Diet

and we now need a further dummy variable for the different
combinations of levels of the two factors:

  Sex.F  Diet.V  Sex.F:Diet.V  Weight
    0      1          0         69.5
    1      1          1         60.2
    1      1          1         65.7
    0      1          0         72.5
    0      0          0         74.5
    1      0          0         65.2
    1      0          0         70.7
    0      0          0         77.5

where the variable Sex.F:Diet.V has the value 1 when Sex.F=1 and
Diet.V=1, and the value 0 otherwise.

This is all very basic and straightforward (though it can appear more
complicated in richer problems). But the point about using a variable
of "factor" type in R is beginning to emerge.

When there is a factor with k levels, you need (k-1) dummy variables
as quantitative variables for the regression (when the model has an
intercept). Interactions introduce further dummy variables. For all
this to happen, a variable which is going to be used as a factor
needs a special representation inside R, so that R knows how to set
about constructing all that stuff.

So, in R, a factor is not a simple list of levels (like
c("M","F","F","M","M","F","F","M")), but a more elaborate encoding,
and a more complex structure.

Once past this stage, there is then the question of what system of
*contrasts* is going to be used. For 2-level factors (as above) there
are not many issues which arise -- the effect of a factor corresponds
to a simple difference between the results corresponding to its two
levels.

But, say, for the Terrain factor (G,F,S) there are several ways in
which differences can be formulated. For example:

  G, F-G, S-G
  ("treatment contrasts")

Or, for Social Class (ordered, A>B>C>D>E)

  D-E, C-D, B-C, A-B
  ("successive difference contrasts")

  E, D-E, C-(mean of D&E), B-(mean of C&D&E), A-(mean of B&C&D&E)
  ("Helmert contrasts")

and so on. What system of contrasts you use will depend on what
aspects of the differences between categories you are interested in.

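Both points above, the interaction dummy and the choice of contrasts, can be inspected in R. The following is a sketch only: the data frame d and the terrain vector are made up, and R names the interaction column "SexF:DietV". contr.treatment() and contr.helmert() are in base R's stats package. Successive-difference contrasts are provided by contr.sdif() in the MASS package, which is not loaded here. Also bear in mind that R's contr.helmert() compares each level with the mean of the levels *before* it in the levels ordering, so whether it matches the listing above depends on how you order the levels.

```r
# The eight observations from the interaction table above.
d <- data.frame(
  Sex    = factor(c("M", "F", "F", "M", "M", "F", "F", "M"), levels = c("M", "F")),
  Diet   = factor(c("V", "V", "V", "V", "M", "M", "M", "M"), levels = c("M", "V")),
  Weight = c(69.5, 60.2, 65.7, 72.5, 74.5, 65.2, 70.7, 77.5)
)

# Design matrix with interaction: the extra column is the product of
# the two main-effect dummies (1 only for vegetarian Females).
Xint <- model.matrix(Weight ~ Sex * Diet, data = d)
colnames(Xint)
all(Xint[, "SexF:DietV"] == Xint[, "SexF"] * Xint[, "DietV"])

# Contrast matrices: rows are levels, columns are the dummy variables.
contr.treatment(3)   # each level compared with the first (base) level
contr.helmert(3)     # each level compared with the mean of earlier levels

# Contrasts are stored with the factor and can be changed per factor.
terrain <- factor(c("G", "F", "S", "G", "S"), levels = c("G", "F", "S"))
contrasts(terrain)                       # default for an unordered factor
contrasts(terrain) <- contr.helmert(3)   # switch this factor to Helmert
contrasts(terrain)
```

Changing the contrasts changes how the coefficients are to be read, not the fitted values of the model.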
And then the contrast specification also has to be part of the
specification of a factor (since it determines how to compute the
dummy variables which will represent it in the regression).

See John Maindonald's on-line book.

Hoping this helps!
Ted.

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> Sent: Tuesday, October 07, 2008 2:29 PM
> To: r-help@r-project.org
> Subject: [R] Factor tutorial?
>
> This is probably a very basic question. I want to understand
> factors but I am not sure where to turn. Looking up factor in the
> Chambers book doesn't even show up in the index. Maybe I am just
> slow but ?factor doesn't help either. Would someone please point
> me to a very basic tutorial where I can see what the usefulness of
> factors is (so far they have just gotten in the way).
>
> Thank you.
>
> Kevin
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Oct-08                                       Time: 01:30:31
------------------------------ XFMail ------------------------------