I think you may be unintentionally conflating two issues. One is
whether R has the statistical functions to deal with such an
arrangement, and here "mixed models" is the phrase to watch for. The
other is whether it has pre-written data management functions that
directly support the particular data layout you get from public-access
government files. The second is what I _thought_ you were soliciting in
your original posting. I was a bit surprised that no one mentioned the
survey package, since I have seen it used in such situations, but I
cannot track down the citation at the moment. You might want to look
at Gelman's blog:
http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html
See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1
And Damico's article:
"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis
Techniques in Health Policy Data"
R Journal, 2009, no. 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
--
David.
On May 5, 2010, at 10:23 PM, Marshall Feldman wrote:
Thanks for sharing this, Ista.
I've come to the conclusion that R doesn't have what I'm looking for,
either in the base or the packages.
Although your examples are insightful, the examples we've been
discussing are deliberately easier than what one would expect in most
serious applications. Imagine for instance that we're studying wage
structures of industries in different geographic labor markets. We
therefore might have four variables: wages, industries, occupations,
and places. We might want to see if wage differentials are more or
less constant or if they are higher in some geographic areas than in
others.
Since industries, occupations, and places are typically coded
hierarchically as we've been discussing, we might want to figure out
how to examine different wage levels within industries, etc. Doing this
manually would require lots of w
whereas conceptually the
On 5/4/2010 6:00 AM,
Message: 49
Date: Mon, 3 May 2010 13:22:59 -0400
From: Ista Zahn <istaz...@gmail.com>
To: Marshall Feldman <ma...@uri.edu>
Cc: r-help@r-project.org
Subject: Re: [R] Hierarchical factors
Message-ID: <x2xf55e7cf51005031022se4c46967s174efeef95331...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
Hi Marshall,
I'm not aware of any packages that implement these features as you
described them. But most of the tasks are already fairly easy in R --
see below.
On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <ma...@uri.edu>
wrote:
Thanks for getting back so quickly Ista,
I was actually casting about for any examples of R software that
deals with this kind of structure. But your question is a good
one. Here are a few things I'd like to be able to do:
Store data in R at the finest level of detail but easily refer to
higher levels of aggregation. If the data include such higher
levels, this is trivial, but otherwise I'd like to aggregate
fairly easily. The following is not functioning code, but it
should give you the idea:
start with a data frame (call it d) having row.names = to the 6
digit NAICS code and columns w/ various variables, assume one is
named employment.
    d[,"employment"]          # Would print all employment data
    d["441222","employment"]  # Would print only Boat Dealer employment
    d["44","employment"]      # Would print total employment for Retail Trade
d[, "employment"]                               # prints all employment data
d[rownames(d) == "441222", "employment"]        # prints only boat dealer employment
sum(d[grep("^44", rownames(d)), "employment"])  # prints total employment for retail trade
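To make the prefix idea concrete, here is a minimal sketch using only base R. The data values are made up, and total_for() is a hypothetical helper written for this illustration, not a function from any package.

```r
# Toy data: row names are 6-digit NAICS-style codes (values are made up).
d <- data.frame(
  employment = c(10, 20, 30, 40),
  row.names  = c("441222", "441110", "442110", "423110")
)

# Hypothetical helper: total a column over every row whose code starts
# with the given prefix.
total_for <- function(d, prefix, column) {
  sum(d[startsWith(rownames(d), prefix), column])
}

total_for(d, "44", "employment")      # 10 + 20 + 30 = 60
total_for(d, "441222", "employment")  # 10

# Totals for every 2-digit sector at once, keyed by the 2-digit prefix.
sector_totals <- tapply(d$employment, substr(rownames(d), 1, 2), sum)
sector_totals
```

Matching on the start of the code, rather than on the whole row name, is what lets a single table stored at the finest level answer questions at any coarser level.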
Recursive nesting. I'm not sure how to convey this except with
examples. Suppose the data frame also has a "wages" column with
average weekly wages in the industry, and the industry code is
also a factor variable (industry). So a simple analysis of
variance might look like:
    w <- aov(wages ~ industry, d)
But now what I'd like to do is to break this down within
2-digit sectors. Assuming the data frame has another variable,
industry2, this would look like:
    w <- aov(wages ~ industry2/industry)
But what if we either (a) don't want to bother creating
separate variables for each level of aggregation in industry or
(b) want to extend the model formula language to include
various nesting strategies? This might look like:
    w <- aov(wages ~ industry//*)
    # Nest all meaningful levels: industry/industry2/industry3/
    # industry4/industry5/industry6. If the coding system skips some
    # levels, R is smart enough to omit the skipped levels.
    w <- aov(wages ~ industry//levels 2,4,6)
    # I'm using "//" as a hypothetical extension to the model language
    # that is followed by a "levels" keyword and then a list of levels
    # within the hierarchy. This example would expand to
    # aov(wages ~ industry2/industry4/industry6)
One could extend this last example to include a notation
allowing the analysis to be repeated at varying levels of depth
(e.g., industry||2,6 would repeat the ANOVA for industry2 and
industry6).
I can see how that might be useful. But it is easy enough to split
the variables out, for example (assuming that each level consists of
two digits):
d$ind1 <- substr(rownames(d), 1, 2)
d$ind2 <- substr(rownames(d), 3, 4)
d$ind3 <- substr(rownames(d), 5, 6)
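As a rough sketch of how split-out codes feed a nested ANOVA, the example below uses cumulative prefixes (the first 2 digits and the first 4 digits) as the grouping factors, which is one possible variant. The codes are NAICS-style, the wage figures are invented, and the column names ind2 and ind4 are assumptions made for this illustration.

```r
# Toy data: twelve 6-digit codes with made-up average weekly wages.
codes <- c("423110", "423120", "423210", "423220",
           "441110", "441120", "441210", "441222",
           "442110", "442210", "442291", "442299")
d <- data.frame(
  wages = c(1000, 1040, 950, 990, 800, 820, 900, 880, 700, 720, 760, 740),
  row.names = codes
)

# Cumulative prefixes: each label identifies its group unambiguously.
d$ind2 <- factor(substr(rownames(d), 1, 2))
d$ind4 <- factor(substr(rownames(d), 1, 4))

# 4-digit industry groups nested within 2-digit sectors.
fit <- aov(wages ~ ind2/ind4, data = d)
summary(fit)
```

Nesting stops at the 4-digit level here because each 6-digit code appears only once in the toy data, so a deeper term would leave no residual degrees of freedom.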
Since the factor hierarchy is completely nested (i.e., every 6-digit
industry is below a 5-digit industry), a single function
can operate on the codes recursively. Three variants come to
mind. In the first, we'd use some kind of apply function to drill
down to a certain level and return a list of results, one for
each level:
    means <- drill(wages, industry, mean)
    # Would return a list. The first component would be a vector of
    # mean wages for industries at the 2-digit level, the second, a
    # vector for the 3-digit level, etc.
    means <- drill(wages, industry, mean, maxlvl = 3)
    # Would stop at the 3rd level of the hierarchy (4-digit code). One
    # could also imagine a maxdigits option as an alternative
    # (maxdigits = y means stop at the y-digit level).
Again, I can see how this would be useful, but it's already pretty
easy (once we have split out the grouping variables) to do something
like
grp.means <- list(
  l1 = aggregate(d$wages, list(d$ind1), mean),
  # ind2 and ind3 hold only two digits each, so group by the full path
  l2 = aggregate(d$wages, list(d$ind1, d$ind2), mean),
  l3 = aggregate(d$wages, list(d$ind1, d$ind2, d$ind3), mean)
)
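For what it's worth, a drill()-style helper of the kind requested can be sketched in a few lines of base R. This is only an illustration: the function name comes from the request, while the digits argument and the toy values are assumptions. It also takes unweighted means over the rows supplied, which may not be what a real analysis needs.

```r
# Hypothetical drill(): apply `fun` to `x` within every k-digit prefix
# of `codes`, for each k in `digits`, and return one result per level.
drill <- function(x, codes, fun, digits = 2:6) {
  codes <- as.character(codes)
  out <- lapply(digits, function(k) tapply(x, substr(codes, 1, k), fun))
  names(out) <- paste0("digits", digits)
  out
}

# Toy data (values are made up).
codes <- c("441110", "441222", "442110", "423110")
wages <- c(800, 900, 700, 1000)

means <- drill(wages, codes, mean, digits = c(2, 4, 6))
means$digits2   # "42" -> 1000, "44" -> 800
means$digits4   # one mean per 4-digit prefix
```

Passing the levels as a vector of digit counts covers both "stop at a given depth" and "skip some levels" without needing separate options.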
I know this wasn't what you were looking for (as I said, I'm not
aware of any package that implements the functionality you describe).
But the existing facilities in R are quite flexible, and handling this
kind of data in R is already fairly straightforward.
Best,
Ista
Second, suppose we have a data frame like d, only this time it's
a time series (each row is a different date). Now we might want
to generate vectors of the rate of change in employment at each
industry level. It might look like:
    rate <- function(x) { (x - lag(x)) / lag(x) }
    rates <- list()
    i <- 1
    for (j in levels(industry)) {
        # The levels function parses the hierarchical factor into the
        # various levels of its coding system
        rates[[i]] <- rate(employment[, level(industry) == j])
        # The level function sets a particular one of these levels
        i <- i + 1
    }
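A working approximation of this idea in base R might look like the sketch below. It assumes employment is stored as a matrix with one row per date and one column per 6-digit code; the function names growth() and rates_by_level() and all the numbers are invented for the illustration. It uses diff() rather than lag(), because stats::lag() shifts the time base of a series rather than its values.

```r
# Toy employment matrix: rows are dates, columns are 6-digit codes.
emp <- matrix(
  c(100, 110, 121,
    200, 220, 242,
     50,  50,  50),
  nrow = 3,
  dimnames = list(c("2008", "2009", "2010"),
                  c("441110", "441222", "423110"))
)

# Period-over-period rate of change of a numeric vector.
growth <- function(x) diff(x) / head(x, -1)

# Hypothetical helper: for each digit level, sum the columns that share
# a prefix, then compute growth rates for each aggregated series.
rates_by_level <- function(emp, digits = c(2, 4, 6)) {
  out <- lapply(digits, function(k) {
    totals <- t(rowsum(t(emp), substr(colnames(emp), 1, k)))
    apply(totals, 2, growth)
  })
  names(out) <- paste0("digits", digits)
  out
}

rates <- rates_by_level(emp)
rates$digits2   # column "44" is 0.1 in both periods, column "42" is 0
```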
A third variant would be a genuinely recursive function that
keeps on calling itself at each level of the factor until it has
either reached a pre-specified depth or exhausted all levels of
the factor.
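One way such a recursive function could look is sketched here in base R. The name drill_recursive(), its arguments, and the toy data are all assumptions made for this illustration. It descends one digit at a time and stops when it reaches max_digits or runs out of digits in the codes.

```r
# Hypothetical recursive drill: returns a nested list in which each node
# holds fun(x) for its group plus one child per next-longer prefix.
drill_recursive <- function(x, codes, fun, digits = 2,
                            max_digits = max(nchar(codes))) {
  codes <- as.character(codes)
  result <- list(value = fun(x))
  if (digits <= max_digits) {
    groups <- split(seq_along(x), substr(codes, 1, digits))
    result$children <- lapply(groups, function(idx) {
      drill_recursive(x[idx], codes[idx], fun, digits + 1, max_digits)
    })
  }
  result
}

# Toy data (values are made up).
codes <- c("441110", "441222", "442110", "423110")
wages <- c(800, 900, 700, 1000)

tree <- drill_recursive(wages, codes, mean, max_digits = 4)
tree$value                                     # overall mean: 850
tree$children[["44"]]$value                    # mean over 44xxxx: 800
tree$children[["44"]]$children[["441"]]$value  # mean over 441xxx: 850
```

Returning a tree keeps each group's result next to its sub-groups, at the cost of being more awkward to tabulate than a flat list with one element per level.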
I hope this gives you a good idea of the sorts of things one
might do with hierarchical factors.
Marsh Feldman
On 5/3/2010 9:57 AM, Ista Zahn wrote:
Hi Marshall,
What exactly do you mean by "handles this kind of data structure"?
What do you want R to do?
Best,
Ista
On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman <ma...@uri.edu>
wrote:
Hello,
Hierarchical factors are a very common data structure. For instance,
one might have municipalities within states within countries within
continents. Other examples include occupational codes, biological
species, software types (R within statistical software within
analytical software), etc.
Such data structures commonly use hierarchical coding systems. For
example, the 2007 North American Industry Classification System
(NAICS) <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007>
has twenty two-digit codes (e.g., 42 = Wholesale trade), within each
of these varying numbers of 3-digit codes (e.g., 423 = Merchant
wholesalers, durable goods), then varying numbers of 4-digit codes
(4231 = Motor Vehicle and Motor Vehicle Parts and Supplies Merchant
Wholesalers), then varying numbers of five-digit codes, varying
numbers of six-digit codes, etc. At the lowest level (longest code)
one can readily tell all the higher levels. For example, 441222 is
"Boat Dealers" who are part of 44122, "Motorcycle, Boat, and Other
Motor Vehicle Dealers," which is part of 4412 (Other Motor Vehicle
Dealers), which is part of 441 (Motor Vehicle and Parts Dealers),
which is part of 44 (Retail Trade). (The US Census Bureau has extended
the 6-digit NAICS to an even more fine-grained 10-digit system.)
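The property that the longest code reveals every higher level can be shown in a couple of lines of base R. This is only a sketch; naics_ancestors() is a hypothetical name chosen for the illustration.

```r
# Hypothetical helper: list every level a code belongs to by taking
# successively longer prefixes, from min_digits up to the full code.
naics_ancestors <- function(code, min_digits = 2) {
  code <- as.character(code)
  substring(code, 1, seq(min_digits, nchar(code)))
}

naics_ancestors("441222")
# "44" "441" "4412" "44122" "441222"
```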
I haven't seen any R packages or sample code that handles this kind of
data, but I don't want to reinvent the wheel and would rather stand on
the shoulders of you giants. Is there any package or other R-based
software out there that handles this kind of data structure?
Thanks,
Marsh Feldman
[[alternative HTML version deleted]]
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research
The University of Rhode Island
email: marsh @ uri .edu (remove spaces)
Contact Information:
Kingston:
202 Hart House
Charles T. Schmidt Labor Research Center
The University of Rhode Island
36 Upper College Road
Kingston, RI 02881-0815
tel. (401) 874-5953:
fax: (401) 874-5511
Providence:
206E Shepard Building
URI Feinstein Providence Campus
80 Washington Street
Providence, RI 02903-1819
tel. (401) 277-5218
fax: (401) 277-5464
--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org
David Winsemius, MD
West Hartford, CT