Hi Marshall,
I'm not aware of any packages that implement these features as you
described them. But most of the tasks are already fairly easy in R --
see below.
On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <[email protected]> wrote:
>
> Thanks for getting back so quickly Ista,
>
> I was actually casting about for any examples of R software that deals with
> this kind of structure. But your question is a good one. Here are a few
> things I'd like to be able to do:
>
> Store data in R at the finest level of detail but easily refer to higher
> levels of aggregation. If the data include such higher levels, this is
> trivial, but otherwise I'd like to aggregate fairly easily. The following is
> not functioning code, but it should give you the idea:
>
> start with a data frame (call it d) having row.names = to the 6 digit NAICS
> code and columns w/ various variables, assume one is named employment.
> d[,"employment"] # Would print all employment data
> d["441222","employment"] # Would print only Boat Dealer employment
> d["44","employment"] # Would print total employment for Retail Trade
d[,"employment"] # prints all employment data
d[rownames(d) == "441222","employment"] # prints only boat dealer employment
sum(d[grep("^44", rownames(d)),"employment"]) # prints total employment for retail trade
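To make that concrete, here is a minimal sketch on a toy data frame. The employment figures are made up, the codes are only there to illustrate the prefix matching, and the names retail_total and sector_totals are mine:

```r
# Toy data: row names are 6-digit codes, employment values are invented.
d <- data.frame(
  employment = c(120, 80, 200, 50),
  row.names  = c("441222", "441110", "445110", "423110")
)

# Employment for a single 6-digit industry
d["441222", "employment"]

# Total employment for every industry whose code starts with "44"
retail_total <- sum(d[startsWith(rownames(d), "44"), "employment"])

# Totals for all 2-digit prefixes at once
sector_totals <- tapply(d$employment, substr(rownames(d), 1, 2), sum)
```

Using substr() with tapply() gives the totals for every sector in one call, so there is no need to write one grep() per prefix.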
>
> Recursive nesting. I'm not sure how to convey this except with examples.
> Suppose the data frame also has a "wages" column with average weekly wages in
> the industry, and the industry code is also a factor variable (industry). So
> a simple analysis of variance might look like:
>
> w <- aov(wages ~ industry, d)
>
> But now what I'd like to do is to break this down within 2-digit
> sectors. Assuming the data frame has another variable, industry2, this would
> look like:
>
> w <- aov(wages ~ industry2/industry, d)
>
> But what if we either (a) don't want to bother creating separate
> variables for each level of aggregation in industry or (b) want to extend
> the model formula language to include various nesting strategies? This might
> look like:
>
> w <- aov(wages ~ industry//*)
> # Nest all meaningful levels:
> # industry/industry2/industry3/industry4/industry5/industry6. If the coding
> # system skips some levels, R is smart enough to omit the skipped levels.
> w <- aov(wages ~ industry//levels 2,4,6)
> # I'm using "//" as a hypothetical extension to the model language that is
> # followed by a "levels" keyword and then a list of levels within the
> # hierarchy. This example would expand
> # to aov(wages ~ industry2/industry4/industry6)
>
> One could extend this last example to include a notation allowing the
> analysis to be repeated at varying levels of depth (e.g., industry||2,6
> would repeat the ANOVA for industry2 and industry6).
>
I can see how that might be useful. But it is easy enough to split the
variables out, for example (assuming that each level consists of two
digits):
# keep the full prefix at each level so codes stay unique across sectors
d$ind1 <- substr(rownames(d), 1, 2)
d$ind2 <- substr(rownames(d), 1, 4)
d$ind3 <- substr(rownames(d), 1, 6)
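
Once the grouping variables exist, the nested model can be written with the ordinary "/" operator. The following is only a sketch on simulated data: the wages are random numbers, the codes are illustrative, and the column names industry2, industry4 and industry6 are chosen to match your hypothetical expansion above.

```r
set.seed(1)  # simulated data, for illustration only
codes <- c("441110", "441222", "445110", "445120", "423110", "423120")
d <- data.frame(
  code  = rep(codes, each = 5),
  wages = round(rnorm(30, mean = 700, sd = 50))
)

# One grouping factor per level, each holding the full code prefix
for (n in c(2, 4, 6)) {
  d[[paste0("industry", n)]] <- factor(substr(d$code, 1, n))
}

# 6-digit industries nested in 4-digit groups nested in 2-digit sectors
w <- aov(wages ~ industry2/industry4/industry6, data = d)
summary(w)
```

Changing the vector c(2, 4, 6) is enough to cover coding systems whose levels are not all two digits wide.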
> Since the factor hierarchy is completely nested (i.e., every 6-digit industry
> is below a 5 digit industry), a single function can operate on the codes
> recursively. Three variants come to mind. In the first, we'd use some kind of
> apply function to drill down to a certain level and return a list of results,
> one for each level:
>
> means <- drill(wages,industry,mean)
> # Would return a list. The first component would be a vector of mean wages for
> # industries at the 2-digit level, the second, a vector for the 3-digit level,
> # etc.
> means <- drill(wages,industry,mean,maxlvl=3)
> # Would stop at the 3rd level of the hierarchy (4-digit code). One could also
> # imagine a maxdigits option as an alternative (maxdigits = y means stop at the
> # y-digit level)
>
Again, I can see how this would be useful, but it's already pretty
easy (once we have split out the grouping variables) to do something
like
grp.means <- list(
  l1 = aggregate(d$wages, list(d$ind1), mean),
  l2 = aggregate(d$wages, list(d$ind2), mean),
  l3 = aggregate(d$wages, list(d$ind3), mean)
)
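
If you want something closer to the drill() call you describe, it can be wrapped in a few lines. This is a rough sketch, not an existing function: drill() and its digits argument are hypothetical names, and the wages are made up.

```r
# drill() is a hypothetical helper, not part of base R or any package.
# It applies `fun` to `x` grouped by each requested code prefix length.
drill <- function(x, codes, fun, digits = c(2, 4, 6)) {
  out <- lapply(digits, function(n) tapply(x, substr(codes, 1, n), fun))
  names(out) <- paste0("digits", digits)
  out
}

codes <- c("441110", "441222", "445110", "423110")  # illustrative codes
wages <- c(800, 600, 500, 900)                      # invented values

means <- drill(wages, codes, mean)                  # one component per level
means$digits2                                       # means by 2-digit prefix
means_top <- drill(wages, codes, mean, digits = 2)  # stop at the 2-digit level
```

Passing the prefix lengths as a vector plays the role of your maxlvl or maxdigits option, and it also lets you skip levels that the coding system does not use.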
I know this wasn't what you were looking for (as I said, I'm not aware
of any package that implements the functionality you describe). But
the existing facilities in R are quite flexible, and handling this
kind of data in R is already fairly straightforward.
Best,
Ista
> Second, suppose we have a data frame like d, only this time it's a time
> series (each row is a different date). Now we might want to generate vectors
> of the rate of change in employment at each industry level. It might look
> like:
>
> rate <- function(x) { (x - lag(x))/lag(x) }
> rates <- list()
> i <- 1
> for (j in levels(industry)) {
>     # The levels function parses the hierarchical factor into the
>     # various levels of its coding system
>     rates[[i]] <- rate(employment[,level(industry) == j])
>     # The level function sets a particular one of these levels
>     i <- i + 1
> }
>
> A third variant would be a genuinely recursive function that keeps on calling
> itself at each level of the factor until it has either reached a
> pre-specified depth or exhausted all levels of the factor.
>
> I hope this gives you a good idea of the sorts of things one might do with
> hierarchical factors.
>
> Marsh Feldman
>
>
>
> On 5/3/2010 9:57 AM, Ista Zahn wrote:
>
> Hi Marshall,
> What exactly do you mean by "handles this kind of data structure"?
> What do you want R to do?
>
> Best,
> Ista
>
> On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman <[email protected]> wrote:
>
>
> Hello,
>
> Hierarchical factors are a very common data structure. For instance, one
> might have municipalities within states within countries within
> continents. Other examples include occupational codes, biological
> species, software types (R within statistical software within analytical
> software), etc.
>
> Such data structures commonly use hierarchical coding systems. For
> example, the 2007 North American Industry Classification System (NAICS)
> <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007> has twenty
> two-digit codes (e.g., 42 = Wholesale trade), within each of these
> varying numbers of 3-digit codes (e.g., 423 = Merchant wholesalers,
> durable goods), then varying numbers of 4-digit codes (4231 = Motor
> Vehicle and Motor Vehicle Parts and Supplies Merchant Wholesalers), then
> varying numbers of five-digit codes, varying numbers of six-digit codes,
> etc. At the lowest level (longest code) one can readily tell all the
> higher levels. For example, 441222 is "Boat Dealers" who are part of
> 44122, "Motorcycle, Boat, and Other Motor Vehicle Dealers," which is
> part of 4412 (Other Motor Vehicle Dealers), which is part of 441 (Motor
> Vehicle and Parts Dealers), which is part of 44 (Retail Trade). (The US
> Census Bureau has extended the 6-digit NAICS to an even more
> fine-grained 10-digit system.)
>
> I haven't seen any R packages or sample code that handles this kind of
> data, but I don't want to reinvent the wheel and would rather stand on
> the shoulders of you giants. Is there any package or other R-based
> software out there that handles this kind of data structure?
>
> Thanks,
> Marsh Feldman
>
> ______________________________________________
> [email protected] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>
>
> --
> Dr. Marshall Feldman, PhD
> Director of Research and Academic Affairs
> Center for Urban Studies and Research
> The University of Rhode Island
> email: marsh @ uri .edu (remove spaces)
>
> Contact Information:
>
> Kingston:
>
> 202 Hart House
> Charles T. Schmidt Labor Research Center
> The University of Rhode Island
> 36 Upper College Road
> Kingston, RI 02881-0815
> tel. (401) 874-5953
> fax: (401) 874-5511
>
> Providence:
>
> 206E Shepard Building
> URI Feinstein Providence Campus
> 80 Washington Street
> Providence, RI 02903-1819
> tel. (401) 277-5218
> fax: (401) 277-5464
--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org