Re: [R] Hierarchical factors

David Winsemius Wed, 05 May 2010 20:30:34 -0700

I think you are perhaps unintentionally obscuring two issues. One is
whether R might have the statistical functions to deal with such an
arrangement, and here "mixed models" would be the phrase you ought to
be watching for, while the other would be whether it would have
pre-written data management functions that would directly support the
particular data layout you might be getting from public-access gov't
files. The second is what I _thought_ you were soliciting in your
original posting. I was a bit surprised that no one mentioned the
survey package, since I have seen it used in such situations, but I
cannot track down the citation at the moment. You might want to look
at Gelman's blogs:


http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html


See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1

And Damico's article:

"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data"

R Journal, 2009, n 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf

--
David.


On May 5, 2010, at 10:23 PM, Marshall Feldman wrote:

Thanks for sharing this, Ista.

I've come to the conclusion that R doesn't have what I'm looking for,
either in the base or the packages.

Although your examples are insightful, the examples we've been
discussing are deliberately easier than what one would expect in most
serious applications. Imagine for instance that we're studying wage
structures of industries in different geographic labor markets. We
therefore might have four variables: wages, industries, occupations, and
places. We might want to see if wage differentials are more or less
constant or if they are higher in some geographic areas than in others.
Since industries, occupations, and places are typically coded
hierarchically as we've been discussing, we might want to figure out how
to examine different wage levels within industries, etc. Doing this
manually would require lots of w
whereas conceptually  the

On 5/4/2010 6:00 AM,
Message: 49
Date: Mon, 3 May 2010 13:22:59 -0400
From: Ista Zahn <istaz...@gmail.com>
To: Marshall Feldman <ma...@uri.edu>
Cc: r-help@r-project.org
Subject: Re: [R] Hierarchical factors
Message-ID: <x2xf55e7cf51005031022se4c46967s174efeef95331...@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

Hi Marshall,

I'm not aware of any packages that implement these features as you
described them. But most of the tasks are already fairly easy in R --
see below.

On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <ma...@uri.edu> wrote:
Thanks for getting back so quickly Ista,

I was actually casting about for any examples of R software that
deals with this kind of structure. But your question is a good
one. Here are a few things I'd like to be able to do:

Store data in R at the finest level of detail but easily refer to
higher levels of aggregation. If the data include such higher
levels, this is trivial, but otherwise I'd like to aggregate
fairly easily. The following is not functioning code, but it
should give you the idea:

start with a data frame (call it d) having row.names = to the 6-digit
NAICS code and columns w/ various variables, assume one is named
employment.

  d[,"employment"]           # Would print all employment data
  d["441222","employment"]   # Would print only Boat Dealer employment
  d["44","employment"]       # Would print total employment for Retail Trade

  d[,"employment"]  # prints all employment data
  d[rownames(d) == "441222","employment"]  # prints only boat dealer employment
  sum(d[grep("^44", rownames(d)),"employment"])  # prints total employment for retail trade
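
As a minimal sketch of the prefix lookup discussed above: the data frame below is made up for illustration (only 441222 is a code named in this thread, and all employment figures are invented), and `naics_total` is a hypothetical helper name, not an existing R function.

```r
# Toy data: row names are 6-digit codes; employment values are invented.
d <- data.frame(
  employment = c(120, 80, 300, 50),
  row.names  = c("441222", "441221", "441110", "423110")
)

# Hypothetical helper: total of one column over all rows whose code
# begins with the given prefix (e.g. "44" or "4412").
naics_total <- function(d, prefix, column = "employment") {
  hit <- substr(rownames(d), 1, nchar(prefix)) == prefix
  sum(d[hit, column])
}

naics_total(d, "441222")  # one 6-digit industry
naics_total(d, "4412")    # a 4-digit industry group
naics_total(d, "44")      # the whole 2-digit sector
```

Comparing a fixed-length prefix with `substr()` avoids regular expressions, so codes never need escaping.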
Recursive nesting. I'm not sure how to convey this except with
examples. Suppose the data frame also has a "wages" column with
average weekly wages in the industry, and the industry code is
also a factor variable (industry). So a simple analysis of
variance might look like:

  w <- aov(wages ~ industry, d)

But now what I'd like to do is to break this down within
2-digit sectors. Assuming the data frame has another variable,
industry2, this would look like:

  w <- aov(wages ~ industry2/industry, d)

But what if we either (a) don't want to bother creating
separate variables for each level of aggregation in industry or
(b) want to extend the model formula language to include
various nesting strategies? This might look like:

  w <- aov(wages ~ industry//*)
  # Nest all meaningful levels:
  # industry/industry2/industry3/industry4/industry5/industry6.
  # If the coding system skips some levels, R is smart enough to omit
  # the skipped levels.

  w <- aov(wages ~ industry//levels 2,4,6)
  # I'm using "//" as a hypothetical extension to the model language
  # that is followed by a "levels" keyword and then a list of levels
  # within the hierarchy. This example would expand to
  # aov(wages ~ industry2/industry4/industry6)

One could extend this last example to include a notation
allowing the analysis to be repeated at varying levels of depth
(e.g., industry||2,6 would repeat the ANOVA for industry2 and
industry6).

I can see how that might be useful. But it is easy enough to split the
variables out, for example (assuming that each level consists of two
digits):

  d$ind1 <- substr(rownames(d), 1, 2)
  d$ind2 <- substr(rownames(d), 3, 4)
  d$ind3 <- substr(rownames(d), 5, 6)
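
A rough sketch of how derived grouping variables could feed a nested ANOVA with the existing `/` operator. Note the assumptions: the wage values are invented, there are several observations per industry so that residual degrees of freedom exist, and the derived columns use cumulative prefixes (digits 1-2, 1-4) rather than separate two-digit chunks, so that each coarser code is unique on its own.

```r
set.seed(1)
codes <- c("441110", "441222", "441221", "423110", "423120")  # illustrative codes
d <- data.frame(
  industry = factor(rep(codes, each = 4)),        # 4 invented observations each
  wages    = round(rnorm(20, mean = 800, sd = 50))
)

# Cumulative prefixes: the coarser code is literally the start of the finer one.
d$industry2 <- factor(substr(as.character(d$industry), 1, 2))
d$industry4 <- factor(substr(as.character(d$industry), 1, 4))

# 4-digit groups nested in 2-digit sectors, 6-digit industries nested in those.
w <- aov(wages ~ industry2/industry4/industry, data = d)
summary(w)
```

With cumulative prefixes the same columns can also be used one at a time as grouping variables without mixing industries from different sectors.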
Since the factor hierarchy is completely nested (i.e., every
6-digit industry is below a 5-digit industry), a single function
can operate on the codes recursively. Three variants come to
mind. In the first, we'd use some kind of apply function to drill
down to a certain level and return a list of results, one for
each level:

  means <- drill(wages,industry,mean)
  # Would return a list. The first component would be a vector of mean
  # wages for industries at the 2-digit level, the second, a vector for
  # the 3-digit level, etc.

  means <- drill(wages,industry,mean,maxlvl=3)
  # Would stop at the 3rd level of the hierarchy (4-digit code). One
  # could also imagine a maxdigits option as an alternative
  # (maxdigits = y means stop at the y-digit level)

Again, I can see how this would be useful, but it's already pretty
easy (once we have split out the grouping variables) to do something
like

grp.means<- list(
l1 = aggregate(d$wages, list(d$ind1), mean),
l2 = aggregate(d$wages, list(d$ind2), mean),
l3 = aggregate(d$wages, list(d$ind3), mean)
)
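
For what it is worth, a `drill()` along the lines described in this thread could be sketched in base R roughly as follows. This is only an illustration: `drill` and its `digits`/`maxdigits` arguments are hypothetical (no such function exists in base R or, as far as this thread establishes, in any package), and the wage values are invented.

```r
# Hypothetical drill(): apply `fun` to `x` grouped by successively longer
# prefixes of the hierarchical code. Returns a list, one element per level.
drill <- function(x, codes, fun, digits = 2:6, maxdigits = max(digits)) {
  codes  <- as.character(codes)
  digits <- digits[digits <= maxdigits]
  out <- lapply(digits, function(k) tapply(x, substr(codes, 1, k), fun))
  names(out) <- paste0("digits", digits)
  out
}

codes <- c("441110", "441222", "441221", "423110")   # illustrative codes
wages <- c(900, 700, 800, 1000)                      # invented values

means <- drill(wages, codes, mean, maxdigits = 4)
means$digits2   # mean wage by 2-digit sector
means$digits4   # mean wage by 4-digit industry group
```

Because the grouping is computed from the code string on the fly, no separate column per level has to be stored in the data frame.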
I know this wasn't what you were looking for (as I said, I'm not aware
of any package that implements the functionality you describe). But
the existing facilities in R are quite flexible, and handling this
kind of data in R is already fairly straightforward.

Best,
Ista

Second, suppose we have a data frame like d, only this time it's
a time series (each row is a different date). Now we might want
to generate vectors of the rate of change in employment at each
industry level. It might look like:

  rate <- function(x) { (x - lag(x)) / lag(x) }
  rates <- list()
  i <- 1
  for (j in levels(industry)) {
    # The levels function parses the hierarchical factor into the
    # various levels of its coding system
    rates[[i]] <- rate(employment[, level(industry) == j])
    # The level function sets a particular one of these levels
    i <- i + 1
  }
A third variant would be a genuinely recursive function that
keeps on calling itself at each level of the factor until it has
either reached a pre-specified depth or exhausted all levels of
the factor.
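
A recursive version of that idea could look something like the sketch below. It is an assumption-laden illustration, not an existing facility: `drill_rec` and its arguments are hypothetical names, the hierarchy is assumed to start at 2 digits and deepen one digit at a time, and the wage values are invented.

```r
# Hypothetical recursive variant: apply `fun` at the current prefix length,
# then call itself one digit deeper until `maxdigits` or the full code length.
drill_rec <- function(x, codes, fun, k = 2, maxdigits = max(nchar(codes))) {
  if (k > maxdigits || k > max(nchar(codes))) return(list())
  here <- tapply(x, substr(codes, 1, k), fun)
  c(list(here), drill_rec(x, codes, fun, k + 1, maxdigits))
}

codes <- c("441110", "441222", "441221", "423110")   # illustrative codes
wages <- c(900, 700, 800, 1000)                      # invented values

res <- drill_rec(wages, codes, mean, maxdigits = 3)
length(res)   # one element each for the 2- and 3-digit levels
res[[2]]      # mean wage by 3-digit subsector
```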
I hope this gives you a good idea of the sorts of things one
might do with hierarchical factors.
Marsh Feldman



On 5/3/2010 9:57 AM, Ista Zahn wrote:

Hi Marshall,
What exactly do you mean by "handles this kind of data structure"?
What do you want R to do?

Best,
Ista
On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman <ma...@uri.edu> wrote:
Hello,
Hierarchical factors are a very common data structure. For instance, one
might have municipalities within states within countries within
continents. Other examples include occupational codes, biological
species, software types (R within statistical software within analytical
software), etc.

Such data structures commonly use hierarchical coding systems. For
example, the 2007 North American Industry Classification System (NAICS)
<http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007> has twenty
two-digit codes (e.g., 42 = Wholesale trade), within each of these
varying numbers of 3-digit codes (e.g., 423 = Merchant wholesalers,
durable goods), then varying numbers of 4-digit codes (4231 = Motor
Vehicle and Motor Vehicle Parts and Supplies Merchant Wholesalers), then
varying numbers of five-digit codes, varying numbers of six-digit codes,
etc. At the lowest level (longest code) one can readily tell all the
higher levels. For example, 441222 is "Boat Dealers" who are part of
44122, "Motorcycle, Boat, and Other Motor Vehicle Dealers," which is
part of 4412 (Other Motor Vehicle Dealers), which is part of 441 (Motor
Vehicle and Parts Dealers), which is part of 44 (Retail Trade). (The US
Census Bureau has extended the 6-digit NAICS to an even more
fine-grained 10-digit system.)

I haven't seen any R packages or sample code that handles this kind of
data, but I don't want to reinvent the wheel and would rather stand on
the shoulders of you giants. Is there any package or other R-based
software out there that handles this kind of data structure?
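
To make the "longest code reveals all higher levels" property concrete, here is a small sketch. `ancestors` is a hypothetical helper name, and it assumes the top level has 2 digits and each deeper level adds exactly one digit, as in the NAICS example above.

```r
# Hypothetical helper: all higher-level codes implied by one hierarchical code.
ancestors <- function(code, top = 2) {
  code <- as.character(code)
  if (nchar(code) <= top) return(character(0))   # already at the top level
  substring(code, 1, top:(nchar(code) - 1))
}

ancestors("441222")   # "44" "441" "4412" "44122"
```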

Thanks,
Marsh Feldman







______________________________________________
R-help@r-project.org  mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.





--
Dr. Marshall Feldman, PhD
Director of Research and Academic Affairs
Center for Urban Studies and Research
The University of Rhode Island
email: marsh @ uri .edu (remove spaces)

Contact Information:

Kingston:

202 Hart House
Charles T. Schmidt Labor Research Center
The University of Rhode Island
36 Upper College Road
Kingston, RI 02881-0815
tel. (401) 874-5953:
fax: (401) 874-5511

Providence:

206E Shepard Building
URI Feinstein Providence Campus
80 Washington Street
Providence, RI 02903-1819
tel. (401) 277-5218
fax: (401) 277-5464
--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org


David Winsemius, MD
West Hartford, CT
