Re: [R] Hierarchical factors

David Winsemius Thu, 06 May 2010 05:24:13 -0700


On May 6, 2010, at 7:13 AM, Marshall Feldman wrote:

On 5/5/10 11:29 PM, David Winsemius wrote:
I think you are perhaps unintentionally obscuring two issues. One is whether R might have the statistical functions to deal with such an arrangement, and here "mixed models" would be the phrase you ought to be watching for, while the other would be whether it would have pre-written data management functions that would directly support the particular data layout you might be getting from public-access gov't files. The second is what I _thought_ you were soliciting in your original posting. I was a bit surprised that no one mentioned the survey package, since I have seen it used in such situations, but I cannot track down the citation at the moment. You might want to look at Gelman's blogs:
http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html

See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1

And Damico's article:
"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data"
R Journal, 2009, n 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
First, I apologize for my last, somewhat incoherent post. I was composing it late at night, grew too tired to think, and thought I left it open to finish this morning. Looks as if I should have quit about an hour earlier since apparently the garbled message went out anyway.
Dave, you're right, although I would describe my question as combining rather than obscuring two issues. My thinking is that first one would want the data structure (actually a data type or class). A set of functions could then handle conversion to factors, etc. that would allow easy use of most existing statistical functions. New statistical functions could then be designed, or old ones retrofitted, to handle the new data type internally. Eventually, it would be great to integrate it into the formula language.
The data type would have an inheritance pattern sort of like this: factor -> hierarchy -> specific system. By "specific system" I mean either a standard or user-defined coding system that extends the hierarchy class. For example, NAICS would be a data type and any variable in this class would be both hierarchical and map to the labels associated with the industry definitions. The hierarchy class would be what I was describing, with information on how to parse individual character strings at various levels of aggregation. Finally, although my idea would extend R's factor data type, strictly speaking this would not be inheritance. Real factors replicate and include labels in the storage associated with individual variables. Most hierarchical systems are very large, including hundreds of levels and long labels. So factors would usually be a very inefficient way to handle them. Imagine, for example, an application analyzing Internet routing or airline traffic, with each node on a route having a spatial hierarchical code (country.state.county.city) and a separate variable for each node. Ugh!
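The "hierarchy -> specific system" part of that pattern could be sketched with R's S3 class vector. This is only an illustration under stated assumptions: the constructor names `hierarchy()` and `spatial()`, the `sep` attribute, and the example codes are all hypothetical, and the object deliberately does not inherit from factor, in line with the remark that this is not true inheritance.

```r
# Hypothetical constructor: codes are kept as plain character strings,
# tagged with a separator and an S3 class. No labels are stored here.
hierarchy <- function(codes, sep = ".", subclass = NULL) {
  structure(as.character(codes),
            sep = sep,
            class = c(subclass, "hierarchy"))
}

# A hypothetical "specific system" that extends the generic hierarchy class.
spatial <- function(codes) hierarchy(codes, sep = ".", subclass = "spatial")

# Made-up example codes in country.state.county form.
x <- spatial(c("US.RI.Providence", "US.RI.Kent", "US.MA.Suffolk"))

inherits(x, "hierarchy")  # the specific system is also a hierarchy
inherits(x, "spatial")
```

A real implementation would also need print, subsetting, and comparison methods; none are shown here.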
Instead, my idea would be to use an approach similar to SAS's formats, where the labels are stored separately and the individual codes map through a few relatively simple algorithms. SAS, for example, maps codes to labels either 1:1 (a character representation of the code maps to a label) or by evaluating the code and mapping it according to a predefined range of values. SAS recently implemented a feature that allows 1:many mapping so that, for instance, an AGE variable could map simultaneously to "Adult" and "Senior Citizen." Some statistical procedures in SAS will now repeat the analysis for all the mappings, so a single call to describe a variable generates counts of both adults and seniors.
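The three kinds of mapping described above can be approximated in base R. This is a rough sketch and not a reimplementation of SAS formats: the lookup names (`state_fmt`, `multi_fmt`), the example data, and the age cutoffs of 18 and 65 are assumptions chosen only for illustration.

```r
# 1:1 mapping: labels live in one shared lookup, not on each variable.
state_fmt <- c(RI = "Rhode Island", MA = "Massachusetts")
codes <- c("RI", "MA", "RI")
state_labels <- unname(state_fmt[codes])

# Range mapping: evaluate the value against predefined breaks.
# Intervals are [0, 18) and [18, Inf) because right = FALSE.
age <- c(12, 30, 70)
age_fmt <- cut(age, breaks = c(0, 18, Inf),
               labels = c("Child", "Adult"), right = FALSE)

# 1:many mapping: each label is a predicate, so one value can
# satisfy several labels at once (assumed cutoffs: 18 and 65).
multi_fmt <- list(
  "Adult"          = function(a) a >= 18,
  "Senior Citizen" = function(a) a >= 65
)

# One pass over the mappings gives a count per label.
counts <- vapply(multi_fmt, function(f) sum(f(age)), integer(1))
counts
```

Because the lookups are ordinary objects, the same `state_fmt` can be reused for any number of variables without copying the labels into each one.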
While something similar to SAS formats would itself be a useful addition to R (and has been discussed before), my idea extends this by adding the ability to parse a hierarchical code at its various levels. This could then be integrated into appropriate statistical functions, or the analyst could write a function to deparse the code into its levels and then call the statistical function as needed. At a minimum, the hierarchy class would have to include an as.factor() function.
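Parsing a code at a chosen level and handing the result to existing functions might look like the sketch below. The helpers `truncate_code()` and `hierarchy_factor()` and the example codes are hypothetical. Note that `as.factor()` is not an S3 generic in base R, so this sketch uses an ordinary helper function instead of an `as.factor()` method.

```r
# Hypothetical helper: keep only the first `depth` components of each code.
truncate_code <- function(codes, depth, sep = ".") {
  parts <- strsplit(codes, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(head(p, depth), collapse = sep),
         character(1))
}

# Hypothetical helper: an ordinary factor at the requested level of
# aggregation, usable by any function that accepts factors.
hierarchy_factor <- function(codes, depth, sep = ".") {
  factor(truncate_code(codes, depth, sep))
}

# Made-up example codes in country.state.county form.
codes <- c("US.RI.Providence", "US.RI.Kent", "US.MA.Suffolk")

state <- hierarchy_factor(codes, depth = 2)
levels(state)
table(state)  # counts aggregated at the state level
```

The factor is built only when an analysis needs it, so the full codes can stay as compact character strings the rest of the time.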

I have seen statements that R and ROOT can be compiled together on the same machine. ROOT is an object-oriented database system developed at CERN (also where the WWW started) that supports hierarchical organization of data:


http://en.wikipedia.org/wiki/ROOT

The BioConductor "project" ought to be considered as a potential source of coding, and the geospatial interest group as well.


See for instance the xps package in BioC
http://bioconductor.org/packages/release/bioc/html/xps.html
http://www.iscb.org/uploaded/css/G04Stratowa.pdf

You might try corresponding with the xps author Christian Stratowa.

Given R's thousands of packages, I sent my post to find out if something like this already existed.
Thanks to everyone for your feedback. This list is great! The answer to my question is:
> answer <- little.red.hen(question)

Marsh Feldman


David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
