On May 6, 2010, at 7:13 AM, Marshall Feldman wrote:

On 5/5/10 11:29 PM, David Winsemius wrote:
I think you are perhaps unintentionally obscuring two issues. One is whether R might have the statistical functions to deal with such an arrangement, and here "mixed models" would be the phrase you ought to be watching for, while the other would be whether it would have pre-written data management functions that would directly support the particular data layout you might be getting from public-access gov't files. The second is what I _thought_ you were soliciting in your original posting. I was a bit surprised that no one mentioned the survey package, since I have seen it used in such situations, but I cannot track down the citation at the moment. You might want to look at Gelman's blogs:

http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html

See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1

And Damico's article:
"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis Techniques in Health Policy Data"
R Journal, 2009, n. 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf

First, I apologize for my last, somewhat incoherent post. I was composing it late at night, grew too tired to think, and thought I left it open to finish this morning. Looks as if I should have quit about an hour earlier since apparently the garbled message went out anyway.

Dave, you're right, although I would describe my question as combining rather than obscuring two issues. My thinking is that first one would want the data structure (actually a data type or class). A set of functions could then handle conversion to factors, etc. that would allow easy use of most existing statistical functions. New statistical functions could then be designed, or old ones retrofitted, to handle the new data type internally. Eventually, it would be great to integrate it into the formula language.

The data type would have an inheritance pattern sort of like this: factor -> hierarchy -> specific system. By "specific system" I mean either a standard or user-defined coding system that extends the hierarchy class. For example, NAICS would be a data type and any variable in this class would be both hierarchical and map to the labels associated with the industry definitions. The hierarchy class would be what I was describing, with information on how to parse individual character strings at various levels of aggregation. Finally, although my idea would extend R's factor data type, strictly speaking this would not be inheritance. Real factors replicate and include labels in the storage associated with individual variables. Most hierarchical systems are very large, including hundreds of levels and long labels. So factors would usually be a very inefficient way to handle them. Imagine, for example, an application analyzing Internet routing or airline traffic, with each node on a route having a spatial hierarchical code (country.state.county.city) and a separate variable for each node. Ugh!
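
A minimal sketch of the parsing step described above, assuming the dotted country.state.county.city layout from the example. The function name hier_level() and the sample codes are hypothetical, made up only for illustration; this is not an existing R facility.

```r
# Hypothetical helper: truncate dotted hierarchical codes to a given depth.
# Assumes every level is separated by `sep`; codes shorter than `depth`
# are returned unchanged.
hier_level <- function(codes, depth, sep = ".") {
  parts <- strsplit(codes, sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(p[seq_len(min(depth, length(p)))], collapse = sep),
         character(1))
}

# Made-up sample codes in country.state.county.city form
nodes <- c("US.RI.Providence.Providence",
           "US.RI.Kent.Warwick",
           "US.CT.Hartford.Hartford")

state_level <- hier_level(nodes, 2)  # aggregate to country.state
table(state_level)                   # counts per state-level code
```

The codes stay ordinary character strings, so any existing function that accepts a character vector or a factor can be called on the truncated result.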

Instead, my idea would be to use an approach similar to SAS's formats, where the labels are stored separately and the individual codes map through a few relatively simple algorithms. SAS, for example, maps codes to labels either 1:1 (a character representation of the code maps to a label) or by evaluating the code and mapping it according to a predefined range of values. SAS recently implemented a feature that allows 1:many mapping so that, for instance, an AGE variable could map simultaneously to "Adult" and "Senior Citizen." Some statistical procedures in SAS will now repeat the analysis for all the mappings, so a single call to describe a variable generates counts of both adults and seniors.
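
The three kinds of mapping can be roughly approximated in base R today. The sketch below is only an illustration under assumptions: the variable names, the age cutoffs (18 and 65), and the sample values are all made up, and nothing here reproduces SAS's actual format machinery.

```r
# 1:1 mapping: a named character vector serves as a lookup table that is
# stored once, separately from the data.
sex_fmt <- c("1" = "Male", "2" = "Female")
sex <- c(2, 1, 1, 2)
sex_lab <- unname(sex_fmt[as.character(sex)])

# Range mapping: cut() assigns each value to exactly one interval.
# With right = FALSE the intervals are [0, 18) and [18, Inf).
age <- c(15, 34, 67, 80)
age_grp <- cut(age, breaks = c(0, 18, Inf),
               labels = c("Minor", "Adult"), right = FALSE)

# 1:many mapping: each label gets its own (possibly overlapping) range,
# so one value can belong to several groups at once.
age_fmt <- list("Adult" = c(18, Inf), "Senior Citizen" = c(65, Inf))
membership <- sapply(age_fmt, function(r) age >= r[1] & age < r[2])
colSums(membership)  # one count per label
```

The membership matrix has one column per label, so an analysis can be repeated over its columns, which is loosely what the SAS procedures mentioned above do.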

While something similar to SAS formats would itself be a useful addition to R (and has been discussed before), my idea extends this by adding the ability to parse a hierarchical code at its various levels. This could then be integrated into appropriate statistical functions, or the analyst could write a function to deparse the code into its levels and then call the statistical function as needed. At a minimum, the hierarchy class would have to include an as.factor() function.
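
As a rough sketch of what such a class might look like with S3: the constructor hierarchy(), the helper hier_factor(), and the dotted sample codes (loosely modeled on NAICS sectors, not real NAICS syntax) are all hypothetical. The conversion is written as a separate helper because, as far as I know, base as.factor() is an ordinary function and does not dispatch on class.

```r
# Hypothetical constructor: codes remain plain character strings; the
# separator and a shared label table are kept as attributes instead of
# being copied into factor levels.
hierarchy <- function(codes, labels = character(0), sep = ".") {
  structure(as.character(codes), labels = labels, sep = sep,
            class = "hierarchy")
}

# Hypothetical conversion: build an ordinary factor at a chosen depth,
# using a label where one is defined and the truncated code otherwise.
hier_factor <- function(x, depth) {
  sep   <- attr(x, "sep")
  parts <- strsplit(as.character(unclass(x)), sep, fixed = TRUE)
  key   <- vapply(parts,
                  function(p) paste(p[seq_len(min(depth, length(p)))],
                                    collapse = sep),
                  character(1))
  lv  <- sort(unique(key))
  lab <- unname(attr(x, "labels")[lv])
  lab[is.na(lab)] <- lv[is.na(lab)]  # fall back to the code itself
  factor(key, levels = lv, labels = lab)
}

# Made-up dotted industry codes with labels for the top level only
industry <- hierarchy(c("31.311.3111", "31.312.3121", "52.522.5221"),
                      labels = c("31" = "Manufacturing",
                                 "52" = "Finance and Insurance"))

table(hier_factor(industry, 1))  # counts by top-level sector
levels(hier_factor(industry, 2)) # unlabeled levels fall back to codes
```

The factor is only created on demand at the level being analyzed, so the full set of labels never has to be attached to every variable.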


I have seen statements that R and ROOT can be compiled together on the same machine. ROOT is an object-oriented database system developed at CERN (also where the WWW started) that supports hierarchical organization of data:

http://en.wikipedia.org/wiki/ROOT

The BioConductor "project" ought to be considered as a potential source of coding, and the geospatial interest group as well.

See for instance the xps package in BioC
http://bioconductor.org/packages/release/bioc/html/xps.html
http://www.iscb.org/uploaded/css/G04Stratowa.pdf

You might try corresponding with the xps author Christian Stratowa.


Given R's thousands of packages, I sent my post to find out if something like this already existed.

Thanks to everyone for your feedback. This list is great! The answer to my question is:

> answer <- little.red.hen(question)

Marsh Feldman

David Winsemius, MD
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
