On May 6, 2010, at 7:13 AM, Marshall Feldman wrote:
On 5/5/10 11:29 PM, David Winsemius wrote:
I think you are perhaps unintentionally obscuring two issues. One
is whether R might have the statistical functions to deal with such
an arrangement, and here "mixed models" would be the phrase you
ought to be watching for, while the other would be whether it would
have pre-written data management functions that would directly
support the particular data layout you might be getting from public-
access gov't files. The second is what I _thought_ you were
soliciting in your original posting. I was a bit surprised that no
one mentioned the survey package, since I have seen it used in such
situations, but I cannot track down the citation at the moment.
You might want to look at Gelman's blogs:
http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html
See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1
And Damico's article:
"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis
Techniques in Health Policy Data"
R Journal, 2009, n 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
First, I apologize for my last, somewhat incoherent post. I was
composing it late at night, grew too tired to think, and thought I
left it open to finish this morning. Looks as if I should have quit
about an hour earlier since apparently the garbled message went out
anyway.
Dave, you're right, although I would describe my question as
combining rather than obscuring two issues. My thinking is that
first one would want the data structure (actually a data type or
class). A set of functions could then handle conversion to factors,
etc. that would allow easy use of most existing statistical
functions. New statistical functions could then be designed, or old
ones retrofitted, to handle the new data type internally.
Eventually, it would be great to integrate it into the formula
language.
The data type would have an inheritance pattern sort of like this:
factor -> hierarchy -> specific system. By "specific system" I mean
either a standard or user-defined coding system that extends the
hierarchy class. For example, NAICS would be a data type and any
variable in this class would be both hierarchical and map to the
labels associated with the industry definitions. The hierarchy class
would be what I was describing, with information on how to parse
individual character strings at various levels of aggregation.
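As a rough illustration of the idea, here is a minimal sketch using S3 classes. Everything in it is an assumption made for the example: the names hierarchy(), naics(), code_at_level() and the "widths" attribute are hypothetical, not part of base R or any package I know of. The level widths follow the usual NAICS layout (2-digit sector, 3-digit subsector, and so on up to 6 digits), and the labels that a real NAICS type would also carry are left out.

```r
# Hypothetical constructor for a generic hierarchical code type.
# 'widths' gives the number of characters each level adds to the code.
hierarchy <- function(codes, widths, subclass = NULL) {
  structure(as.character(codes),
            widths = widths,
            class = c(subclass, "hierarchy"))
}

# Truncate every code to the prefix that identifies the requested level.
code_at_level <- function(x, level) {
  n <- cumsum(attr(x, "widths"))[level]
  substr(as.vector(unclass(x)), 1, n)
}

# A "specific system" that extends the hierarchy class.
naics <- function(codes) {
  hierarchy(codes, widths = c(2, 1, 1, 1, 1), subclass = "naics")
}

x <- naics(c("311111", "311119", "312111", "441110"))

class(x)                    # "naics" "hierarchy"
code_at_level(x, 1)         # "31" "31" "31" "44"
table(code_at_level(x, 2))  # counts by 3-digit subsector
```

The parsing rule lives in the class rather than in each variable, so a different coding system would only need its own widths.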
Finally, although my idea would extend R's factor data type,
strictly speaking this would not be inheritance. Real factors
replicate and include labels in the storage associated with
individual variables. Most hierarchical systems are very large,
including hundreds of levels and long labels. So factors would
usually be a very inefficient way to handle them. Imagine, for
example, an application analyzing Internet routing or airline
traffic, with each node on a route having a spatial hierarchical
code (country.state.county.city) and a separate variable for each
node. Ugh!
Instead, my idea would be to use an approach similar to SAS's
formats, where the labels are stored separately and the individual
codes map through a few relatively simple algorithms. SAS, for
example, maps codes to labels either 1:1 (a character representation
of the code maps to a label) or by evaluating the code and mapping
it according to a predefined range of values. SAS recently
implemented a feature that allows 1:many mapping so that, for
instance, an AGE variable could simultaneously map to "Adult"
and "Senior Citizen." Some statistical procedures in SAS will now
repeat the analysis for all the mappings, so a single call to
describe a variable generates counts of both adults and seniors.
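The three kinds of mapping can be approximated in plain R, which may make the comparison more concrete. This is only a sketch with made-up data; it is not how SAS implements formats, and the object names (state_labels, age_multi, and so on) are invented for the example.

```r
# 1:1 mapping: labels are stored once, apart from the data,
# and looked up by code.
state_labels <- c(CT = "Connecticut", RI = "Rhode Island")
state <- c("CT", "RI", "CT")
unname(state_labels[state])

# Range mapping: each value falls into exactly one interval.
age <- c(12, 35, 70)
age_group <- cut(age,
                 breaks = c(0, 18, 65, Inf),
                 labels = c("Under 18", "18-64", "65+"),
                 right  = FALSE)
as.character(age_group)

# 1:many mapping: overlapping ranges, so one value can carry several labels.
age_multi <- list("Adult" = c(18, Inf), "Senior Citizen" = c(65, Inf))
counts <- sapply(age_multi, function(r) sum(age >= r[1] & age < r[2]))
counts  # the 70-year-old is counted under both labels
```

In each case the data vector holds only the codes, and the labels sit in a separate object that any number of variables could share.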
While something similar to SAS formats would itself be a useful
addition to R (and has been discussed before), my idea extends this
by adding the ability to parse a hierarchical code at its various
levels. This could then be integrated into appropriate statistical
functions, or the analyst could write a function to deparse the code
into its levels and then call the statistical function as needed. At
a minimum, the hierarchy class would have to include an as.factor()
function.
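A sketch of the "deparse, then call the statistical function" route follows. Note that as.factor() in base R is an ordinary function rather than an S3 generic, so the example uses a plain helper instead of a method. The helper name hierarchy_as_factor(), the fixed-width country/state/county layout, and all the data are hypothetical.

```r
# Hypothetical helper: collapse hierarchical codes to one level and
# return an ordinary factor that existing functions can use.
hierarchy_as_factor <- function(codes, widths, level) {
  n <- cumsum(widths)[level]
  factor(substr(as.character(codes), 1, n))
}

# Made-up codes: 2 characters for country, 2 for state, 3 for county.
geo <- c("USRI007", "USRI009", "USCT003", "CAON001")
y   <- c(10, 20, 30, 40)

f_state <- hierarchy_as_factor(geo, widths = c(2, 2, 3), level = 2)
levels(f_state)          # "CAON" "USCT" "USRI"
tapply(y, f_state, mean) # group means at the state level

f_country <- hierarchy_as_factor(geo, widths = c(2, 2, 3), level = 1)
table(f_country)         # the same variable, aggregated one level up
```

The analyst picks the level of aggregation at call time, and the stored variable stays a single compact code.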
I have seen statements that R and ROOT can be compiled together on the
same machine. ROOT is an object-oriented database system developed at
CERN (also where the WWW started) that supports hierarchical
organization of data:
http://en.wikipedia.org/wiki/ROOT
The BioConductor "project" ought to be considered as a potential
source of coding, and the geospatial interest group as well.
See for instance the xps package in BioC
http://bioconductor.org/packages/release/bioc/html/xps.html
http://www.iscb.org/uploaded/css/G04Stratowa.pdf
You might try corresponding with the xps author Christian Stratowa.
Given R's thousands of packages, I sent my post to find out if
something like this already existed.
Thanks to everyone for your feedback. This list is great! The answer
to my question is:
> answer <- little.red.hen(question)
Marsh Feldman
David Winsemius, MD
West Hartford, CT
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.