On Sun, Jan 31, 2010 at 10:24 PM, Anton du Toit <atdutoitrh...@gmail.com> wrote:
> Dear R-helpers,
>
> I’m writing for advice on whether I should use R or a different package or
> language. I’ve looked through the R-help archives, some manuals, and some
> other sites as well, and I haven’t done too well finding relevant info,
> hence my question here.
>
> I’m working with hierarchical data (in SPSS lingo). That is, for each case
> (person) I read in three types of (medical) record:
>
> 1. demographic data: name, age, sex, address, etc
>
> 2. ‘admissions’ data: this generally repeats, so I will have 20 or so
> variables relating to their first hospital admission, then the same 20 again
> for their second admission, and so on
>
> 3. ‘collections’ data, about 100 variables containing the results of a
> battery of standard tests. These are administered at intervals and so this
> is repeating data as well.
>
> The number of repetitions varies between cases, so in its one case per line
> format the data is non-rectangular.
>
> At present I have shoehorned all of this into SPSS, with each case on one
> line. My test database has 2,500 variables and 1,500 cases (or persons), and
> in SPSS’s *.SAV format is ~4MB. The one I finally work with will be larger
> again, though likely within one order of magnitude. Down the track, funding
> permitting, I hope to be working with tens of thousands of cases.

Although this may not be helpful for your immediate goal, storing and
manipulating data of this size and complexity (and, I expect, cost of
collection) really calls for tools like relational databases.

A single flat file of 2500 variables by 1500 cases is almost never the
best way to organize such data. A normalized representation as a
collection of interlinked tables in a relational database is much more
effective and less error-prone. The widespread use of spreadsheets,
SPSS data sets or SAS data sets, which encourage the "single table with
a gargantuan number of columns, most of which are missing data in most
cases" approach to organizing longitudinal data, is regrettable.

For later analysis in R it is better to start with the "long" form of
the data, as opposed to the "wide" form, even if it means repeating
demographic information over several occasions. Using a relational
database allows a long view to be generated without the possibility of
inconsistency in the demographics, because each person's demographics
are stored once and joined to the repeated records on demand.

I am using the descriptions "long" and "wide" in the sense in which
they are used in the reshape help page; see ?reshape in R. The long
view is also called the subject/occasion view, in the sense that each
row corresponds to one subject on one occasion.

Robert Gentleman's book "R Programming for Bioinformatics" provides
background on linking R to relational databases.

As I said at the beginning, you may not want to undertake the necessary
study and effort to reorganize your data for this specific project, but
if you do this kind of work often it is worth considering.

> I am wondering if I should keep using SPSS, or try something else.
>
> The types of analysis I’ll typically have to do will involve comparing
> measurements at different times, e.g. before/ after treatment. I’ll also
> need to compare groups of people, e.g. treatment / no treatment. Regression
> and factor analyses will doubtless come into it at some point too.
>
> So:
>
> 1. should I use R or try something else?
>
> 2. can anyone advise me on using R with the type of data I’ve described?
>
>
> Many thanks,
>
> Anton du Toit
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
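
To make the "wide" versus "long" distinction concrete, here is a minimal sketch in base R. All data frame names, column names and values (demog, wide, los_days and so on) are invented for illustration and are not taken from the actual data. It assumes the repeated columns follow a "name.occasion" naming pattern, which a real SPSS export may not.

```r
# Hypothetical demographics table: one row per person (invented values).
demog <- data.frame(
  id  = c(1, 2, 3),
  sex = c("F", "M", "F"),
  age = c(54, 61, 47)
)

# Hypothetical "wide" admissions data: one row per person, one column per
# admission. NA marks admissions that never happened, which is why the
# one-case-per-line layout ends up mostly empty.
wide <- data.frame(
  id         = c(1, 2, 3),
  los_days.1 = c(4, 2, 10),
  los_days.2 = c(7, NA, 3),
  los_days.3 = c(NA, NA, 5)
)

# Convert to "long" form: one row per person per admission (see ?reshape).
admissions <- reshape(
  wide,
  direction = "long",
  idvar     = "id",
  varying   = c("los_days.1", "los_days.2", "los_days.3"),
  v.names   = "los_days",
  timevar   = "admission",
  times     = 1:3
)

# Drop the occasions that did not occur and sort by person and occasion.
admissions <- admissions[!is.na(admissions$los_days), ]
admissions <- admissions[order(admissions$id, admissions$admission), ]

# The subject/occasion view: demographics are repeated by the join rather
# than being typed in again for every admission.
long <- merge(demog, admissions, by = "id")
print(long)
```

Keeping demog and admissions as separate tables and joining them only when needed mirrors what a relational database would do; the same join could equally be done in SQL before the data reach R.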