Hi all,

In short:

I'm running ddply on an admittedly (somewhat) large data.frame (not *that* large). It runs fine until it finishes and gets to the "collating" part, where all the subsets of my data.frame have been summarized and are being reassembled into the final summary data.frame (sorry, I don't know the correct plyr terminology). During collation, my R workspace RAM usage goes from about 1.5 GB up to 20 GB, at which point I kill it. Running a similar piece of code that iterates manually without ddply, using a combination of lapply and do.call(rbind, ...), uses considerably less RAM (it tops out at about 8 GB).

How can I use ddply more efficiently?

Longer, here's more info:

* The data.frame itself is ~15.8 MB when loaded.
* ~400,000 rows, 8 columns

It looks like so:

   exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
3        5659        152                38       utr      1 WASH5P     WASH5P chr1
4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
12       4693        140                36    intron      2 WASH5P     WASH5P chr1
13       4902        757                36    intron      1 WASH5P     WASH5P chr1
14       5811        659               144    intron     47 WASH5P     WASH5P chr1
15       6629         92                21    intron      1 WASH5P     WASH5P chr1
16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
17       7232        237                35    intron      2 WASH5P     WASH5P chr1
18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
...

I'm "ply"-ing over the "transcript" column, and the function transforms each such subset of the data.frame into a new data.frame with just 1 row per transcript, which basically contains the sum of the "counts" for that transcript.

The code looks something like this (`summaries` is the data.frame I'm referring to):

rpkm <- ddply(summaries, .(transcript), function(df) {
  data.frame(symbol=df$symbol[1], counts=sum(df$counts))
})

(It actually calculates 2 more columns that are returned in the data.frame, but I'm not sure that's really important here.)

To test some things out, I've written another function to manually iterate over / create the subsets of my data.frame to summarize. I'm using sqldf to dump the data.frame into a db, then I lapply over subsets of the db (`where transcript = x`) to summarize each subset of my data into a list of single-row data.frames (like ddply is doing), and finish with a `do.call(rbind, the.dfs)` on this list. This returns the exact same result ddply would return, and by the time `do.call` finishes, my RAM usage hits about 8 GB.

So, what am I doing wrong with ddply that makes the RAM usage in the last step ("collation" -- the equivalent of my final `do.call(rbind, my.dfs)`) end up more than 12 GB higher?

Thanks,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
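P.S. In case it helps to see what I'm comparing against, here's roughly the shape of the manual version. This is a sketch written against RSQLite/DBI directly rather than my actual sqldf wrapper code, and the db file name, table name, and `rpkm.manual` variable are just placeholders:

library(DBI)
library(RSQLite)

## dump the data.frame into an on-disk SQLite db
## (db file and table names here are only placeholders)
con <- dbConnect(SQLite(), dbname = "summaries.db")
dbWriteTable(con, "summaries", summaries)

transcripts <- unique(as.character(summaries$transcript))

## one single-row data.frame per transcript, like the function I hand to ddply
the.dfs <- lapply(transcripts, function(x) {
  df <- dbGetQuery(con,
    sprintf("SELECT * FROM summaries WHERE transcript = '%s'", x))
  data.frame(transcript = x,
             symbol = df$symbol[1],
             counts = sum(df$counts))
})

## reassemble into one data.frame -- RAM tops out around 8 GB at this step
rpkm.manual <- do.call(rbind, the.dfs)
dbDisconnect(con)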
I'm running ddply on an admittedly (somehow) large data.frame (not that large). It runs fine until it finishes and gets to the "collating" part where all subsets of my data.frame have been summarized and they are being reassembled into the final summary data.frame (sorry, don't know the correct plyr terminology). During collation, my R workspace RAM usage goes from about 1.5 GB upto 20GB until I kill it. Running a similar piece of code that iterates manually w/o ddply by using a combo of lapply and a do.call(rbind, ...) uses considerably less ram (tops out at about 8GB). How can I use ddply more efficiently? Longer: Here's more info: * The data.frame itself ~ 15.8 MB when loaded. * ~ 400,000 rows, 8 columns It looks like so: exon.start exon.width exon.width.unique exon.anno counts symbol transcript chr 1 4225 468 0 utr 0 WASH5P WASH5P chr1 2 4833 69 0 utr 1 WASH5P WASH5P chr1 3 5659 152 38 utr 1 WASH5P WASH5P chr1 4 6470 159 0 utr 0 WASH5P WASH5P chr1 5 6721 198 0 utr 0 WASH5P WASH5P chr1 6 7096 136 0 utr 0 WASH5P WASH5P chr1 7 7469 137 0 utr 0 WASH5P WASH5P chr1 8 7778 147 0 utr 0 WASH5P WASH5P chr1 9 8131 99 0 utr 0 WASH5P WASH5P chr1 10 14601 154 0 utr 0 WASH5P WASH5P chr1 11 19184 50 0 utr 0 WASH5P WASH5P chr1 12 4693 140 36 intron 2 WASH5P WASH5P chr1 13 4902 757 36 intron 1 WASH5P WASH5P chr1 14 5811 659 144 intron 47 WASH5P WASH5P chr1 15 6629 92 21 intron 1 WASH5P WASH5P chr1 16 6919 177 0 intron 0 WASH5P WASH5P chr1 17 7232 237 35 intron 2 WASH5P WASH5P chr1 18 7606 172 0 intron 0 WASH5P WASH5P chr1 19 7925 206 0 intron 0 WASH5P WASH5P chr1 20 8230 6371 109 intron 67 WASH5P WASH5P chr1 21 14755 4429 55 intron 12 WASH5P WASH5P chr1 ... I'm "ply"-ing over the "transcript" column and the function transforms each such subset of the data.frame into a new data.frame that is just 1 row / transcript that basically has the sum of the "counts" for each transcript. The code would look something like this (`summaries` is the data.frame I'm referring to): rpkm <- ddply(summaries, .(transcript), function(df) { data.frame(symbol=df$symbol[1], counts=sum(df$counts)) } (It actually calculates 2 more columns that are returned in the data.frame, but I'm not sure that's really important here). To test some things out, I've written another function to manually iterate/create subsets of my data.frame to summarize. I'm using sqldf to dump the data.frame into a db, then I lapply over subsets of the db `where transcript=x` to summarize each subset of my data into a list of single-row data.frames (like ddply is doing), and finish with a `do.call(rbind, the.dfs)` o nthis list. This returns the same exact result ddply would return, and by the time `do.call` finishes, my RAM usage hits about 8gb. So, what am I doing wrong with ddply that makes the difference ram usage in the last step ("collation" -- the equivalent of my final `do.call(rbind, my.dfs)` be more than 12GB? Thanks, -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.