I don't know about that, but try this:

    install.packages("data.table", repos="http://R-Forge.R-project.org")
    require(data.table)
    summaries = data.table(summaries)
    summaries[, sum(counts), by=symbol]
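If you want one row per transcript with the symbol carried along (as in your ddply call), a variation on the same idea might work -- an untested sketch, assuming each transcript maps to a single symbol:

    rpkm <- summaries[, list(symbol = symbol[1], counts = sum(counts)), by = transcript]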
Please let us know if that returns the correct result, and if its memory/speed is OK?

Matthew

"Steve Lianoglou" <mailinglist.honey...@gmail.com> wrote in message
news:w2kbbdc7ed01004290606lc425e47cs95b36f6bf0a...@mail.gmail.com...
> Hi all,
>
> In short:
>
> I'm running ddply on an admittedly (somehow) large data.frame (not that
> large). It runs fine until it finishes and gets to the "collating" part,
> where all subsets of my data.frame have been summarized and are being
> reassembled into the final summary data.frame (sorry, I don't know the
> correct plyr terminology). During collation, my R workspace RAM usage
> goes from about 1.5 GB up to 20 GB, at which point I kill it.
>
> Running a similar piece of code that iterates manually without ddply,
> using a combination of lapply and do.call(rbind, ...), uses considerably
> less RAM (it tops out at about 8 GB).
>
> How can I use ddply more efficiently?
>
> Longer:
>
> Here's more info:
>
> * The data.frame itself is ~ 15.8 MB when loaded.
> * ~ 400,000 rows, 8 columns
>
> It looks like so:
>
>    exon.start exon.width exon.width.unique exon.anno counts symbol transcript  chr
> 1        4225        468                 0       utr      0 WASH5P     WASH5P chr1
> 2        4833         69                 0       utr      1 WASH5P     WASH5P chr1
> 3        5659        152                38       utr      1 WASH5P     WASH5P chr1
> 4        6470        159                 0       utr      0 WASH5P     WASH5P chr1
> 5        6721        198                 0       utr      0 WASH5P     WASH5P chr1
> 6        7096        136                 0       utr      0 WASH5P     WASH5P chr1
> 7        7469        137                 0       utr      0 WASH5P     WASH5P chr1
> 8        7778        147                 0       utr      0 WASH5P     WASH5P chr1
> 9        8131         99                 0       utr      0 WASH5P     WASH5P chr1
> 10      14601        154                 0       utr      0 WASH5P     WASH5P chr1
> 11      19184         50                 0       utr      0 WASH5P     WASH5P chr1
> 12       4693        140                36    intron      2 WASH5P     WASH5P chr1
> 13       4902        757                36    intron      1 WASH5P     WASH5P chr1
> 14       5811        659               144    intron     47 WASH5P     WASH5P chr1
> 15       6629         92                21    intron      1 WASH5P     WASH5P chr1
> 16       6919        177                 0    intron      0 WASH5P     WASH5P chr1
> 17       7232        237                35    intron      2 WASH5P     WASH5P chr1
> 18       7606        172                 0    intron      0 WASH5P     WASH5P chr1
> 19       7925        206                 0    intron      0 WASH5P     WASH5P chr1
> 20       8230       6371               109    intron     67 WASH5P     WASH5P chr1
> 21      14755       4429                55    intron     12 WASH5P     WASH5P chr1
> ...
>
> I'm "ply"-ing over the "transcript" column, and the function transforms
> each such subset of the data.frame into a new data.frame that is just
> one row per transcript, which basically holds the sum of the "counts"
> for that transcript.
>
> The code would look something like this (`summaries` is the data.frame
> I'm referring to):
>
> rpkm <- ddply(summaries, .(transcript), function(df) {
>   data.frame(symbol=df$symbol[1], counts=sum(df$counts))
> })
>
> (It actually calculates 2 more columns that are returned in the
> data.frame, but I'm not sure that's really important here.)
>
> To test some things out, I've written another function to manually
> iterate over / create subsets of my data.frame to summarize.
>
> I'm using sqldf to dump the data.frame into a db, then I lapply over
> subsets of the db `where transcript=x` to summarize each subset of my
> data into a list of single-row data.frames (like ddply is doing), and
> finish with a `do.call(rbind, the.dfs)` on this list.
>
> This returns the exact same result ddply would return, and by the time
> `do.call` finishes, my RAM usage hits about 8 GB.
>
> So, what am I doing wrong with ddply that makes the difference in RAM
> usage in the last step ("collation" -- the equivalent of my final
> `do.call(rbind, my.dfs)`) more than 12 GB?
>
> Thanks,
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>   | Memorial Sloan-Kettering Cancer Center
>   | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
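For comparison, the manual split/apply/combine route described in the quoted message could be written without sqldf roughly like this (an untested sketch; split() stands in for the per-transcript database queries, and `summaries` is the data.frame shown above):

    ## untested sketch: split by transcript, summarize each piece,
    ## then bind the one-row data.frames back together
    pieces  <- split(summaries, summaries$transcript)
    the.dfs <- lapply(pieces, function(df) {
        data.frame(symbol = df$symbol[1], counts = sum(df$counts))
    })
    rpkm.manual <- do.call(rbind, the.dfs)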