Hi Bert,

This is the best method I have seen this year! do.call and rbind have just gone to the museum :)
It took ~30 seconds to get the results. You deserve a medal!!!!

Jun

On Fri, Sep 2, 2016 at 1:51 PM, Bert Gunter <bgunter.4...@gmail.com> wrote:

> This is the sort of thing that the dplyr or data.table packages can
> probably do elegantly and efficiently, so you might consider looking
> at them. But as I use neither, let me suggest a base R solution. As
> you supplied no data for a reproducible example, I'll make up my own,
> and hopefully I have understood you correctly. If not, maybe someone
> else will get it straight. Anyway...
>
> The "trick" is to use tapply() to select the necessary row indices of
> your data frame and forget about all the do.call and rbind stuff, e.g.
>
> > set.seed(1001)
> > df <- data.frame(f = factor(sample(LETTERS[1:4],100,rep=TRUE)),
> +                  g = factor(sample(letters[1:6],100,rep=TRUE)),
> +                  y = runif(100))
> >
> > ix <- seq_len(nrow(df))
> >
> > ## for each (f, g) combination, keep the index of its last row
> > ix <- with(df,tapply(ix,list(f,g),function(x)x[length(x)]))
> > ix
>    a  b   c  d  e  f
> A 94 69 100 59 80 87
> B 89 57  65 90 75 88
> C 85 92  86 95 97 62
> D 47 73  72 74 99 96
>
> ## ix can now be used as an index into df as:
> df[ix,]
>
> This should help somewhat, but you still have to contend with the
> tapply() loop at the interpreted level. I'll leave speed comparisons
> to you.
>
> Cheers,
> Bert
>
> ## Note: if, in fact, your data frame is arranged in a regular way
> with, e.g., your SID, DOSENO groups all of the same size and together,
> then you can calculate the indices you want directly and skip the
> tapply business. I'm assuming this is not the case... Again, no data...
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
>
> On Fri, Sep 2, 2016 at 10:02 AM, Jun Shen <jun.shen...@gmail.com> wrote:
> > Dear list,
> >
> > I have the following line of code to extract the last line of the split
> > data and put them back together.
> >
> > do.call(rbind,lapply(split(simout.s1,simout.s1[c('SID','DOSENO')]),function(x)x[nrow(x),]))
> >
> > The problem is that when I have a huge dataset, it takes too long to run
> > (actually it's been > 3 hours and it's still running).
> >
> > The dataset is pretty big. I have 200,000 unique SID and 4 DOSENO, so
> > 800,000 split datasets in total. Is there any way to speed it up? Thanks.
> >
> > Jun
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
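
For anyone finding this thread later, here is a minimal sketch of how the tapply() index approach might be applied to data shaped like the original question, checked against the do.call/rbind version. Only the names simout.s1, SID and DOSENO come from the thread; the data set itself, its size, and the TIME and CONC columns are made up for illustration, so treat this as a sketch under those assumptions and not as the code that was actually run.

```r
## Hypothetical stand-in for simout.s1: 200 SIDs x 4 DOSENOs x 5 rows each.
## TIME and CONC are invented columns; only SID and DOSENO are from the thread.
set.seed(42)
n_sid <- 200
simout.s1 <- data.frame(
  SID    = rep(seq_len(n_sid), each = 4 * 5),
  DOSENO = rep(rep(1:4, each = 5), times = n_sid),
  TIME   = rep(1:5, times = n_sid * 4),
  CONC   = runif(n_sid * 4 * 5)
)

## Original approach: build one data frame per group, then rbind them all.
last_split <- do.call(rbind,
  lapply(split(simout.s1, simout.s1[c("SID", "DOSENO")]),
         function(x) x[nrow(x), ]))

## tapply() approach: only row indices are split, never the data frame.
ix <- seq_len(nrow(simout.s1))
ix <- with(simout.s1,
           tapply(ix, list(SID, DOSENO), function(x) x[length(x)]))
ix <- as.vector(ix)    # flatten the SID x DOSENO matrix column by column
ix <- ix[!is.na(ix)]   # SID/DOSENO combinations with no rows come back as NA
last_tapply <- simout.s1[ix, ]

## Compare the two results, ignoring row names.
rownames(last_split)  <- NULL
rownames(last_tapply) <- NULL
stopifnot(isTRUE(all.equal(last_split, last_tapply)))

## Regular layout (the case in Bert's note): if every group is contiguous
## and has exactly k rows, the last-row indices can be computed directly.
k <- 5
ix_direct <- seq(k, nrow(simout.s1), by = k)
stopifnot(setequal(ix_direct, ix))
```

Dropping the NA entries matters on real data, where some SID/DOSENO combinations may not occur. The direct seq() calculation only applies when the groups really are contiguous and of equal size, which holds for this made-up data but may not hold for yours.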