Here's some more timing's of Bill's function. Although in this example sapply has a clear performance advantage for smaller numbers of groups (k) , gm is substantially faster for k >> 1000:
gm <- function(x, group){ # medians by group: o<-order(group, x) group <- group[o] x <- x[o] changes <- group[-1] != group[-length(group)] first <- which(c(TRUE, changes)) last <- which(c(changes, TRUE)) lowerMedian <- x[floor((first+last)/2)] upperMedian <- x[ceiling((first+last)/2)] median <- (lowerMedian+upperMedian)/2 names(median) <- group[first] median } ## for(k in 10^(1:6)){ group<-sample(1:k, size=100000, replace=TRUE) x<-rnorm(length(group))*10 + group cat('number of groups:', k, '\n') cat('sapply\n') print(s <- unix.time(sapply(split(x,group), median))) cat('gm\n') print(g <- unix.time(-gm(x,group))) cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n') } number of groups: 10 sapply user system elapsed 0.01 0.00 0.01 gm user system elapsed 0.14 0.00 0.16 sapply/gm user ratio: 0.07142857 number of groups: 100 sapply user system elapsed 0.02 0.00 0.02 gm user system elapsed 0.11 0.00 0.09 sapply/gm user ratio: 0.1818182 number of groups: 1000 sapply user system elapsed 0.11 0.00 0.11 gm user system elapsed 0.11 0.00 0.11 sapply/gm user ratio: 1 number of groups: 10000 sapply user system elapsed 1.00 0.00 1.04 gm user system elapsed 0.10 0.00 0.09 sapply/gm user ratio: 10 number of groups: 1e+05 sapply user system elapsed 6.24 0.01 6.31 gm user system elapsed 0.16 0.00 0.17 sapply/gm user ratio: 39 number of groups: 1e+06 sapply user system elapsed 9.00 0.03 8.92 gm user system elapsed 0.15 0.00 0.20 sapply/gm user ratio: 60 > R.Version() $platform [1] "i386-pc-mingw32" $arch [1] "i386" $os [1] "mingw32" $system [1] "i386, mingw32" $status [1] "" $major [1] "2" $minor [1] "8.0" $year [1] "2008" $month [1] "10" $day [1] "20" $`svn rev` [1] "46754" $language [1] "R" $version.string [1] "R version 2.8.0 (2008-10-20)" Kingsford Jones On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap <wdun...@tibco.com> wrote: > Arg, the 'sapply(...)' in the function was in the initial > comment, > gm <- function(x, group){ # medians by group: > sapply(split(x,group),median) > but someone's mailer put a newline before the sapply > gm <- function(x, group){ # medians by group: > sapply(split(x,group),median) > so it got executed, wasting all that time. (Of course the > same mailer will mess up this message in the same way - > just delete the sapply() call from gm().) > > Bill Dunlap > TIBCO Software Inc - Spotfire Division > wdunlap tibco.com > >> -----Original Message----- >> From: hadley wickham [mailto:h.wick...@gmail.com] >> Sent: Monday, January 05, 2009 9:10 AM >> To: William Dunlap >> Cc: gallon...@gmail.com; R help >> Subject: Re: [R] the first and last observation for each subject >> >> > Another application of that technique can be used to quickly compute >> > medians by groups: >> > >> > gm <- function(x, group){ # medians by group: >> > sapply(split(x,group),median) >> > o<-order(group, x) >> > group <- group[o] >> > x <- x[o] >> > changes <- group[-1] != group[-length(group)] >> > first <- which(c(TRUE, changes)) >> > last <- which(c(changes, TRUE)) >> > lowerMedian <- x[floor((first+last)/2)] >> > upperMedian <- x[ceiling((first+last)/2)] >> > median <- (lowerMedian+upperMedian)/2 >> > names(median) <- group[first] >> > median >> > } >> > >> > For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups >> > (in random order) the times are: >> > >> >> group<-sample(1:30000, size=100000, replace=TRUE) >> >> x<-rnorm(length(group))*10 + group >> >> unix.time(z0<-sapply(split(x,group), median)) >> > user system elapsed >> > 2.72 0.00 3.20 >> >> unix.time(z1<-gm(x,group)) >> > user system elapsed >> > 0.12 0.00 0.16 >> >> identical(z1,z0) >> > [1] TRUE >> >> I get: >> >> > unix.time(z0<-sapply(split(x,group), median)) >> user system elapsed >> 2.733 0.017 2.766 >> > unix.time(z1<-gm(x,group)) >> user system elapsed >> 2.897 0.032 2.946 >> >> >> Hadley >> >> >> -- >> http://had.co.nz/ >> > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.