thanks, josh. in my posting example, I did not need anything except coefficients. (when this is the case, I usually do not even use lm.fit; I eliminate all missing obs first and then use solve(crossprod(cbind(1,x)), crossprod(cbind(1,x), y)). this is pretty fast.)
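
spelled out, that shortcut is something like this minimal sketch (assuming x and y are numeric vectors; ok, X, and b are illustrative names, not from the thread):

  ok <- complete.cases(x, y)                       # drop rows with missing obs first
  X  <- cbind(1, x[ok])                            # design matrix with an intercept column
  b  <- solve(crossprod(X), crossprod(X, y[ok]))   # (X'X)^{-1} X'y, the OLS coefficients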

alas, I will need to figure out how to get coefficient standard errors faster in this case. summary.lm() is really slow.
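
one candidate that should be faster: compute them directly from an lm.fit() result. a minimal sketch, assuming a full-rank numeric design matrix X (intercept column included) and a response vector y; fit, rss, sigma2, and se are illustrative names:

  fit    <- lm.fit(x = X, y = y)                       # coefficients via QR, skipping lm() overhead
  rss    <- sum(fit$residuals^2)                       # residual sum of squares
  sigma2 <- rss / fit$df.residual                      # estimated residual variance
  se     <- sqrt(sigma2 * diag(solve(crossprod(X))))   # coefficient standard errors

this mirrors what summary.lm() computes internally, without the model-frame construction and method dispatch.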

regards,

/iaw

----
Ivo Welch (ivo.we...@gmail.com)
http://www.ivo-welch.info/
J. Fred Weston Professor of Finance
Anderson School at UCLA, C519


On Mon, Oct 10, 2011 at 11:30 PM, Joshua Wiley <jwiley.ps...@gmail.com> wrote:
> As another follow-up: given that you are doing numerous regression
> models and (I presume) working with finance/stock data that is
> strictly numeric (no need for special contrast coding, etc.), you can
> substantially reduce the time spent estimating the coefficients. A
> simple way is to use lm.fit directly instead of lm. For lm.fit, you
> pass the y and x (design) matrices directly, which skips a good deal
> of overhead. Here is one naive way; I imagine more speedups could be
> gained by incorporating the intercept (a vector of 1s) into d instead
> of cbind()ing it. The catch is that lm.fit requires matrices, not
> data tables, so what you gain may be lost in having to do an extra
> conversion. In any case, here are the times on my system for the two
> options (note I used N = 1000 * 100 because I am presently on a
> glorified netbook):
>
>> print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
> + x, data=d[.indx,])) })))
>    user  system elapsed
>   69.00    0.00   69.56
>
>> print(system.time(all.2c <- lapply(si, function(.indx) { coef(lm.fit(y =
>> d[.indx, y], x = cbind(1, d[.indx, x]))) })))
>    user  system elapsed
>   37.83    0.03   38.36
>
> The column names for the coefficients will not be the same as from lm,
> but the estimates should be identical. While this is not recommended
> in typical usage, in an application like regressions on rolling time
> windows, where you know the data are not changing, I think it makes
> sense to bypass the clever code that inspects your data and chooses
> the best methods, and to go straight to passing the design matrix.
> Since you do not need residuals, variances, etc., it may be possible
> to speed this up even more, perhaps by bypassing dqrls altogether.
>
> Cheers,
>
> Josh
>
> On Mon, Oct 10, 2011 at 9:56 PM, ivo welch <ivo.we...@gmail.com> wrote:
>> thank you, everyone. this was very helpful to my specific task and
>> understanding. for the benefit of future googlers, I thought I would
>> post some experiments and results here.
>>
>> ultimately, I need to do a by() on an irregularly grouped data table,
>> and I now know how to speed up by() on a single core, and then again
>> on a multi-core machine.
>>
>> library(data.table)
>> N <- 1000*1000
>> d <- data.table(data.frame( key= as.integer(runif(N, min=1,
>> max=N/10)), x=rnorm(N), y=rnorm(N) ))  # irregular group sizes
>> setkey(d, "key"); gc()  ## sort and force a garbage collection
>>
>> cat("N=", N, ". Size of d=", object.size(d)/1024/1024, "MB\n")
>>
>> cat("\nStandard by() Function:\n")
>> print(system.time( all.1 <- by( d, d$key, function(d) coef(lm(y ~ x,
>> data=d)))))
>>
>> cat("\n\nPreSplit Function [aka Jim H]\n\t(a) Splitting Operation:\n")
>> print(system.time(si <- split(seq(nrow(d)), d$key)))
>> cat("\n\t(b) Regressions:\n")
>> print(system.time(all.2 <- lapply(si, function(.indx) {
>> coef(lm(d$y[.indx] ~ d$x[.indx])) })))
>> print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
>> x, data=d[.indx,])) })))
>>
>> cat("\n\nNaive Split Data Frame\n\t(a) Splitting Operation:\n")
>> print(system.time(ds <- split(d, d$key)))
>> cat("\n\t(b) Regressions:\n")
>> print(system.time(all.3a <- lapply(ds, function(ds) { coef(lm(ds$y ~
>> ds$x)) })))
>> print(system.time(all.3b <- lapply(ds, function(ds) { coef(lm(y ~ x,
>> data=ds)) })))
>>
>> the first and the last ways (all.1 and all.3) are "naive" ways of
>> doing this, and take about 400-500 seconds on a Mac Air, core i5.
>> Jim's suggestion (all.2) cuts this roughly in half by making the
>> split take almost no time.
>>
>> and now,
>>
>> library(multicore)
>> print(system.time(all.4 <- mclapply(si, function(.indx) { coef(lm(y ~
>> x, data=d[.indx,])) })))
>>
>> on my dual-core (quad-thread) i5, all four pseudo-cores become busy,
>> and the time roughly halves again, from 230 seconds to 120 seconds.
>>
>> maybe the by() function should use Jim's approach, and multicore
>> should provide mcby(). of course, now that I know how to do this
>> fast by hand, it is not so important for me. but it may help some
>> other novices.
>>
>> thanks again, everybody.
>>
>> regards,
>>
>> /iaw
>>
>> ----
>> Ivo Welch (ivo.we...@gmail.com)
>>
>> On Mon, Oct 10, 2011 at 9:31 PM, William Dunlap <wdun...@tibco.com> wrote:
>>> The following avoids the overhead of data.frame methods
>>> (and assumes the data.frame doesn't include matrices
>>> or other data.frames) and relies on split(vector, factor)
>>> quickly splitting a vector into a list of vectors.
>>> For a 10^6-row by 10-column data.frame split into 10^5
>>> groups, this took 14.1 seconds, while split took 658.7 seconds.
>>> Both returned the same thing.
>>>
>>> Perhaps something based on this idea would help your
>>> parallelized by().
>>>
>>> mysplit.data.frame <-
>>> function (x, f, drop = FALSE, ...)
>>> {
>>>     f <- as.factor(f)
>>>     tmp <- lapply(x, function(xi) split(xi, f, drop = drop, ...))
>>>     rn <- split(rownames(x), f, drop = drop, ...)
>>>     tmp <- unlist(unname(tmp), recursive = FALSE)
>>>     tmp <- split(tmp, factor(names(tmp), levels = unique(names(tmp))))
>>>     tmp <- lapply(setNames(seq_along(tmp), names(tmp)), function(i) {
>>>         t <- tmp[[i]]
>>>         names(t) <- names(x)
>>>         attr(t, "row.names") <- rn[[i]]
>>>         class(t) <- "data.frame"
>>>         t
>>>     })
>>>     tmp
>>> }
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
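
a quick way to sanity-check bill's mysplit.data.frame is on toy data; a small sketch, where dd and parts are illustrative names, not from the thread:

  dd    <- data.frame(key = rep(1:3, each = 4), x = rnorm(12), y = rnorm(12))
  parts <- mysplit.data.frame(dd, dd$key)
  str(parts[["2"]])                     # a plain data.frame holding the key == 2 rows
  all.equal(parts, split(dd, dd$key))   # should be TRUE: same result as the slower split()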

>>>
>>>> -----Original Message-----
>>>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
>>>> On Behalf Of Jim Holtman
>>>> Sent: Monday, October 10, 2011 7:29 PM
>>>> To: ivo welch
>>>> Cc: r-help
>>>> Subject: Re: [R] SLOW split() function
>>>>
>>>> instead of splitting the entire data frame, split the indices and
>>>> then use these to access your data. try
>>>>
>>>> system.time(s <- split(seq(nrow(d)), d$key))
>>>>
>>>> this should be faster and less memory-intensive. you can then use
>>>> the indices to access the subset:
>>>>
>>>> result <- lapply(s, function(.indx){
>>>>     doSomething <- sum(d$someCol[.indx])
>>>> })
>>>>
>>>> Sent from my iPad
>>>>
>>>> On Oct 10, 2011, at 21:01, ivo welch <ivo.we...@gmail.com> wrote:
>>>>
>>>> > dear R experts: apologies for all my speed and memory questions. I
>>>> > have a bet with my coauthors that I can make R reasonably efficient
>>>> > through R-appropriate programming techniques. this is not just for
>>>> > kicks, but for work. for benchmarking, my [3-year-old] Mac Pro has
>>>> > 2.8GHz Xeons, 16GB of RAM, and R 2.13.1.
>>>> >
>>>> > right now, it seems that split() is why I am losing my bet. (split()
>>>> > is an integral component of the *apply() functions and of by(), so I
>>>> > need split() to be fast. its resulting list can then be fed, e.g.,
>>>> > to mclapply().) I made up an example to illustrate my ills:
>>>> >
>>>> > library(data.table)
>>>> > N <- 1000
>>>> > T <- N*10
>>>> > d <- data.table(data.frame( key= rep(1:T, rep(N,T)), val=rnorm(N*T) ))
>>>> > setkey(d, "key"); gc()  ## force a garbage collection
>>>> > cat("N=", N, ". Size of d=", object.size(d)/1024/1024, "MB\n")
>>>> > print(system.time( s <- split(d, d$key) ))
>>>> >
>>>> > my ordered input data table (or data frame; it makes no difference)
>>>> > is 114MB in size and takes about a second to create. split() only
>>>> > needs to reshape it, yet this simple operation takes almost 5
>>>> > minutes on my computer.
>>>> >
>>>> > with a larger data set, this explodes further.
>>>> >
>>>> > am I doing something wrong? is there an alternative to split()?
>>>> >
>>>> > sincerely,
>>>> >
>>>> > /iaw
>>>> >
>>>> > ----
>>>> > Ivo Welch (ivo.we...@gmail.com)
>
> --
> Joshua Wiley
> Ph.D. Student, Health Psychology
> Programmer Analyst II, ATS Statistical Consulting Group
> University of California, Los Angeles
> https://joshuawiley.com/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
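
postscript for future googlers: the thread's pieces combine into one fast pattern for grouped regressions, sketched below. it reuses the d and si from the benchmarks above; fastcoef is an illustrative helper name, not from the thread. the multicore package was current when this was written; on modern R, parallel::mclapply() is the drop-in replacement.

  library(multicore)                      # on modern R: library(parallel)
  si <- split(seq_len(nrow(d)), d$key)    # Jim's trick: split row indices, not the data
  fastcoef <- function(.indx) {           # illustrative helper, not from the thread
      X <- cbind(1, d$x[.indx])           # design matrix with intercept
      lm.fit(x = X, y = d$y[.indx])$coefficients   # Josh's lm.fit shortcut
  }
  all.fast <- mclapply(si, fastcoef)      # parallel over groups, as in all.4 above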