Hi,

On Tue, Nov 26, 2013 at 11:41 AM, Noah Silverman <noahsilver...@ucla.edu> wrote:
> All interesting suggestions.
>
> I guess a better example of the code would have been a good idea. So,
> I'll put a relevant snippet here.
>
> Rows are cases. There are multiple cases for each ID, marked with a
> date. I'm trying to calculate a time recency weighted score for a
> covariate, added as a new column in the data.frame.
>
> So, for each row, I need to see which ID it belongs to, then get all the
> scores prior to this row's date, then compute the recency weighted summary.
>
> Right now, I do this in an obvious, but very very slow way.
>
> Here is my slow code:
> ======================
> for(i in 1:nrow(d)){
>     for(j in which( d$id == d$id[i] & d$date[j] < d$date[i]) ){
>         days_since = as.numeric( d$date[i] - d$date[j] )
>         w <- exp( -days_since/decay )
>         temp <- temp + w * as.numeric(d[j,'score'])
>         wTemp <- wTemp + w
>     }
>
>     temp <- temp / wTemp
>     d$newScore[i,] <- temp
> }
> ======================
>
> One immediate thought was to turn the "date" into an integer. That
> should save a few cycles of date math.
>
> I need to do this process for a bunch of scores. A grid search over
> different time decay levels might be nice. So any speedup to this
> routine will save me a ton of time.
>
> Ideas?
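For reference, here is one reading of the loop above as a runnable baseline that any faster version can be checked against. It is only a sketch: the toy data and `decay <- 10` are made up, the running accumulators are replaced by per-row sums so nothing carries over between rows, and rows with no earlier record for their id are assumed to get NA.

```r
## Hypothetical toy data; column names follow the snippet above.
d <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  date  = as.Date(c("2013-01-01", "2013-01-11", "2013-01-21",
                    "2013-02-01", "2013-02-06")),
  score = c(10, 20, 30, 5, 15)
)
decay <- 10  # assumed decay constant, in days

d$newScore <- NA_real_
for (i in seq_len(nrow(d))) {
  ## rows of the same id with a strictly earlier date
  prior <- which(d$id == d$id[i] & d$date < d$date[i])
  if (length(prior) == 0L) next        # no history: leave NA (assumption)
  days_since <- as.numeric(d$date[i] - d$date[prior])
  w <- exp(-days_since / decay)        # older records get smaller weights
  d$newScore[i] <- sum(w * d$score[prior]) / sum(w)
}
```

This is still quadratic in the number of rows, so it is only meant as a correctness check on small data, not as a speedup.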
A few quick ones.

You had said you tried data.table and found it to be slow still. My
guess is that you might not have used it correctly, so here is a rough
sketch of what to do.

Let's assume that your date is converted to some integer. I will leave
that exercise to you :-) but it seems like you just want to calculate
the number of (whole) days since an event that you have a record for,
so this should be (in principle) easy to do (if you really need the
full power of "date math", data.table supports that as well).

Also, you never "reset" your `temp` variable (or `wTemp`), so it looks
like you are carrying `temp` over from one row, and from one `id` group,
to the next (and, while I have no knowledge of your problem, I would
imagine this is not what you want to do).

Anyway, some rough ideas to get you started:

  R> d <- as.data.table(d)
  R> setkeyv(d, c('id', 'date'))

Now the records within each id are ordered by date, from first to last.

The specifics of your decay score escape me a bit, e.g. what is the
value of "days_since" for the first record of each id? I'll let you
figure that out, but in the non-edge cases, it looks like you can just
calculate "days since" by subtracting the date of the record before it
from the current date. (Note that `.I` is a special data.table variable
for the row number of a given record in the original data.table):

  d[, newScore := {
    ## Date of the previous record. `.I[-1] - 1` indexes the rows just
    ## before records 2..n of this `id` group; the first record has no
    ## predecessor, so it gets NA here (decide what you want instead).
    prev_date  <- c(NA, d$date[.I[-1] - 1])
    days_since <- date - prev_date
    w <- exp(-days_since / decay)
    ## ...
    ## Some other stuff you are doing here which I can't
    ## understand with temp ... then multiply the 'score' column
    ## for the given row by your correctly calculated weight `w`
    ## for that row (whatever it might be).
    w * score
  }, by='id']

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
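
P.S. The sketch above only looks at the previous record, while the original question asks for a weighted mean over all earlier records of an id. A minimal data.table sketch of that reading might look like the following. The toy data, the `recency_score` helper, and `decay <- 10` are all made up for illustration; `date` is assumed to be an integer day count already, and records with no earlier history are assumed to get NA.

```r
library(data.table)

## Hypothetical toy data: `date` is already an integer day count.
d <- data.table(
  id    = c(2, 1, 1, 2, 1),
  date  = c(31L, 0L, 20L, 36L, 10L),
  score = c(5, 10, 30, 15, 20)
)
decay <- 10  # assumed decay constant, in days

## Hypothetical helper: for each record of ONE id, the recency-weighted
## mean of the scores on strictly earlier dates (NA if there are none).
recency_score <- function(date, score, decay) {
  lag <- outer(date, date, "-")                 # lag[i, j] = date[i] - date[j]
  w   <- ifelse(lag > 0, exp(-lag / decay), 0)  # weight only earlier records
  tot <- rowSums(w)
  out <- drop(w %*% score) / tot
  out[tot == 0] <- NA_real_                     # no history for this record
  out
}

setkeyv(d, c("id", "date"))                     # sort by id, then by date
d[, newScore := recency_score(date, score, decay), by = "id"]
```

The `outer()` call builds an n-by-n matrix per id, so this trades memory for speed and suits ids with a moderate number of records. For a grid search, the same call can be repeated with different `decay` values.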