I have some processes where I do the same thing: iterate over subsets of a data frame. My data frame has ~250,000 rows and 30 variables, and there are about 6000 subsets.
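For anyone who wants to try this, here is a rough sketch of a comparable setup. The row and subset counts are the ones given above; the column names (id, id_factor, x), the random data, and the value of person are made up for illustration. It times a single which() lookup on an integer ID and on a factor ID using system.time(). The split() step at the end is an alternative that is not part of the original exchange: it computes the row indexes for every subset in one pass, so the loop only has to look them up.

```r
# Synthetic stand-in for the data frame described above.
# Sizes come from the text; names and values are hypothetical.
set.seed(1)
n_rows   <- 250000
n_groups <- 6000
d <- data.frame(
  id = sample.int(n_groups, n_rows, replace = TRUE),
  x  = rnorm(n_rows)
)
d$id_factor <- factor(d$id)   # same IDs stored as a factor with many levels

person <- 42L                 # arbitrary ID to look up

# Time one lookup against the integer column and one against the factor.
t_int    <- system.time(j_int    <- which(d$id == person))
t_factor <- system.time(j_factor <- which(d$id_factor == as.character(person)))
print(t_int)
print(t_factor)

# Alternative (an assumption, not from the thread): build all index
# vectors once, then fetch them by name inside the loop.
idx     <- split(seq_len(nrow(d)), d$id)
j_split <- idx[[as.character(person)]]

# All three approaches should select the same rows.
stopifnot(identical(j_int, j_factor), identical(j_int, j_split))
```

Timings will differ by machine and R version, so treat the printed numbers only as a relative comparison.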
Performing a which() statement like yours seems quite fast. For example, wrapping unix.time() around the which() expression, I get

   user  system elapsed
  0.008   0.000   0.008

It's hard for me to imagine the single task of getting the indexes is slow enough to be a bottleneck.

On the other hand, if the variable being used to identify subsets is a factor with many levels (~6000 in my case), it is noticeably slower.

   user  system elapsed
  0.024   0.002   0.026

I haven't tested it, and have no real expectation that it will make a difference, but perhaps sorting by the index variable before iterating will help (if you haven't already). Since these are not true indexes in the sense used by relational database systems, maybe it will make a difference.

--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062

On 11/20/13 12:16 PM, "Noah Silverman" <noahsilver...@g.ucla.edu> wrote:

>Hello,
>
>I have a fairly large data.frame. (About 150,000 rows of 100
>variables.) There are case IDs, and multiple entries for each ID, with a
>date stamp. (i.e. records of peoples activity.)
>
>
>I need to iterate over each person (record ID) in the data set, and then
>process their data for each date. The processing part is fast, the date
>part is fast. Locating the records is slow. I've even tried using
>data.table, with ID set as the index, and it is still slow.
>
>The line with the slow process (According to Rprof) is:
>
>
>j <- which( d$id == person )
>
>(I then process all the records indexed by j, which seems fast enough.)
>
>where d is my data.frame or data.table
>
>I thought that using the data.table indexing would speed things up, but
>not in this case.
>
>Any ideas on how to speed this up?
>
>
>Thanks!
>
>--
>Noah Silverman, M.S., C.Phil
>UCLA Department of Statistics
>8117 Math Sciences Building
>Los Angeles, CA 90095
>
>______________________________________________
>R-help@r-project.org mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide
>http://www.R-project.org/posting-guide.html
>and provide commented, minimal, self-contained, reproducible code.