Hi Hadley,

Does FAQ 1.8 answer that OK?

  "Ok, I'm starting to see what data.table is about, but why didn't you
  enhance data.frame in R? Why does it have to be a new package?"

http://datatable.r-forge.r-project.org/datatable-faq.pdf

Matthew

"Hadley Wickham" <had...@rice.edu> wrote in message
news:AANLkTik180p4YmBtR3QUCW7r=fdefxzbxsy3zwtik...@mail.gmail.com...
On Mon, Feb 7, 2011 at 5:54 AM, Matthew Dowle <mdo...@mdowle.plus.com> wrote:
> Looking at the timings by each stage may help :
>
>> system.time(dt <- data.table(dat))
>    user  system elapsed
>    1.20    0.28    1.48
>> system.time(setkey(dt, x1, x2, x3, x4, x5, x6, x7, x8))  # sort by the 8 columns (one-off)
>    user  system elapsed
>    4.72    0.94    5.67
>> system.time(udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = 'x1, x2, x3, x4, x5, x6, x7, x8'])
>    user  system elapsed
>    2.00    0.21    2.20   # compared to 11.07s
>>
>
> data.table doesn't have a custom data structure, so it can't be that.
> data.table's structure is the same as data.frame i.e. a list of vectors.
> data.table inherits from data.frame. It *is* a data.frame, too.
>
> The reasons it is faster in this example include :
> 1. Memory is only allocated for the largest group.
> 2. That memory is re-used for each group.
> 3. Since the data is ordered contiguously in RAM, the memory is copied over
>    in bulk for each group using memcpy in C, which is faster than a for loop
>    in C. Page fetches are expensive; they are minimised.

But this is exactly what I mean by a custom data structure - you're
not using the usual data frame API.

Wouldn't it be better to implement these changes to data frame so that
everyone can benefit? Or is it just too specialised to this particular
case (where I guess you're using that the return data structure of the
summary function is consistent)?

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
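
For anyone following the thread without the original data, the three timed stages can be sketched on synthetic input. This is only a minimal sketch, not the benchmark from the thread: `dat` below is a hypothetical stand-in (eight small integer key columns plus a numeric `y`), the size `n` is an assumption, and `setkeyv()` with a character-vector `by` assumes a reasonably current data.table release. It also checks the claim that a data.table is still a data.frame, i.e. a list of vectors.

```r
library(data.table)

set.seed(1)
n <- 1e5                      # assumed size; the original `dat` is not shown in the thread
keys <- paste0("x", 1:8)

# Hypothetical stand-in for `dat`: eight integer key columns and a numeric y.
dat <- as.data.frame(replicate(8, sample(3L, n, replace = TRUE)))
names(dat) <- keys
dat$y <- runif(n)

# Stage 1: convert. The result is still a data.frame, i.e. a list of vectors.
dt <- data.table(dat)
stopifnot(is.data.frame(dt), is.list(dt))

# Stage 2: one-off sort by the eight key columns.
setkeyv(dt, keys)

# Stage 3: grouped sum over the keyed table.
udt <- dt[, list(y = sum(y, na.rm = TRUE)), by = keys]

# Sanity checks: one row per distinct key combination, and the same grand total.
stopifnot(nrow(udt) == nrow(unique(dat[keys])))
stopifnot(isTRUE(all.equal(sum(udt$y), sum(dat$y))))
```

Wrapping each stage in `system.time()` reproduces the shape of the comparison above; the absolute numbers will depend on the machine and on how `dat` is built.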