And the plyr version of this would be (using DF as the data frame name) ## transform method, mapping length(runtime) to all observations ## similar to David's results: library('plyr') ddply(DF, .(time, partitioning_mode), transform, n = length(runtime)) # or equivalently, the newer and somewhat faster ddply(DF, .(time, partitioning_mode), mutate, n = length(runtime))
# If you just want the counts, then use ddply(DF, .(time, partitioning_mode), summarise, n = length(runtime)) ##--------- # Just for fun, here's the equivalent SQL call using sqldf(): library('sqldf') sqldf('select time partitioning_mode count(*) from DF group by time partitioning_mode') # which you can distribute over multiple lines for readability, e.g. sqldf('select time, partitioning_mode, count(*) as n from DF group by time, partitioning_mode') # Result: time partitioning_mode n 1 1 replication 4 2 1 sharding 11 ##--------- # To do the same type of summary in data.table (to follow up on Jim Holtman's post), here's one way: library(data.table) dt <- data.table(DF, key = 'time, partitioning_mode') dt[, list(n = length(runtime)), by = key(dt)] time partitioning_mode n [1,] 1 replication 4 [2,] 1 sharding 11 ###------ HTH, Dennis On Sun, Oct 23, 2011 at 10:29 AM, Giovanni Azua <brave...@gmail.com> wrote: > Hello, > > This is one problem at the time :) > > I have a data frame df that looks like this: > > time partitioning_mode workload runtime > 1 1 sharding query 607 > 2 1 sharding query 85 > 3 1 sharding query 52 > 4 1 sharding query 79 > 5 1 sharding query 77 > 6 1 sharding query 67 > 7 1 sharding query 98 > 8 1 sharding refresh 2932 > 9 1 sharding refresh 2870 > 10 1 sharding refresh 2877 > 11 1 sharding refresh 2868 > 12 1 replication query 2891 > 13 1 replication query 2907 > 14 1 replication query 2922 > 15 1 replication query 2937 > > and if I could use SQL ... omg! I really wish I could! I would do exactly > this: > > insert into throughput > select time, partitioning_mode, count(*) > from data.frame > group by time, partitioning_mode > > My attempted R versions are wrong and produce very cryptic error message: > >> throughput <- aggregate(x=df[,c("time", "partitioning_mode")], >> by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > incorrect number of dimensions > >> throughput <- aggregate(x=df, by=list(df$time,df$partitioning_mode), count) > Error in `[.default`(df2, u_id, , drop = FALSE) : > incorrect number of dimensions > >>throughput <- tapply(X=df$time, INDEX=list(df$time,df$partitioning), >>FUN=count) > I cant comprehend what comes out from this one ... :( > > and I thought C++ template errors were the most cryptic ;P > > Many many thanks in advance, > Best regards, > Giovanni > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.