To quickly see if any duplicates exist you could use table() and checking for entries that are more than 1. Use na.omit() on the entire data.frame before passing it to table. E.g., tmp <- with(na.omit(df1), table(parameter, station, site)) sum(tmp>1) # number of parameter/station/site keys with >1 entry That took 0.13 seconds on my machine, where your aggregate call took 18.42 seconds.
To keep only the first entry for a given key try something like df1.nodups <- df1[with(df1, !duplicated(paste(sep="\1",parameter,station,site))),] That is also very quick (0.06 seconds here). Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com > Folks, > > I'm checking the structure of a dataframe for duplicate parameters at a > site station (i.e depth should be measured once, not twice), using > aggregate to count each parameter within a site station. The fake data > below has only 26000 rows, and takes roughly 14 seconds. My real data has > 750000 rows and I had to stop execution after about an hour. The by() > function is faster, but I do not understand how to accurately associate > those results with my test data. > > How can I get this to work faster? I can't shake the feeling that it's > something simple. Thanks for any pointers. > > df <- data.frame(expand.grid('parameter'=LETTERS > ,'station'=letters[1:10] > ,'site'=1:100 > ) > ) > df$parameter = as.character(df$parameter) > df$station = as.character(df$station) > > df1 <- rbind(df, df[runif(nrow(df))>0.99,]) # add some duplicate > parameters > > tt <- df1 > system.time(tt <- aggregate(I(df1$parameter) > ,list('site'=df1$site > ,'station'=df1$station > ,'parameter'=df1$parameter > ) > ,function(x) { length(na.omit(x)) } > ) > ) > system.time(tt2 <- by(I(df1$parameter) > ,list('site'=df1$site > ,'station'=df1$station > ,'parameter'=df1$parameter > ) > ,function(x) { length(na.omit(x)) } > ,simplify=TRUE > ) > ) > > cur > > -- > Curt Seeliger, Data Ranger > Raytheon Information Services - Contractor to ORD > seeliger.c...@epa.gov > 541/754-4638 > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.