I have noticed a significant performance degradation in merge() in R 2.9.1 relative to R 2.8.1. Here is what I observed:

  N <- 100000
  X <- data.frame(group=rep(12:1, each=N), mon=rep(rev(month.abb), each=N))
  X$mon <- as.character(X$mon)
  Y <- data.frame(mon=month.abb, letter=letters[1:12])
  Y$mon <- as.character(Y$mon)

  Z <- cbind(Y, group=1:12)

  system.time(Out <- merge(X, Y, by="mon", all=TRUE))
  # R 2.8.1 is 17% faster than R 2.9.1 for N=100000

  system.time(Out <- merge(X, Z, by=c("mon", "group"), all=TRUE))
  # R 2.8.1 is 16% faster than R 2.9.1 for N=100000

Here is the head of summaryRprof() for 2.8.1:
$by.self
                   self.time self.pct total.time total.pct
sort.list               4.60     56.5       4.60      56.5
make.unique             1.68     20.6       2.18      26.8
as.character            0.50      6.1       0.50       6.1
duplicated.default      0.50      6.1       0.50       6.1
merge.data.frame        0.20      2.5       8.02      98.5
[.data.frame            0.16      2.0       7.10      87.2

and for 2.9.1:
$by.self
                   self.time self.pct total.time total.pct
sort.list               4.66     39.2       4.66      39.2
nchar                   3.28     27.6       3.28      27.6
make.unique             1.42     12.0       1.92      16.2
as.character            0.50      4.2       0.50       4.2
data.frame              0.46      3.9       4.12      34.7
[.data.frame            0.44      3.7       7.28      61.3

As you can see, the 2.9.1 profile has an nchar entry (3.28 s self time, 27.6%) that does not appear in the head of the 2.8.1 profile and is quite time consuming.
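
For anyone who wants to reproduce a profile like the ones above, the following is a minimal sketch using Rprof() and summaryRprof(). The temporary profile file and the variable names prof_file and prof are illustrative choices, not necessarily the exact settings used for the numbers above, and the output will differ by machine and R version.

```r
# Rebuild the test data from the example above.
N <- 100000
X <- data.frame(group = rep(12:1, each = N),
                mon = rep(rev(month.abb), each = N),
                stringsAsFactors = FALSE)
Y <- data.frame(mon = month.abb, letter = letters[1:12],
                stringsAsFactors = FALSE)

prof_file <- tempfile()   # illustrative: any writable file path will do
Rprof(prof_file)          # start the sampling profiler
Out <- merge(X, Y, by = "mon", all = TRUE)
Rprof(NULL)               # stop profiling

prof <- summaryRprof(prof_file)
head(prof$by.self)        # functions ranked by self time
```

Running the same script under both R versions and comparing the by.self tables should show whether nchar appears near the top only in 2.9.1.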

Is there a way to avoid the degradation in performance in 2.9.1?

Thank you,
Adrian

As an aside, I became interested in testing merge in 2.9.1 after reading Tim Bergsma's r-devel message of 30-May-2009, "Degraded performance with rank()", in which he mentions doing merges. I only got around to testing it today.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
