I noticed that joining two data.frames in R with the merge() function
is substantially slower when using by='row.names' than when joining
on a common index column.

With data frames of ~10,000 rows, the by='row.names' case takes as long
as 10 minutes, versus about 1 second using an index column.
Beyond about 10^6 rows, it's unusably slow.


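n <- 5
# a: ids 1..10^n, with row names set to the id
a <- data.frame(id=as.character(1:10^n), x=rnorm(10^n))
rownames(a) <- a$id
# b: ids shifted by 10^(n-1), so only part of the range overlaps with a
b <- data.frame(id=as.character(1:10^n + 10^(n-1)), y=rnorm(10^n))
rownames(b) <- b$id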

date()
fast <- merge(a, b, all=TRUE)                  # joins on the common column "id"
date()
slow <- merge(a, b, all=TRUE, by='row.names')  # joins on the row names
date()
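
One possible workaround, as a sketch only: copy the row names into an
ordinary column and merge on that column instead of by='row.names'. The
column name "key" and the smaller n are assumptions for illustration, and
system.time() stands in for the date() calls so that elapsed seconds are
reported directly.

```r
# Sketch: merge on an explicit key column built from the row names.
# "key" is a hypothetical column name; n is kept small so this runs quickly.
n <- 3
shift <- as.integer(10^(n - 1))

a <- data.frame(x = rnorm(10^n), row.names = as.character(seq_len(10^n)))
b <- data.frame(y = rnorm(10^n), row.names = as.character(seq_len(10^n) + shift))

# Copy the row names into a regular character column.
a_keyed <- data.frame(key = rownames(a), a, stringsAsFactors = FALSE)
b_keyed <- data.frame(key = rownames(b), b, stringsAsFactors = FALSE)

# Time both variants of the full outer join.
t_key <- system.time(by_key <- merge(a_keyed, b_keyed, by = "key", all = TRUE))
t_rn  <- system.time(by_rn  <- merge(a, b, by = "row.names", all = TRUE))

print(rbind(key_column = t_key, row_names = t_rn)[, "elapsed", drop = FALSE])
```

Both calls should return the same set of rows; the join key ends up in the
"key" column in the first result and in the "Row.names" column in the second.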


Has anybody else noticed this?

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
