Hi All, I'm working on analyzing a large data set; let's assume that dim(Data) = c(1000, 8700). I want to calculate the Canberra distance between the columns of this matrix. Here is the timing for a toy example ('test' is a matrix filled with random numbers between 0 and 1):
> system.time(d <- as.matrix(dist(t(test), method = "canberra", diag = FALSE, upper = FALSE, p = 2)))
    user   system  elapsed
1417.713    3.219 1421.144

Is there any way to calculate the distances that would take less time? I am already parallelizing this to a great extent (the real data has many more rows), but I can't go below 1000 rows if I want reliable results. I will also calculate the distances repeatedly (about 100 times when using 1000 rows) while removing small parts of the matrix.

The system.time results also confuse me a bit, since 99% of the time is user time rather than system time. What does that mean?

I'm on a Linux server and should have about 48GB of RAM available.

Any suggestions appreciated,
Bo

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.iso885915       LC_NUMERIC=C
 [3] LC_TIME=en_US.iso885915        LC_COLLATE=en_US.iso885915
 [5] LC_MONETARY=C                  LC_MESSAGES=en_US.iso885915
 [7] LC_PAPER=en_US.iso885915       LC_NAME=C
 [9] LC_ADDRESS=C                   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.iso885915 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] hash_2.1.0 mmap_0.6-9

loaded via a namespace (and not attached):
[1] tools_2.12.1

$ uname -a
Linux compute-13-2.local 2.6.18-164.6.1.el5 #1 SMP Tue Nov 3 16:12:36 EST 2009 x86_64 x86_64 x86_64 GNU/Linux
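For anyone who wants to reproduce the timing on a smaller scale, a minimal self-contained sketch of the toy example might look like the following. The reduced size (1000 x 200), the seed, and the use of runif() are assumptions for illustration; the original 'test' matrix is only described as holding random numbers between 0 and 1. The last lines cross-check one entry against the Canberra definition for non-negative values, sum(|x - y| / (x + y)).

```r
# Toy data: size and seed are assumptions chosen so this runs in seconds.
set.seed(1)
n_rows <- 1000
n_cols <- 200
test <- matrix(runif(n_rows * n_cols), nrow = n_rows, ncol = n_cols)

# dist() computes distances between rows, so transpose to compare columns.
timing <- system.time(
  d <- as.matrix(dist(t(test), method = "canberra"))
)
print(timing)

# Cross-check one entry by hand. For non-negative data the Canberra
# distance between columns x and y is sum(|x - y| / (x + y)).
x <- test[, 1]
y <- test[, 2]
manual <- sum(abs(x - y) / (x + y))
stopifnot(isTRUE(all.equal(unname(d[1, 2]), manual)))
```

The number of column pairs grows quadratically with the number of columns, so timings from this reduced size should not be extrapolated linearly to 8700 columns.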