For me with ff - on a 3 GB notebook - 3e6 x 100 works out of the box even without 
compression: the doubles consume 2.2 GB on disk, but the R process stays under 
100 MB; the rest of the RAM is used by the file-system cache.
If you are on Windows, you can create the ffdf files in a compressed folder. 
For the random doubles this reduces the size on disk to 230 MB - which should even 
work on a 1 GB notebook.
BTW: the most compressed data type (vmode) that can handle NAs is "logical": 
it consumes 2 bits per tri-bool (FALSE/TRUE/NA). The next most compressed is "byte", 
covering c(NA, -127:127) and consuming - as its name says - one byte per element 
on disk and in the fs-cache.
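
In case a concrete call helps, a minimal sketch of those vmodes (ff() and vmode() 
are standard ff functions; the lengths here are arbitrary):

library(ff)
b <- ff(vmode="logical", length=10)   # 2 bits per element: FALSE/TRUE/NA
y <- ff(vmode="byte", length=10)      # 1 byte per element: NA and -127:127
vmode(b)   # "logical"
vmode(y)   # "byte"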

The code below should give an idea of how to do pairwise stats on columns where 
each pair fits easily into RAM. In the real world you would not create the 
data but import it using read.csv.ffdf (expect reading your csv file to take 
longer than reading/writing the ffdf).
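
For reference, a hedged sketch of such an import; the file name and colClasses 
below are placeholders for your data, and next.rows only sets the chunk size 
read per pass:

library(ff)
d <- read.csv.ffdf(file="mydata.csv", header=TRUE,   # "mydata.csv" is a placeholder
                   colClasses=rep("double", 100),    # assumed: 100 numeric columns
                   next.rows=100000)                 # rows read per chunk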

Regards


Jens Oehlschlägel



library(ff)
k <- 100
n <- 3e6

# creating an ffdf data frame of the required size
l <- vector("list", k)
for (i in 1:k)
  l[[i]] <- ff(vmode="double", length=n, update=FALSE)  # file-backed double vector, not initialized with initdata
names(l) <- paste("c", 1:k, sep="")
d <- do.call("ffdf", l)

# writing the 100 columns of n random doubles takes ~90 sec
system.time(
for (i in 1:k){
  cat(i, " ")
  print(system.time(d[,i] <- rnorm(n))["elapsed"])
  }
)["elapsed"]


m <- matrix(as.double(NA), k, k)

# pairwise correlating one column against all others takes ~ 17.5 sec
# pairwise correlating all combinations takes 15 min
system.time(
for (i in 2:k){
  cat(i, " ")
  print(system.time({
    x <- d[[i]][]                 # read column i completely into RAM
    for (j in 1:(i-1)){
      m[i,j] <- cor(x, d[[j]][])  # correlate against column j read from disk
    }
  })["elapsed"])
}
)["elapsed"]

