Thanks; I did not notice an appreciable difference between scan() and scan(what=double()) in this example. Adding to my confusion, I noticed a strange and apparently systematic discrepancy between the timing results depending on whether the code is run within R.app, within Emacs, or from a terminal. Any idea what might be causing this?
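A minimal sketch of such a comparison, for anyone wanting to reproduce it. The temporary file, the smaller 200 x 500 matrix, the five repetitions and the helper name time_scan are illustrative assumptions, not the original test setup. Since double() is already the default value of `what` in scan(), both calls should do the same work, which would be consistent with seeing no appreciable difference. Printing .Platform$GUI next to the timings records which front end produced them.

```r
## Sketch only: file, matrix size and repetition count are assumptions.
f  <- tempfile(fileext = ".txt")
nr <- 200
nc <- 500
m  <- matrix(round(rnorm(nr * nc), 3), nrow = nr)
write.table(m, file = f, row.names = FALSE, col.names = FALSE)

## Hypothetical helper: elapsed time of fun() over several runs,
## because a single system.time() call is noisy.
time_scan <- function(fun, reps = 5) {
  vapply(seq_len(reps),
         function(i) system.time(fun())[["elapsed"]],
         numeric(1))
}

t_default <- time_scan(function() scan(f, quiet = TRUE))
t_typed   <- time_scan(function() scan(f, what = double(), quiet = TRUE))

## Both calls should return the same numeric vector.
a <- scan(f, quiet = TRUE)
b <- scan(f, what = double(), quiet = TRUE)

## Record the front end together with the median timings.
cat("GUI:", .Platform$GUI, "\n")
print(c(default = median(t_default), typed = median(t_typed)))
print(identical(a, b))

unlink(f)
```

Running the same script unchanged from each front end, and comparing the medians rather than single runs, may help separate a genuine environment effect from run-to-run noise.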
Thanks,

baptiste

On 2 April 2012 11:04, Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
> On 12-04-01 2:58 AM, baptiste auguie wrote:
>>
>> Dear list,
>>
>> I am trying to find a fast solution to read moderately large (1 -- 10
>> million entries) text files containing only tab-delimited numeric
>> values. My test file is the following,
>>
>> nr<- 1000
>> nc<- 5000
>>
>> m<- matrix(round(rnorm(nr*nc),3),nr=nr)
>> write.table(m, file = "a.txt", append=FALSE,
>>             row.names = FALSE, col.names = FALSE)
>>
>>
>> scan() is faster than read.table(), as expected, but still quite slow
>> compared to Matlab for example. Based on archived discussions on this
>> list and Stack Overflow, I tried readChar(); it's really fast.
>> However, it returns a long character string, where I really want
>> numeric values. I can use as.numeric(strsplit()), but to my complete
>> surprise it is faster to run scan() on this text string. Consider the
>> following comparison (I use the command line wc to optimize the memory
>> allocation),
>
>
> Tell it the types of the columns, and it will go a bit faster.
>
> Duncan Murdoch
>
>>
>> load_file1<- function(f){
>>   ## ask wc the number of words
>>   n<- scan(textConnection(system(paste("wc -w ", f), intern=TRUE)),
>>            what=list(integer(), character()), quiet=TRUE)[[1]]
>>   all<- scan(f, nmax=n, quiet=TRUE)
>>   invisible(all)
>> }
>>
>> load_file2<- function(f){
>>   ## ask wc the number of characters
>>   n<- scan(textConnection(system(paste("wc -m ", f), intern=TRUE)),
>>            what=list(integer(), character()), quiet=TRUE)[[1]]
>>   tc<- textConnection(readChar(f, n))
>>   all<- scan(tc, quiet=TRUE, multi.line = FALSE)
>>   close(tc)
>>   invisible(all)
>> }
>>
>>
>> system.time(a<- load_file1("a.txt"))
>> ## user system elapsed
>> ## 7.805 0.138 8.026
>> system.time(b<- load_file2("a.txt"))
>> ## user system elapsed
>> ## 2.182 0.301 2.538
>> all.equal(a, b)
>> ##> [1] TRUE
>>
>>
>> Could someone explain to me why it is faster to scan a textConnection
>> than the original file? Have I missed a better solution?
>>
>> Thanks,
>>
>> baptiste
>>
>> sessionInfo()
>> R version 2.15.0 RC (2012-03-29 r58868)
>> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>>
>> locale:
>> [1] C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
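
The quoted load_file2() calls the external wc utility to size the readChar() request. As a hedged sketch, the same idea can be written without the system call by taking the size from file.info(). The name load_file3 is hypothetical, and the sketch assumes the file is plain ASCII so that its size in bytes equals its number of characters. It makes no claim about relative speed.

```r
## Hypothetical variant of load_file2() that avoids calling 'wc'.
## Assumption: ASCII content, so bytes == characters.
load_file3 <- function(f) {
  n  <- file.info(f)$size               # file size in bytes
  tc <- textConnection(readChar(f, n))  # whole file as one string
  on.exit(close(tc))                    # close even if scan() fails
  scan(tc, quiet = TRUE)
}

## Small illustrative test file (size chosen arbitrarily).
f <- tempfile(fileext = ".txt")
m <- matrix(round(rnorm(20 * 30), 3), nrow = 20)
write.table(m, file = f, row.names = FALSE, col.names = FALSE)

x <- load_file3(f)
y <- scan(f, quiet = TRUE)
print(all.equal(x, y))
```

Using on.exit() for the close keeps the connection from leaking if the parse stops with an error, which the original close(tc) call would not guarantee.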