[R] Speeding reading of a large file

Fisher Dennis Mon, 03 Dec 2012 14:51:10 -0800

Colleagues,  

This past week, I asked the following question:


        I have a file that looks like this:

        TABLE NO.  1
         PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES       EVID        CP          PRED        RES         WRES
          2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
          2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00
          2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00
          2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00
          2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00
          2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00
          2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00
          2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00

        The file is reasonably large (> 10^6 lines), and the two-line header is
        repeated periodically in the file.
        I need to read this file in as a data frame.  Note that the number of
        columns, the column headers, and the number of replicates of the headers
        are not known in advance.

I received a number of replies, many of them quite useful.  Of these, one beat 
out all the others in my benchmarking using files ranging from 10^5 to 10^6 
lines.
That version, provided by Jim Holtman, was:
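        # Skip the first "TABLE NO." line; take column names from the next line.
        # fill = TRUE pads the short "TABLE NO." rows that recur later in the file.
        x <- read.table(FILE, as.is = TRUE, skip = 1, fill = TRUE, header = TRUE)
        # Repeated header text cannot be parsed as a number, so it becomes NA.
        x[] <- lapply(x, as.numeric)
        # Drop the rows that came from the repeated headers.
        x <- x[!is.na(x[, 1]), ]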
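
For anyone who wants to try this on a small case first, here is a minimal sketch that writes a toy file in the same layout and applies the three steps above.  The toy values, the reduced set of four columns, the temporary file, and the suppressWarnings() call are my assumptions for illustration; they are not part of Jim's code.

```r
# Toy file in the same layout: a two-line header that recurs mid-file.
# (Values and the reduced column set are made up for illustration.)
toy <- c(
  "TABLE NO.  1",
  " PTID        TIME        AMT         IPRED",
  "  2.0010E+03  3.9375E-01  5.0000E+03  0.0000E+00",
  "  2.0010E+03  8.9583E-01  5.0000E+03  3.3389E+00",
  "TABLE NO.  2",
  " PTID        TIME        AMT         IPRED",
  "  2.0020E+03  1.4583E+00  5.0000E+03  5.8164E+00"
)
FILE <- tempfile(fileext = ".tab")
writeLines(toy, FILE)

# Step 1: skip the first "TABLE NO." line and read everything else as text.
# The repeated header lines come in as ordinary (character) rows.
x <- read.table(FILE, as.is = TRUE, skip = 1, fill = TRUE, header = TRUE)

# Step 2: convert every column to numeric; header text turns into NA.
# as.numeric() warns about the coercion, which is expected here.
x[] <- suppressWarnings(lapply(x, as.numeric))

# Step 3: keep only rows whose first column parsed as a number.
x <- x[!is.na(x[, 1]), ]

str(x)
unlink(FILE)
```

Note that step 3 relies on the first column never being legitimately missing in the data rows; if it could be, a different column (or a test across all columns) would be needed.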

Other versions involved readLines, followed by edits, followed by cat (or 
write) to a temp file, then read.table again.  
The overhead of invoking readLines, write/cat, and read.table was 
substantially larger than that of the read.table / as.numeric / indexing 
strategy.
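
For comparison, the readLines-based alternatives had roughly the following shape.  This is my reconstruction under assumptions, not any respondent's actual code: the helper name read_via_readlines, the pattern used to spot the "TABLE NO." lines, and the assumption that every repeated column-name line is identical to the first one are all hypothetical.

```r
# Hypothetical helper: readLines -> drop repeated headers -> temp file -> read.table
read_via_readlines <- function(file) {
  txt <- readLines(file)
  header <- txt[2]                                   # column names follow the first "TABLE NO." line
  is_table  <- grepl("^[[:space:]]*TABLE NO\\.", txt) # assumed marker for the first header line
  is_header <- txt == header                         # assumes repeats match the first header exactly
  body <- txt[!(is_table | is_header)]

  tmp <- tempfile()
  on.exit(unlink(tmp))
  writeLines(c(header, body), tmp)                   # one clean header plus the data rows
  read.table(tmp, header = TRUE)                     # columns now parse as numeric directly
}

# Toy input in the same layout (values made up for illustration).
hdr <- " PTID        TIME        AMT         IPRED"
toy <- c(
  "TABLE NO.  1", hdr,
  "  2.0010E+03  3.9375E-01  5.0000E+03  0.0000E+00",
  "  2.0010E+03  8.9583E-01  5.0000E+03  3.3389E+00",
  "TABLE NO.  2", hdr,
  "  2.0020E+03  1.4583E+00  5.0000E+03  5.8164E+00"
)
FILE <- tempfile(fileext = ".tab")
writeLines(toy, FILE)

y <- read_via_readlines(FILE)
str(y)
unlink(FILE)
```

Wrapping each approach in system.time() on the same large file is one way to repeat the comparison; the sketch passes over the data three times (read, write, read again), which is consistent with the extra overhead I saw.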

Thanks for the input from many folks.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
