Re: [R] speeding read.table

Rui Barradas Thu, 18 Oct 2012 11:30:26 -0700

Hello,

Time is down by a factor of 4. It still takes some minutes: 2 minutes for a file of 380 MB / 3.6M lines. So maybe system commands (maybe awk?) can do the job better.


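fun <- function(infile, outfile, lines = 10000L){
    # Drop the "TABLE NO." and column-header lines. grepl() is used rather
    # than x[-c(i1, i2)]: with no matches that index is empty and selects
    # nothing, so a chunk without header lines would come back empty.
    remove <- function(x) x[ !grepl("TABLE|COL", x) ]
    fin <- file(infile, open = "rt")
    on.exit(close(fin))
    while(TRUE){
        # readLines returns character(0) at end of file, not an error
        x <- readLines(fin, n = lines)
        if(length(x) == 0) return(invisible(NULL))
        y <- remove(x[ x != "" ])
        if(length(y) == 0) next    # chunk held only header or blank lines
        lst <- lapply(strsplit(y, " "), function(.y)
            as.numeric(.y[ .y != "" ]))
        mat <- do.call(rbind, lst)
        # append this chunk's cleaned rows to the output file
        write.table(mat, outfile, append = TRUE, row.names = FALSE,
            col.names = FALSE)
    }
}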

fun("test", "clean")
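
On the idea of handing the filtering to a system command, here is a minimal sketch. It uses grep rather than awk, and it assumes a Unix-like system with grep on the PATH; the temporary file and its three-column contents are made up for illustration and are not the real data. read.table reads grep's output through pipe(), so no cleaned copy is written to disk. It has not been timed against the function above.

```r
# Build a tiny file in the thread's layout (hypothetical demo data).
infile <- tempfile()
writeLines(c("TABLE NO.  1",
             "COL1 COL2 COL3",
             "1.0010E+05 0.0000E+00 1.0000E+00",
             "TABLE NO.  2",
             "COL1 COL2 COL3",
             "1.0010E+05 2.4000E+01 -1.0000E+00"), infile)

# grep -v drops every line that contains TABLE or COL.
# Assumption: a Unix-like shell with grep available.
cmd <- paste("grep -v -E 'TABLE|COL'", shQuote(infile))

# read.table opens the pipe, reads the remaining lines and closes it.
dat <- read.table(pipe(cmd), colClasses = "numeric")
```

Declaring colClasses spares read.table from guessing the column types, which usually helps on large inputs.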

Hope this helps,

Rui Barradas
On 18-10-2012 18:14, Rui Barradas wrote:

Hello,
The problem doesn't seem to be memory swaps. I've tried with a 380 MB file (3.6M lines) and it took around 8.5 minutes. I'll think of something else and write back.
Rui Barradas
On 18-10-2012 16:42, Fisher Dennis wrote:
Rui
I tried something similar to this. To my surprise, it was quite slow (it is still running after many minutes). I suspect that textConnection is a slow process compared to actually reading from the drive. It is possible that the object is so large that it is being swapped in and out of virtual memory; however, this machine has 12 GB RAM, so this seems unlikely.
Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

On Oct 18, 2012, at 8:35 AM, Rui Barradas wrote:
Hello,

Try the following, reading your file into 'x' using readLines.



tc <- textConnection("
TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
")

x <- readLines(tc)
close(tc)

#------------------------ starts here
x <- x[ x != "" ]

i1 <- grep("TABLE", x)
i2 <- grep("COL", x)
y <- x[-c(i1, i2)]

tc <- textConnection(y)
dat <- read.table(tc)
close(tc)

cnames <- unlist(strsplit(x[2], " "))
names(dat) <- cnames[cnames != ""]
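
One caveat with the x[-c(i1, i2)] idiom, shown as a small sketch on made-up lines: when grep finds no match, the index is empty, and an empty index selects nothing, so all lines are lost instead of all being kept. This cannot happen with the sample above, which always contains header lines, but it can when a file is processed in pieces. A logical filter built with grepl avoids the problem.

```r
# Two data lines and no header lines (hypothetical input).
x <- c("1.0010E+05 0.0000E+00", "1.0010E+05 2.4000E+01")

i1 <- grep("TABLE", x)    # no match: integer(0)
i2 <- grep("COL", x)      # no match: integer(0)

# -integer(0) is still integer(0), so this selects nothing.
dropped <- x[-c(i1, i2)]

# A logical mask keeps every line that does not match either pattern.
kept <- x[!grepl("TABLE|COL", x)]
```

Here dropped has length 0, while kept still holds both data lines.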


Hope this helps,

Rui Barradas
On 18-10-2012 14:57, Fisher Dennis wrote:
R 2.15.1
OS X

Colleagues,
I am reading a 1 GB file into R using read.table. The file consists of 100 tables, each of which is headed by two lines of characters.
The first of these lines is:
    TABLE NO.  1
The second is a list of column headers.

For example:
TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
Later something similar appears:
TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines.
read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option).
My workaround has been to:
    1.  read the table with readLines
    2.  remove the problematic lines
    3.  write the file to disk
    4.  read the file with read.table.
However, this process is slow.
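
For reference, the four steps above can be sketched as follows. This is only an illustration on a tiny made-up file; the temporary file names and the three-column data are assumptions, not the real 1 GB input.

```r
# Hypothetical stand-in for the large input file.
raw <- tempfile()
cleaned <- tempfile()
writeLines(c("TABLE NO.  1",
             "COL1 COL2 COL3",
             "1.0010E+05 0.0000E+00 1.0000E+00",
             "TABLE NO.  2",
             "COL1 COL2 COL3",
             "1.0010E+05 2.4000E+01 -1.0000E+00"), raw)

x <- readLines(raw)                    # 1. read the table with readLines
keep <- x[!grepl("TABLE|COL", x)]      # 2. remove the problematic lines
writeLines(keep, cleaned)              # 3. write the file to disk
dat <- read.table(cleaned)             # 4. read the file with read.table
```

Every line passes through memory and the disk twice in this scheme, which is consistent with the slowness reported.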
I thought about using "comment.char" as a means of avoiding reading the problematic lines. However, comment.char does not accept a pattern such as "[A-Z]".
Are there any clever workarounds for this?

Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

