Hello,

The time is down by a factor of 4. It still takes a couple of minutes: about 2 minutes for a 380 MB file with 3.6M lines. So maybe system commands (awk, perhaps?) can do the job better.

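fun <- function(infile, outfile, lines = 10000L){
    # Drop the "TABLE NO." and column-header lines. Logical indexing is used
    # because x[-c(i1, i2)] returns an empty vector when nothing matches,
    # which would end the loop at the first chunk without header lines.
    remove <- function(x) x[ !grepl("TABLE|COL", x) ]
    fin <- file(infile, open = "rt")
    on.exit(close(fin))
    while(TRUE){
        # Read the next chunk; readLines returns character(0) at end of file
        x <- try(readLines(fin, n = lines))
        if(inherits(x, "try-error") || length(x) == 0) return(NULL)
        y <- remove(x[ x != "" ])
        # A chunk holding only header or blank lines is skipped, not treated as EOF
        if(length(y) == 0) next
        # Split each line on spaces and convert the non-empty tokens to numbers
        lst <- lapply(strsplit(y, " "), function(.y)
            as.numeric(.y[ .y != "" ]))
        mat <- do.call(rbind, lst)
        write.table(mat, outfile, append = TRUE, row.names = FALSE, col.names = FALSE)
    }
}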

fun("test", "clean")
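
On the idea of letting a system command do the filtering, here is a minimal sketch. It assumes a Unix-like system with grep on the PATH and that no data line contains the strings TABLE or COL. The helper name read_without_headers is made up for illustration, and this has not been timed on the 380 MB file.

```r
# Hypothetical helper: an external 'grep' drops the header lines
# before R parses anything (assumes grep is available on the PATH).
read_without_headers <- function(infile, pattern = "TABLE|COL") {
    # -E: extended regular expression, -v: keep only non-matching lines
    cmd <- paste("grep -E -v", shQuote(pattern), shQuote(infile))
    # read.table opens the pipe, reads the filtered stream and closes it;
    # blank lines are skipped by default (blank.lines.skip = TRUE)
    read.table(pipe(cmd), header = FALSE)
}

# Small self-contained check with a temporary file
tmp <- tempfile()
writeLines(c("TABLE NO.  1",
             " COL1 COL2 COL3",
             " 1.0E+00 2.0E+00 3.0E+00",
             "",
             "TABLE NO.  2",
             " COL1 COL2 COL3",
             " 4.0E+00 5.0E+00 6.0E+00"), tmp)
dat <- read_without_headers(tmp)
unlink(tmp)
```

Because the filtering happens outside R, the intermediate character vector from readLines is never created.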

Hope this helps,

Rui Barradas
On 18-10-2012 18:14, Rui Barradas wrote:
Hello,

The problem doesn't seem to be memory swapping. I've tried with a 380 MB file (3.6M lines) and it took around 8.5 minutes. I'll think of something else and write back.

Rui Barradas
On 18-10-2012 16:42, Fisher Dennis wrote:
Rui

I tried something similar to this. To my surprise, it was quite slow (it is still running after many minutes). I suspect that textConnection is slow compared to actually reading from the drive. It is possible that the object is so large that it is being swapped in and out of virtual memory. However, this machine has 12 GB RAM, so this seems unlikely.
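
One way to test the textConnection suspicion is to parse the kept lines without any connection at all. The following is only a sketch: the helper name lines_to_matrix is hypothetical, and it assumes every data line is purely numeric with the same number of fields.

```r
# Hypothetical helper: turn space-separated numeric lines into a matrix
# without going through textConnection or read.table.
lines_to_matrix <- function(lines, ncol) {
    tokens <- unlist(strsplit(lines, " +"))
    tokens <- tokens[nzchar(tokens)]   # drop empty tokens from leading spaces
    matrix(as.numeric(tokens), ncol = ncol, byrow = TRUE)
}

y <- c(" 1.0010E+05 0.0000E+00 1.0000E+00",
       " 1.0010E+05 1.0001E+01 1.0000E+00")
m <- lines_to_matrix(y, ncol = 3)
```

Wrapping the call in system.time() and comparing against read.table(textConnection(y)) on the real data would show whether the connection is the bottleneck.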

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

On Oct 18, 2012, at 8:35 AM, Rui Barradas wrote:

Hello,

Try the following, reading your file into 'x' using readLines.



tc <- textConnection("
TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00

TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00
")

x <- readLines(tc)
close(tc)

#------------------------ starts here
x <- x[ x != "" ]

i1 <- grep("TABLE", x)
i2 <- grep("COL", x)
y <- x[-c(i1, i2)]

tc <- textConnection(y)
dat <- read.table(tc)
close(tc)

cnames <- unlist(strsplit(x[2], " "))
names(dat) <- cnames[cnames != ""]


Hope this helps,

Rui Barradas
On 18-10-2012 14:57, Fisher Dennis wrote:
R 2.15.1
OS X

Colleagues,

I am reading a 1 GB file into R using read.table. The file consists of 100 tables, each of which is headed by two lines of characters.
The first of these lines is:
    TABLE NO.  1
The second is a list of column headers.

For example:
TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00

Later something similar appears:
TABLE NO.  1
COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00

I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines.

read.table cannot read the file because of these problematic lines (I get around the first "TABLE NO." line using the skip option).

My workaround has been to:
    1.  read the table with readLines
    2.  remove the problematic lines
    3.  write the file to disk
    4.  read the file with read.table.
However, this process is slow.
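
For reference, the four steps above could be sketched roughly as follows. This is an illustration, not the code actually used: the function name clean_and_read is hypothetical, and the pattern assumes the header lines are the only ones containing TABLE or COL. The colClasses argument is included because the read.table documentation notes that specifying it reduces the work needed to guess column types.

```r
# Hypothetical sketch of the four-step workaround
clean_and_read <- function(infile) {
    x <- readLines(infile)                      # 1. read with readLines
    keep <- x[!grepl("TABLE|COL", x)]           # 2. remove problematic lines
    tmp <- tempfile()
    on.exit(unlink(tmp))
    writeLines(keep, tmp)                       # 3. write the file to disk
    read.table(tmp, header = FALSE,             # 4. read with read.table
               colClasses = "numeric")
}

# Small self-contained check with a temporary input file
infile <- tempfile()
writeLines(c("TABLE NO.  1",
             " COL1 COL2",
             " 1.0010E+05 0.0000E+00",
             " 1.0010E+05 1.0001E+01",
             "TABLE NO.  2",
             " COL1 COL2",
             " 1.0010E+05 2.4000E+01"), infile)
dat <- clean_and_read(infile)
unlink(infile)
```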

I thought about using "comment.char" as a means of avoiding reading the problematic lines. However, comment.char does not accept a pattern such as "[A-Z]".

Are there any clever workarounds for this?

Dennis



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

