Hello,
The version below cuts the time by a factor of 4. It still takes a
while, about 2 minutes for a 380 MB file (3.6M lines), so system
commands (awk, perhaps?) might do the job better.
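fun <- function(infile, outfile, lines = 10000L){
    # Drop the "TABLE NO." and column-header lines. Logical indexing is
    # used because x[-integer(0)] returns an empty vector when a chunk
    # contains no header lines.
    remove <- function(x) x[ !grepl("TABLE|COL", x) ]
    fin <- file(infile, open = "rt")
    on.exit(close(fin))
    while(TRUE){
        # Read the file in chunks of 'lines' lines
        x <- try(readLines(fin, n = lines))
        if(inherits(x, "try-error")) return(NULL)
        if(length(x) == 0) return(NULL)   # end of file
        y <- remove(x[ x != "" ])
        if(length(y) == 0) next   # chunk held only header or blank lines
        # Split on blanks and convert each line to numeric
        lst <- lapply(strsplit(y, " "), function(.y)
            as.numeric(.y[ .y != "" ]))
        mat <- do.call(rbind, lst)
        # Append the cleaned rows to the output file
        write.table(mat, outfile, append = TRUE, row.names = FALSE,
            col.names = FALSE)
    }
}
fun("test", "clean")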
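As a rough sketch of the "system commands" idea, the header lines can be filtered out by grep before R parses anything, with read.table reading from a pipe connection. This uses grep rather than awk, assumes a Unix-like system with grep on the PATH, and assumes each record sits on one line of the real file. The helper name read_without_headers and the pattern "TABLE|COL" are illustrative guesses based on the sample, not something tested on the 380 MB file.

```r
# Hypothetical helper: let grep drop the header lines, then parse the rest.
# Assumes a Unix-like system with grep available on the PATH.
read_without_headers <- function(infile, pattern = "TABLE|COL") {
  # -v inverts the match, -E enables the "|" alternation
  cmd <- paste("grep -v -E", shQuote(pattern), shQuote(infile))
  # read.table() opens the pipe connection and closes it when done;
  # blank lines are skipped by default (blank.lines.skip = TRUE)
  read.table(pipe(cmd))
}
```

It would be called as dat <- read_without_headers("test"). Whether this beats the chunked version above would need to be timed on the real file.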
Hope this helps,
Rui Barradas
On 18-10-2012 18:14, Rui Barradas wrote:
Hello,
The problem doesn't seem to be memory swapping. I've tried with a 380 MB
file (3.6M lines) and it took around 8.5 minutes. I'll think of
something else and write back.
Rui Barradas
On 18-10-2012 16:42, Fisher Dennis wrote:
Rui
I tried something similar to this. To my surprise, it was quite slow
(it is still running after many minutes). I suspect that
textConnection is slow compared to actually reading from
the drive. It is possible that the object is so
large that it is being swapped in and out of virtual memory;
however, this machine has 12 GB RAM, so this seems unlikely.
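One way to check this suspicion on a small subset is to time both routes on the same lines. The sketch below is only a measuring harness: the helper name compare_read_routes is made up for illustration, and it makes no claim about which route wins.

```r
# Hypothetical helper: time read.table() fed from a textConnection
# against read.table() fed from a temporary file holding the same lines.
compare_read_routes <- function(y) {
  t_conn <- system.time({
    tc <- textConnection(y)
    d1 <- read.table(tc)
    close(tc)
  })
  t_file <- system.time({
    tmp <- tempfile()
    writeLines(y, tmp)
    d2 <- read.table(tmp)
    unlink(tmp)
  })
  # Both routes should produce the same data frame
  stopifnot(isTRUE(all.equal(d1, d2)))
  # Elapsed seconds for each route
  c(textConnection = t_conn[["elapsed"]], tempfile = t_file[["elapsed"]])
}
```

For example, compare_read_routes(y[1:100000]) on the already cleaned lines 'y' gives the two elapsed times side by side.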
Dennis
Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
On Oct 18, 2012, at 8:35 AM, Rui Barradas wrote:
Hello,
Try the following, reading your file into 'x' using readLines.
tc <- textConnection("
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6
COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
0.0000E+00 0.0000E+00
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6
COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
0.0000E+00 0.0000E+00
")
x <- readLines(tc)
close(tc)
#------------------------ starts here
x <- x[ x != "" ]
# Logical indexing stays correct even if no header line matches
y <- x[ !grepl("TABLE|COL", x) ]
tc <- textConnection(y)
dat <- read.table(tc)
close(tc)
cnames <- unlist(strsplit(x[2], " "))
names(dat) <- cnames[cnames != ""]
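If the records really are wrapped over several physical lines, as they appear in this message (that may just be the mail client's wrapping), a possible variant is to read the numbers with scan, which ignores line breaks, and reshape them into a matrix. This is a minimal sketch on two records from the sample. The variable names are illustrative, and the column count is taken from the COL lines rather than from anything known about the real file.

```r
# Sample lines in the layout shown in the thread (an assumption: each
# 12-value record is wrapped over three physical lines).
x <- c(
  "TABLE NO. 1",
  " COL1 COL2 COL3 COL4 COL5 COL6",
  " COL7 COL8 COL9 COL10 COL11 COL12",
  " 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00",
  " 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00",
  " 0.0000E+00 0.0000E+00",
  "TABLE NO. 1",
  " COL1 COL2 COL3 COL4 COL5 COL6",
  " COL7 COL8 COL9 COL10 COL11 COL12",
  " 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00",
  " 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08",
  " 0.0000E+00 0.0000E+00"
)

is_header <- grepl("TABLE|COL", x)

# Column names: collect the tokens on the COL lines, dropping repeats
cnames <- unique(unlist(strsplit(trimws(x[grepl("COL", x)]), " +")))

# scan() reads whitespace-separated numbers and ignores line breaks
vals <- scan(text = x[!is_header], quiet = TRUE)

# One row per record, filled row by row
mat <- matrix(vals, ncol = length(cnames), byrow = TRUE,
              dimnames = list(NULL, cnames))
```

as.data.frame(mat) would give a data frame comparable to 'dat' above.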
Hope this helps,
Rui Barradas
On 18-10-2012 14:57, Fisher Dennis wrote:
R 2.15.1
OS X
Colleagues,
I am reading a 1 GB file into R using read.table. The file
consists of 100 tables, each of which is headed by two lines of
characters.
The first of these lines is:
TABLE NO. 1
The second is a list of column headers.
For example:
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6
COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
0.0000E+00 0.0000E+00
Later something similar appears:
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6
COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00 0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08
0.0000E+00 0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13
0.0000E+00 0.0000E+00
I will use the term "problematic lines" to refer to the repeated
occurrences of the two non-data lines.
read.table cannot read the file because of these
problematic lines (I get around the first "TABLE NO." line using
the skip option).
My workaround has been to:
1. read the table with readLines
2. remove the problematic lines
3. write the file to disk
4. read the file with read.table.
However, this process is slow.
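For reference, the four steps above can be sketched as a single function. The name clean_and_read and the pattern "TABLE|COL" are illustrative assumptions based on the sample, and the sketch assumes each record sits on one line of the file.

```r
# Hypothetical helper following the four steps listed above.
clean_and_read <- function(infile, cleaned = tempfile()) {
  x <- readLines(infile)                     # 1. read the raw lines
  keep <- !grepl("TABLE|COL", x) & x != ""   # 2. flag the problematic lines
  writeLines(x[keep], cleaned)               # 3. write the cleaned file to disk
  read.table(cleaned)                        # 4. read the cleaned file
}
```

The whole file is held in memory as a character vector in step 1, and the data then make a second trip through the disk in steps 3 and 4.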
I thought about using "comment.char" as a means of avoiding reading
the problematic lines. However, comment.char accepts only a single
character, not a pattern such as "[A-Z]".
Are there any clever workarounds for this?
Dennis
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.