Dear R-help,

I have a very large ASCII data file, of which I only want to read in selected 
lines (e.g. one fourth of the lines); which lines to keep depends on each 
line's content. So far, I have found two approaches for doing this in R: 1) read 
the file line by line using a repeat-loop and save the result in a temporary 
file or a variable, and 2) read the entire file and filter/reshape it using 
*apply methods.
To my understanding, repeat{}-loops are quite slow in R, and reading an entire 
file only to discard three quarters of the data is overkill, not to mention 
that it means loading a 650 MB text file into memory.
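
To make the first approach concrete, here is a rough sketch of what I have in 
mind. It reads the file in chunks rather than one line at a time, so the loop 
runs once per chunk instead of once per line. The names read_selected, selector 
and chunk_size are made up for illustration, and the selector is assumed to be 
vectorised (it takes a character vector and returns a logical vector).

```r
# Hypothetical helper: keep only the lines for which `selector` returns TRUE,
# reading the file in chunks so the whole file is never held in memory.
read_selected <- function(path, selector, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                      # always release the connection
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break         # end of file reached
    kept[[length(kept) + 1L]] <- chunk[selector(chunk)]
  }
  as.character(unlist(kept, use.names = FALSE))
}

# Illustrative use on a small temporary file: keep lines starting with "A".
tmp <- tempfile()
writeLines(c("A 1", "B 2", "A 3", "C 4"), tmp)
selected <- read_selected(tmp, function(x) grepl("^A", x), chunk_size = 2L)
```

This still uses a repeat-loop, but with a large chunk size the loop body runs 
comparatively few times, and only the selected lines are kept.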

What I am looking for is a function that works like the first approach but 
avoids explicit R-level loops; I imagine it would be implemented in a 
lower-level language to be more efficient. Naturally, when calling the 
function, one would provide a function that determines if/how each line should 
be appended to a variable.
Alternatively, an object working as a generator (in Python terms) could be 
used with the normal *apply functions. I imagine this working differently from 
e.g. sapply(readLines("myfile.txt"), FUN=selector): there, "readLines" is 
executed first, loading the entire file into memory and supplying it to 
sapply, whereas a generator object would only read a line when sapply requests 
the next element.
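
The closest thing to such a generator that I can sketch myself is a closure 
around an open connection, as below. The names line_generator and next_line 
are hypothetical, and as far as I can tell the *apply functions cannot consume 
such an object directly, so the sketch still needs a loop at the R level. It 
only illustrates the interface I am after, not the speed.

```r
# Hypothetical generator-like object: a closure around an open connection.
# Each call to next_line() reads exactly one line; NULL signals end of file.
line_generator <- function(path) {
  con <- file(path, open = "r")
  list(
    next_line = function() {
      line <- readLines(con, n = 1L)
      if (length(line) == 0L) NULL else line
    },
    close = function() close(con)
  )
}

# Illustrative use: keep only the lines starting with "keep".
tmp <- tempfile()
writeLines(c("keep 1", "drop 2", "keep 3"), tmp)
gen <- line_generator(tmp)
kept <- character(0)
while (!is.null(line <- gen$next_line())) {
  if (grepl("^keep", line)) kept <- c(kept, line)
}
gen$close()
```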

Are there options for this kind of operation?

Kind regards,

Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
PhD student                     Faculty of Agricultural Sciences
stefan.hoj-edwa...@agrsci.dk    Aarhus University
Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
Web: www.iysik.com              DK-8830 Tjele
                                Tel.: +45 8999 1900
                                Web: www.agrsci.au.dk

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.