2011/9/14 Stefan McKinnon Høj-Edwards <stefan.hoj-edwa...@agrsci.dk>:
> Dear R-help,
>
> I have a very large ASCII data file, of which I only want to read in selected
> lines (e.g. one fourth of the lines); which lines to keep depends on the
> lines' content. So far, I have found two approaches for doing this in R: 1)
> read the file line by line using a repeat-loop and save the result in a
> temporary file or a variable, and 2) read the entire file and filter/reshape
> it using *apply methods.
> To my understanding, repeat{}-loops are quite slow in R, and reading an
> entire file only to discard three quarters of the data is overkill. Not to
> mention loading a 650MB text file into memory.
>
> What I am looking for is a function that works like the first approach but
> avoids do- or repeat-loops, so I imagine it would be implemented in a
> lower-level language to be more efficient. Naturally, when calling the
> function, one would provide a function that determines if/how the line
> should be appended to a variable.
> Alternatively, an object working as a generator (in Python terms) could be
> used with the normal *apply functions. I imagine this working differently
> from e.g. sapply(readLines("myfile.txt"), FUN=selector), in that "readLines"
> would be executed first, loading the entire file into memory and supplying it
> to sapply, whereas the generator object only reads a line when sapply
> requests the next element.
>
> Are there options for this kind of operation?
>
read.csv.sql in the sqldf package can read a file and deliver just a
subset to R. The desired portion is specified using SQL, and the
entire operation can be done in a single line of code. It can handle
files too large to read into R, since only the desired portion is ever
read into R itself. See Example 13 on the sqldf home page:

http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql

and also read ?read.csv.sql.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
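
For illustration, here is a minimal sketch of the kind of call described
above. The demo file, its column names (id, grp) and the filter condition
are all made up for the example and have nothing to do with the poster's
actual data; it also assumes a comma-separated file with a header line and
no quoted fields. Inside the SQL statement the input is referred to as the
table "file".

```r
library(sqldf)  # provides read.csv.sql

# Hypothetical demo file: 8 rows, of which only those with grp == 'a' are wanted.
demo_file <- tempfile(fileext = ".csv")
demo <- data.frame(id = 1:8, grp = rep(c("a", "b", "c", "d"), times = 2))
write.csv(demo, demo_file, row.names = FALSE, quote = FALSE)

# The file is loaded into a temporary SQLite database outside of R;
# only the rows selected by the SQL statement are returned as a data frame.
subset_df <- read.csv.sql(demo_file,
                          sql = "select * from file where grp = 'a'")

print(subset_df)
unlink(demo_file)
```

For a file with a different layout, the sep and header arguments of
read.csv.sql would probably need adjusting; ?read.csv.sql has the details.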