What is overkill about reading a 650MB text file into memory if you have the space? You are going to have to process it one way or another. I would use 'readLines' to read it in, 'grepl' to determine which lines to keep, drop the rest, and write the result to a new file. At that point you can use 'read.table' to process the new file. This works pretty fast if pattern matching is enough to decide which lines you want to keep.
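Something along these lines; an untested sketch where "myfile.txt",
"filtered.txt" and the pattern "^KEEP" are made-up placeholders:

## read everything into memory, keep only the matching lines,
## write them back out, then parse the reduced file
lines <- readLines("myfile.txt")
keep  <- grepl("^KEEP", lines)           # logical index of lines to retain
writeLines(lines[keep], "filtered.txt")
dat <- read.table("filtered.txt", header = FALSE)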
If you don't have the memory to read in the whole file, then set up a
loop and read in whatever amount makes sense (e.g., 100MB at a time),
do the filtering above on each chunk, and open the output file at the
beginning so that you keep appending to it (a rough sketch follows at
the end of this message). You probably need to state what kind of
criteria you would apply to the lines to decide whether to keep them.
You can also use perl, sed, awk, etc. to do the processing.

2011/9/14 Stefan McKinnon Høj-Edwards <stefan.hoj-edwa...@agrsci.dk>:
> Dear R-help,
>
> I have a very large ascii data file, of which I only want to read in
> selected lines (e.g. one fourth of the lines); determining which lines
> depends on the lines' content. So far, I have found two approaches for
> doing this in R: 1) read the file line by line using a repeat-loop and
> save the result in a temporary file or a variable, and 2) read the
> entire file and filter/reshape it using *apply methods.
> To my understanding, repeat{}-loops are quite slow in R, and reading
> an entire file only to discard three quarters of the data is a bit of
> an overkill, not to mention loading a 650MB text file into memory.
>
> What I am looking for is a function that works like the first
> approach but avoids do- or repeat-loops, so I imagine it being
> implemented in a lower-level language to be more efficient. Naturally,
> when calling the function, one would provide a function that
> determines if/how the line should be appended to a variable.
> Alternatively, an object working as a generator (in Python terms)
> could be used with the normal *apply functions. I imagine this working
> differently from e.g. sapply(readLines("myfile.txt"), FUN=selector),
> in that "readLines" would be executed first, loading the entire file
> into memory and supplying it to sapply, whereas the generator object
> only reads a line when sapply requests the next element.
>
> Are there options for this kind of operation?
>
> Kind regards,
>
> Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
> PhD student                     Faculty of Agricultural Sciences
> stefan.hoj-edwa...@agrsci.dk    Aarhus University
> Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
> Web: www.iysik.com              DK-8830 Tjele
>                                 Tel.: +45 8999 1900
>                                 Web: www.agrsci.au.dk

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
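The chunked version described at the top could look roughly like this;
again an untested sketch, with the file names, the pattern and the
100000-line chunk size as placeholders (readLines counts lines, not
bytes, so a line count stands in for "100MB at a time"):

## open both connections once, so filtered output keeps being appended
infile  <- file("myfile.txt", open = "r")
outfile <- file("filtered.txt", open = "w")
repeat {
  chunk <- readLines(infile, n = 100000)   # next block of lines
  if (length(chunk) == 0) break            # end of file reached
  writeLines(chunk[grepl("^KEEP", chunk)], outfile)
}
close(infile)
close(outfile)
dat <- read.table("filtered.txt", header = FALSE)

Reading from an open connection keeps the file position between calls,
so each readLines() picks up where the previous one stopped.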