What is overkill about reading a 650MB text file into memory if you have the space? You are going to have to process it one way or another. I would use 'readLines' to read it in, 'grepl' to determine which lines to keep, drop the rest, and write the result to a new file. At that point you can use 'read.table' to process the new file. This works pretty fast if pattern matching is enough to decide which lines you want to keep.
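Something along these lines; an untested sketch where "myfile.txt",
"filtered.txt" and the pattern "^KEEP" are made-up placeholders:

## read everything into memory, keep only the matching lines,
## write them back out, then parse the reduced file
lines <- readLines("myfile.txt")
keep  <- grepl("^KEEP", lines)           # logical index of lines to retain
writeLines(lines[keep], "filtered.txt")
dat <- read.table("filtered.txt", header = FALSE)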
If you don't have the memory to read in the whole file, then set up a
loop and read in whatever amount makes sense (e.g., 100MB at a time),
do the filtering above on each chunk, and open the output file at the
beginning so that you keep appending to it (a rough sketch follows at
the end of this message). You probably need to state what kind of
criteria you would apply to the lines to decide whether to keep them.
You can also use perl, sed, awk, etc. to do the processing.

2011/9/14 Stefan McKinnon Høj-Edwards <stefan.hoj-edwa...@agrsci.dk>:
> Dear R-help,
>
> I have a very large ascii data file, of which I only want to read in
> selected lines (e.g. one fourth of the lines); determining which lines
> depends on the lines' content. So far, I have found two approaches for
> doing this in R: 1) read the file line by line using a repeat-loop and
> save the result in a temporary file or a variable, and 2) read the
> entire file and filter/reshape it using *apply methods.
> To my understanding, repeat{}-loops are quite slow in R, and reading
> an entire file only to discard three quarters of the data is a bit of
> an overkill, not to mention loading a 650MB text file into memory.
>
> What I am looking for is a function that works like the first
> approach but avoids do- or repeat-loops, so I imagine it being
> implemented in a lower-level language to be more efficient. Naturally,
> when calling the function, one would provide a function that
> determines if/how the line should be appended to a variable.
> Alternatively, an object working as a generator (in Python terms)
> could be used with the normal *apply functions. I imagine this working
> differently from e.g. sapply(readLines("myfile.txt"), FUN=selector),
> in that "readLines" would be executed first, loading the entire file
> into memory and supplying it to sapply, whereas the generator object
> only reads a line when sapply requests the next element.
>
> Are there options for this kind of operation?
>
> Kind regards,
>
> Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
> PhD student                     Faculty of Agricultural Sciences
> stefan.hoj-edwa...@agrsci.dk    Aarhus University
> Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
> Web: www.iysik.com              DK-8830 Tjele
>                                 Tel.: +45 8999 1900
>                                 Web: www.agrsci.au.dk

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
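The chunked version described at the top could look roughly like this;
again an untested sketch, with the file names, the pattern and the
100000-line chunk size as placeholders (readLines counts lines, not
bytes, so a line count stands in for "100MB at a time"):

## open both connections once, so filtered output keeps being appended
infile  <- file("myfile.txt", open = "r")
outfile <- file("filtered.txt", open = "w")
repeat {
  chunk <- readLines(infile, n = 100000)   # next block of lines
  if (length(chunk) == 0) break            # end of file reached
  writeLines(chunk[grepl("^KEEP", chunk)], outfile)
}
close(infile)
close(outfile)
dat <- read.table("filtered.txt", header = FALSE)

Reading from an open connection keeps the file position between calls,
so each readLines() picks up where the previous one stopped.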