Hi David,

On 23 March 2009 at 15:09, Dylan Beaudette wrote:
| On Monday 23 March 2009, David Reiss wrote:
| > I have a very large tab-delimited file, too big to store in memory via
| > readLines() or read.delim(). Turns out I only need a few hundred of those
| > lines to be read in. If it were not so large, I could read the entire file
| > in and "grep" the lines I need. For such a large file; many calls to
| > read.delim() with incrementing "skip" and "nrows" parameters, followed by
| > grep() calls is very slow. I am aware of possibilities via SQLite; I would
| > prefer to not use that in this case.
| >
| > My question is...Is there a function for efficiently reading in a file
| > along the lines of read.delim(), which allows me to specify a filter (via
| > grep or something else) that tells the function to only read in certain
| > lines that match?
| >
| > If not, I would *love* to see a "filter" parameter added as an option to
| > read.delim() and/or readLines().
|
| How about pre-filtering before loading the data into R:
|
| grep -E 'your pattern here' your_file_here > your_filtered_file
|
| alternatively if you need to search in fields, see 'awk', and 'cut', or if you
| need to delete things see 'tr'.
|
| These tools come with any unix-like OS, and you can probably get them on
| windows without much effort.
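
The pre-filtering step quoted above can also be driven from inside R, so the whole thing stays in one script. What follows is only a rough sketch: the sample file, the column names and the pattern "^id|abc" are made up for illustration, and it assumes a Unix-like system with grep on the PATH.

```r
# Hypothetical sample data standing in for the large tab-delimited file.
big_file <- tempfile(fileext = ".tsv")
writeLines(c("id\tgene\tscore",
             "1\tabc\t0.5",
             "2\txyz\t0.9",
             "3\tabc\t0.1"), big_file)

# Pre-filter with grep: keep the header line plus every line containing "abc".
# The matching lines are written to a second (temporary) file.
filtered_file <- tempfile(fileext = ".tsv")
status <- system2("grep",
                  c("-E", shQuote("^id|abc"), shQuote(big_file)),
                  stdout = filtered_file)

# R only ever parses the lines that survived the filter.
hits <- read.delim(filtered_file, stringsAsFactors = FALSE)
```

The cost is an intermediate file on disk, which may or may not matter for a few hundred matching lines.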
Also note that read.delim() and friends all read from connections, and 'piped
expressions' (in the Unix shell command sense) can provide a source.

That way you can build an ad-hoc filter extension by running readLines() over
a pipe() connection. Consider this trivial example of grepping out section
headers from the R FAQ. We get everything twice because both the Table of
Contents and the actual section headers match:

R> readLines( pipe("awk '/^[0-9]+ / {print $1, $2, $3}' src/debian/R/R-alpha.20090320/doc/FAQ") )
 [1] "1 Introduction " "2 R Basics"      "3 R and"         "4 R Web"
 [5] "5 R Add-On"      "6 R and"         "7 R Miscellanea" "8 R Programming"
 [9] "9 R Bugs"        "1 Introduction " "2 R Basics"      "3 R and"
[13] "4 R Web"         "5 R Add-On"      "6 R and"         "7 R Miscellanea"
[17] "8 R Programming" "9 R Bugs"
R>

The regexp is simply 'digits at start of line followed by space', which skips
subsections like 1.1, 1.2, ...

Hth, Dirk

-- 
Three out of two people have difficulties with fractions.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
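
The same pipe() idea carries over from readLines() to read.delim(), which gets close to the "filter" parameter asked for in the original question. This is a minimal sketch rather than anything that ships with R: read_filtered() is a hypothetical helper name, the sample data and column names are invented, and it assumes a Unix-like system with awk available.

```r
# Hypothetical helper: read only the rows of a tab-delimited file that
# satisfy an awk expression, without loading the whole file into R.
read_filtered <- function(path, awk_expr, ...) {
  # -F'\t' tells awk to split fields on tabs; shQuote() protects the
  # expression and the file name from the shell.
  cmd <- paste("awk -F'\\t'", shQuote(awk_expr), shQuote(path))
  # read.delim() opens the pipe, reads from it and closes it again.
  read.delim(pipe(cmd), ...)
}

# Made-up sample data standing in for the large file.
big_file <- tempfile(fileext = ".tsv")
writeLines(c("id\tgene\tscore",
             "1\tabc\t0.5",
             "2\txyz\t0.9",
             "3\tabc\t0.1"), big_file)

# Keep the header (NR == 1) and every row whose second field is "abc".
hits <- read_filtered(big_file, 'NR == 1 || $2 == "abc"',
                      stringsAsFactors = FALSE)
```

Unlike a plain grep on the whole line, the awk expression filters on a specific field, and no intermediate file is written.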