Hi all,

Thanks a lot for your responses. I forgot to mention in my posting that I also
want to be as cross-platform as possible (and thus to avoid relying on external
calls to UNIX programs such as grep). I like the idea from Thomas:
> You certainly don't want to use repeated reads from the start of the file
> with skip=, but if you set up a file connection
>
>   fileconnection <- file("my.tsv", open="r")
>
> you can read from it incrementally with readLines() or read.delim() without
> going back to the start each time.

But doing this (with the required calls to grep()) is *MUCH* slower than
piping the file through UNIX "grep" -- so I think I will end up sticking with
that option and simply asking Windows users to install "grep" in order to use
my software.

--David

David J Reiss, PhD <http://dreiss.isb.googlepages.com>
Senior Research Scientist, Computational Biology, Baliga Lab <http://baliga.systemsbiology.net/>
Institute for Systems Biology <http://www.systemsbiology.org>
1441 N 34th St., Seattle, WA 98103 (USA)

On Tue, Mar 24, 2009 at 5:12 AM, Dirk Eddelbuettel <e...@debian.org> wrote:

> Hi David,
>
> On 23 March 2009 at 15:09, Dylan Beaudette wrote:
> | On Monday 23 March 2009, David Reiss wrote:
> | > I have a very large tab-delimited file, too big to store in memory via
> | > readLines() or read.delim(). It turns out I only need a few hundred of
> | > those lines. If it were not so large, I could read the entire file in
> | > and "grep" the lines I need. For such a large file, many calls to
> | > read.delim() with incrementing "skip" and "nrows" parameters, followed
> | > by grep() calls, are very slow. I am aware of possibilities via SQLite;
> | > I would prefer not to use that in this case.
> | >
> | > My question is: is there a function for efficiently reading in a file,
> | > along the lines of read.delim(), which allows me to specify a filter
> | > (via grep or something else) that tells the function to read in only
> | > the lines that match?
> | >
> | > If not, I would *love* to see a "filter" parameter added as an option
> | > to read.delim() and/or readLines().
> |
> | How about pre-filtering before loading the data into R:
> |
> |   grep -E 'your pattern here' your_file_here > your_filtered_file
> |
> | Alternatively, if you need to search in fields, see 'awk' and 'cut'; if
> | you need to delete things, see 'tr'.
> |
> | These tools come with any unix-like OS, and you can probably get them on
> | Windows without much effort.
>
> Also note that read.delim() and friends all read from connections, and
> 'piped expressions' (in the Unix shell command sense) can provide a source.
>
> That way you can build an ad-hoc filter extension by running readLines()
> over a pipe() connection. Consider this trivial example of grepping out
> section headers from the R FAQ. We get everything twice because of the
> Table of Contents and the actual section headers:
>
> R> readLines(pipe("awk '/^[0-9+] / {print $1, $2, $3}' src/debian/R/R-alpha.20090320/doc/FAQ"))
>  [1] "1 Introduction "  "2 R Basics"       "3 R and"          "4 R Web"
>  [5] "5 R Add-On"       "6 R and"          "7 R Miscellanea"  "8 R Programming"
>  [9] "9 R Bugs"         "1 Introduction "  "2 R Basics"       "3 R and"
> [13] "4 R Web"          "5 R Add-On"       "6 R and"          "7 R Miscellanea"
> [17] "8 R Programming"  "9 R Bugs"
> R>
>
> The regexp is simply 'digits at start of line followed by a space', which
> skips subsections like 1.1, 1.2, ...
>
> Hth, Dirk
>
> --
> Three out of two people have difficulties with fractions.
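P.S. In case it is useful to anyone searching the archives, here is roughly
the pure-R version I benchmarked -- a minimal sketch only, in which "my.tsv",
the pattern "mygene", and the chunk size of 10000 are placeholders for my
actual data:

  ## read the file in chunks over a connection, keeping only matching lines
  con <- file("my.tsv", open = "r")
  hits <- character(0)
  repeat {
    chunk <- readLines(con, n = 10000)  # next 10000 lines (fewer near EOF)
    if (length(chunk) == 0) break       # character(0) signals end of file
    hits <- c(hits, grep("mygene", chunk, value = TRUE))
  }
  close(con)
  ## parse the matching lines as tab-delimited fields
  ## (assumes at least one line matched)
  x <- read.delim(textConnection(hits), header = FALSE)

This stays within base R on any platform, but the repeated grep() calls over
the chunks are what made it so much slower for me than an external grep.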
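For comparison, the pipe()-based version I am sticking with is a one-liner; it
assumes an external "grep" binary is on the PATH (the file name and pattern
are again placeholders):

  ## the external process does the filtering; R only parses the matches
  x <- read.delim(pipe("grep 'mygene' my.tsv"), header = FALSE)

On Windows the same line works once a grep (e.g. from Rtools or GnuWin32) is
installed and on the PATH, which is the one extra requirement I will document
for my users.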