Hi all,

Thanks a lot for your responses. I forgot to mention in my posting that I also
want to be as cross-platform as possible (and thus to avoid relying on external
calls to UNIX programs such as grep). I like the idea from Thomas:
> You certainly don't want to use repeated reads from the start of the file
> with skip=, but if you set up a file connection
>
>   fileconnection <- file("my.tsv", open="r")
>
> you can read from it incrementally with readLines() or read.delim() without
> going back to the start each time.

But doing this (with the required calls to grep()) is *MUCH* slower than
piping the file through UNIX "grep" -- so I think I will end up sticking with
that option and simply asking Windows users to install "grep" in order to use
my software.

--David

David J Reiss, PhD <http://dreiss.isb.googlepages.com>
Senior Research Scientist, Computational Biology, Baliga Lab <http://baliga.systemsbiology.net/>
Institute for Systems Biology <http://www.systemsbiology.org>
1441 N 34th St., Seattle, WA 98103 (USA)

On Tue, Mar 24, 2009 at 5:12 AM, Dirk Eddelbuettel <e...@debian.org> wrote:

> Hi David,
>
> On 23 March 2009 at 15:09, Dylan Beaudette wrote:
> | On Monday 23 March 2009, David Reiss wrote:
> | > I have a very large tab-delimited file, too big to store in memory via
> | > readLines() or read.delim(). It turns out I only need a few hundred of
> | > those lines. If it were not so large, I could read the entire file in
> | > and "grep" the lines I need. For such a large file, many calls to
> | > read.delim() with incrementing "skip" and "nrows" parameters, followed
> | > by grep() calls, are very slow. I am aware of possibilities via SQLite;
> | > I would prefer not to use that in this case.
> | >
> | > My question is: is there a function for efficiently reading in a file,
> | > along the lines of read.delim(), which allows me to specify a filter
> | > (via grep or something else) that tells the function to read in only
> | > the lines that match?
> | >
> | > If not, I would *love* to see a "filter" parameter added as an option
> | > to read.delim() and/or readLines().
> |
> | How about pre-filtering before loading the data into R:
> |
> |   grep -E 'your pattern here' your_file_here > your_filtered_file
> |
> | Alternatively, if you need to search in fields, see 'awk' and 'cut'; if
> | you need to delete things, see 'tr'.
> |
> | These tools come with any unix-like OS, and you can probably get them on
> | Windows without much effort.
>
> Also note that read.delim() and friends all read from connections, and
> 'piped expressions' (in the Unix shell command sense) can provide a source.
>
> That way you can build an ad-hoc filter extension by running readLines()
> over a pipe() connection. Consider this trivial example of grepping out
> section headers from the R FAQ. We get everything twice because of the
> Table of Contents and the actual section headers:
>
> R> readLines(pipe("awk '/^[0-9+] / {print $1, $2, $3}' src/debian/R/R-alpha.20090320/doc/FAQ"))
>  [1] "1 Introduction "  "2 R Basics"       "3 R and"          "4 R Web"
>  [5] "5 R Add-On"       "6 R and"          "7 R Miscellanea"  "8 R Programming"
>  [9] "9 R Bugs"         "1 Introduction "  "2 R Basics"       "3 R and"
> [13] "4 R Web"          "5 R Add-On"       "6 R and"          "7 R Miscellanea"
> [17] "8 R Programming"  "9 R Bugs"
> R>
>
> The regexp is simply 'digits at start of line followed by a space', which
> skips subsections like 1.1, 1.2, ...
>
> Hth, Dirk
>
> --
> Three out of two people have difficulties with fractions.
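P.S. In case it is useful to anyone searching the archives, here is roughly
the pure-R version I benchmarked -- a minimal sketch only, in which "my.tsv",
the pattern "mygene", and the chunk size of 10000 are placeholders for my
actual data:

  ## read the file in chunks over a connection, keeping only matching lines
  con <- file("my.tsv", open = "r")
  hits <- character(0)
  repeat {
    chunk <- readLines(con, n = 10000)  # next 10000 lines (fewer near EOF)
    if (length(chunk) == 0) break       # character(0) signals end of file
    hits <- c(hits, grep("mygene", chunk, value = TRUE))
  }
  close(con)
  ## parse the matching lines as tab-delimited fields
  ## (assumes at least one line matched)
  x <- read.delim(textConnection(hits), header = FALSE)

This stays within base R on any platform, but the repeated grep() calls over
the chunks are what made it so much slower for me than an external grep.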
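For comparison, the pipe()-based version I am sticking with is a one-liner; it
assumes an external "grep" binary is on the PATH (the file name and pattern
are again placeholders):

  ## the external process does the filtering; R only parses the matches
  x <- read.delim(pipe("grep 'mygene' my.tsv"), header = FALSE)

On Windows the same line works once a grep (e.g. from Rtools or GnuWin32) is
installed and on the PATH, which is the one extra requirement I will document
for my users.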