Re: [R] read.csv and write.csv filtering for very big data ?

ivo welch Tue, 04 Jun 2013 21:11:26 -0700

thx, greg.

chunk boundaries have meaning.  the reader needs to stop and buffer one
line once it has crossed to the first line beyond the boundary.  it is also
a problem that read.csv can then no longer be applied to the file
directly---readLines has to do the reading.  (restarting read.csv over and
over again with a different skip argument is probably not a good idea for
big files.)  it takes a lot of smarts to append intelligently to a data
frame.  (if the input is a data matrix, this is much simpler, of course.)
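
A minimal sketch of such a boundary-aware reader, under several assumptions:
the name filter_csv and its block argument are hypothetical (no such function
ships with R), rows sharing a chunk value are contiguous, the file has a
header line and no blank lines or quoted fields containing newlines, and
FUNprocess returns a data frame. Instead of buffering exactly one line, it
carries the trailing, possibly incomplete group over to the next batch of
readLines output, which serves the same purpose.

```r
# Hypothetical sketch: filter_csv() is NOT an existing R function.
# in_con / out_con may be file names or unopened connections such as pipe(...).
filter_csv <- function(in_con, out_con, chunk, FUNprocess, block = 10000L) {
  if (is.character(in_con))  in_con  <- file(in_con)
  if (is.character(out_con)) out_con <- file(out_con)
  if (!isOpen(in_con))  open(in_con, "r")
  if (!isOpen(out_con)) open(out_con, "w")
  on.exit(close(in_con))
  on.exit(close(out_con), add = TRUE)

  header <- readLines(in_con, n = 1L)  # an open connection keeps its position
  carry  <- character(0)               # lines of the last, possibly unfinished group
  first  <- TRUE                       # write the output header only once

  repeat {
    new_lines <- readLines(in_con, n = block)
    eof   <- length(new_lines) == 0L
    lines <- c(carry, new_lines)
    if (length(lines) == 0L) break

    # Parse the buffered lines; types are inferred per batch (an assumption
    # that may not hold for messy data).
    d <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)

    # Find runs of identical chunk values, i.e. the contiguous groups.
    runs   <- rle(as.character(d[[chunk]]))
    ends   <- cumsum(runs$lengths)
    starts <- ends - runs$lengths + 1L

    # The last run may continue in the next batch, so hold it back
    # unless the input is exhausted.
    n_done <- if (eof) length(ends) else length(ends) - 1L
    for (i in seq_len(n_done)) {
      res <- FUNprocess(d[starts[i]:ends[i], , drop = FALSE])
      write.table(res, out_con, sep = ",", row.names = FALSE,
                  col.names = first, qmethod = "double")
      first <- FALSE
    }
    if (eof) break
    carry <- lines[starts[length(starts)]:length(lines)]
  }
  invisible(NULL)
}
```

Because the connection is opened once and never rewound, each input line is
read a single time; only the held-back group is parsed again in the next
batch.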


exporting large input files to sqlite databases makes sense when the same
file is used again and again, but probably not for a staged one-time
processing step.  the disk consumption is too big.

the writer could become quasi-threaded by writing to multiple temp files
and then concatenating at the end, but this would be a nasty
solution...nothing like the parsimonious elegance and generality that a
built-in R filter function could provide.
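
For illustration only, the temp-file idea might look roughly like the sketch
below. The name write_chunks_via_tempfiles is hypothetical, the chunks are
assumed to be in memory already as a list of data frames, and FUNprocess is
assumed to return a data frame with the same columns for every chunk.
mclapply comes from the parallel package and relies on forking, so mc.cores
greater than 1 only works on Unix-alikes.

```r
library(parallel)

# Hypothetical helper, not part of any package: process chunks in parallel,
# write each result to its own temp file, then concatenate in chunk order.
write_chunks_via_tempfiles <- function(chunks, out_file, FUNprocess,
                                       mc.cores = 1L) {
  # Create the temp file names in the parent so that workers cannot collide.
  tmp_files <- vapply(seq_along(chunks), function(i) {
    tempfile(pattern = sprintf("chunk%06d_", i), fileext = ".csv")
  }, character(1))

  # Each worker writes only to its own file, so no locking is needed.
  mclapply(seq_along(chunks), function(i) {
    write.table(FUNprocess(chunks[[i]]), tmp_files[i], sep = ",",
                row.names = FALSE, col.names = (i == 1L), qmethod = "double")
    NULL
  }, mc.cores = mc.cores)

  # Concatenate sequentially; the order of tmp_files is the chunk order.
  file.create(out_file)
  for (f in tmp_files) file.append(out_file, f)
  unlink(tmp_files)
  invisible(out_file)
}
```

The sketch does no error handling for failed workers, and it needs disk
space for the temp files in addition to the final output until the
concatenation step has finished.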

----
Ivo Welch (ivo.we...@gmail.com)



On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538...@gmail.com> wrote:

> Some possibilities using existing tools.
>
> If you create a file connection and open it before reading from it (or
> writing to it), then functions like read.table and read.csv (and
> write.table for a writable connection) will read from the connection, but
> not close and reset it.  This means that you could open 2 files, one for
> reading and one for writing, then read in a chunk, process it, write it
> out, then read in the next chunk, etc.
>
> Another option would be to read the data into an ff object (ff package) or
> into a database (SQLite for one) which could have the data accessed in
> chunks, possibly even in parallel.
>
>
> On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.we...@anderson.ucla.edu> wrote:
>
>> dear R wizards---
>>
>> I presume this is a common problem, so I thought I would ask whether
>> this solution already exists and if not, suggest it.  say, a user has
>> a data set of x GB, where x is very big---say, greater than RAM.
>> fortunately, data often come sequentially in groups, and there is a
>> need to process contiguous subsets of them and write the results to a
>> new file.  read.csv and write.csv only work on FULL data sets.
>> read.csv has the ability to skip n lines and read only m lines, but
>> this can cross the subsets.  the useful solution here would be a
>> "filter" function that understands about chunks:
>>
>>    filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>
>> a chunk would not exactly be a factor, because normal R factors can be
>> non-sequential in the data frame.  the filter.csv makes it very simple
>> to work on large data sets...almost SAS simple:
>>
>>    filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
>> function(d) colMeans(d))
>> or
>>    filter.csv( pipe('bzcat infile.csv.bz2'), pipe("bzip2 -c >
>> results.csv.bz2"), "date",
>> function(d) d[ !duplicated(d$date, fromLast = TRUE), ] )  ##
>> filter out observations that have the same date again later
>>
>> or some reasonable variant of this.
>>
>> now that I can have many small chunks, it would be nice if this were
>> threadsafe, so
>>
>>    mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>
>> with 'library(parallel)' could feed multiple cores the FUNprocess, and
>> make sure that the processes don't step on one another.  (why did R
>> not use a dot after "mc" for parallel lapply?)  presumably, to keep it
>> simple, mcfilter.csv would keep a counter of read chunks and block
>> writing chunks until the next chunk in sequence arrives.
>>
>> just a suggestion...
>>
>> /iaw
>>
>> ----
>> Ivo Welch (ivo.we...@gmail.com)
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Gregory (Greg) L. Snow Ph.D.
> 538...@gmail.com
>
