Hello all,

I saw the notes come through from today's call:

> * R Arrow Bindings?
>  - Find use cases within the R community, contributors needed
>  - R Feather bindings a useful starting point

This year I've been working on parallel R computations over datasets in
the 100+ GB range, and I've found that loading and saving data as text
files is a real bottleneck. Another consideration is breaking the data up
into chunks for parallel processing while maintaining metadata and overall
structure. So I've been watching Parquet and Arrow.
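
As a rough illustration of that bottleneck, here is a minimal sketch
comparing a CSV round trip to a Feather round trip. It assumes the CRAN
feather package and a hypothetical in-memory data frame big_df; timings
will of course vary with the data:

    ## Rough I/O comparison; `big_df` stands in for a large data frame
    ## that already fits in memory.
    library(feather)

    ## Text round trip: slow to parse, and column types must be re-inferred.
    system.time(write.csv(big_df, "big_df.csv", row.names = FALSE))
    system.time(csv_copy <- read.csv("big_df.csv"))

    ## Feather round trip: binary, columnar, and type-preserving.
    system.time(write_feather(big_df, "big_df.feather"))
    system.time(feather_copy <- read_feather("big_df.feather"))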

Specifically, here are two use cases in R where Arrow / Parquet could be
helpful:

- Splitting a large data set into pieces that fit comfortably in memory,
then applying ordinary R functions to each piece. This is basically a
GROUP BY; see the first sketch after this list.
- Matloff's Software Alchemy: fitting a statistical model on independent
chunks of the data and averaging the results. This requires rows to be
randomly assigned to chunks; see the second sketch below.
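
For the first use case, here is a minimal sketch of the split-and-apply
pattern using the feather package with one file per group. The on-disk
layout (one Feather file per group) is my own assumption, not an existing
Arrow API; the built-in mtcars data stands in for a large table:

    library(feather)

    chunk_dir <- tempfile("chunks")
    dir.create(chunk_dir)

    ## Split the data by group and write each piece to its own Feather file.
    pieces <- split(mtcars, mtcars$cyl)
    paths <- file.path(chunk_dir, paste0("cyl_", names(pieces), ".feather"))
    Map(write_feather, pieces, paths)

    ## Later (possibly on a worker process), read one chunk at a time and
    ## apply an ordinary R function to it.
    group_means <- lapply(paths, function(p) {
        chunk <- read_feather(p)
        colMeans(chunk)
    })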
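
And for the second, a base R sketch of the Software Alchemy idea (not
Matloff's partools implementation): rows are randomly assigned to chunks,
the same model is fit on each chunk independently, and the per-chunk
estimates are averaged. The per-chunk fits are where chunks read from
Parquet/Feather files in parallel would slot in:

    n_chunks <- 4
    set.seed(1)

    ## Randomly assign each row to one of the chunks.
    chunk_id <- sample(rep(seq_len(n_chunks), length.out = nrow(mtcars)))
    chunks <- split(mtcars, chunk_id)

    ## Fit the same model on each chunk and average the coefficients.
    fits <- lapply(chunks, function(d) coef(lm(mpg ~ wt + hp, data = d)))
    avg_coef <- Reduce(`+`, fits) / length(fits)
    avg_coef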

Another option, besides building on the R Feather bindings, is to begin
with an automatically generated set of bindings:
https://github.com/duncantl/RCodeGen

Best,
Clark Fitzgerald