Re: Use case for R Arrow Bindings

Bryan Cutler Fri, 21 Jul 2017 11:05:21 -0700

Thanks Clark.  I know that SparkR would benefit a lot from Arrow bindings
and many people would like to see that, but to my knowledge no one has
started working on this yet.  Please keep us updated with what you find!


Bryan

On Fri, Jul 21, 2017 at 9:15 AM, Clark Fitzgerald <[email protected]>
wrote:

> Regarding the R Consortium, the Distributed Computing Working Group led by
> Michael Lawrence would be interested in this. It would be nice to go to
> them with some working examples and use cases.
>
> Next week I will start looking into R / Arrow bindings. A couple other
> people at the UC Davis Data Science Initiative have expressed interest as
> well. I'll post updates here.
>
> On Wed, Jul 19, 2017 at 5:01 PM, Dean Chen <[email protected]> wrote:
>
> > Sounds good, will get a thread going there.
> >
> > On Wed, Jul 19, 2017 at 6:02 PM Wes McKinney <[email protected]> wrote:
> >
> > > Especially with Arrow support landing in Spark (SPARK-13534), it would
> > > be helpful to combine efforts between Python and R on this front. I
> > > also have a long list of improvements to the Feather format that will
> > > be substantially simpler once library(feather) is depending on the
> > > main Arrow libraries.
> > >
> > > I suggest you reach out to members of the R community directly on
> > > public forums about development help / advice and soliciting
> > > collaboration. There are other R venues where you can describe your
> > > use cases, like the R Consortium and its subcommittees:
> > > https://www.r-consortium.org/. I would go directly to the mailing
> > > lists and see if there is anyone who would like to get involved. It's
> > > more likely that you'll get attention on this problem in the R mailing
> > > lists than on the Arrow mailing list due to the chicken-and-egg
> > > aspect.
> > >
> > > As a side note, my opinion is that shared storage, memory formats, and
> > > computing libraries (e.g. native C++ libraries targeting Arrow memory)
> > > are going to be more and more important to the R / Python / Julia
> > > communities (and beyond -- Kou has been developing Arrow interfaces
> > > for Ruby, which has not traditionally had a large data science
> > > community) as time passes. I would like to personally do more on the R
> > > side but I simply don't have the bandwidth to take responsibility for
> > > another major component, especially not in an unfamiliar software
> > > development stack.
> > >
> > > Let me know how I can help, and if there are R mailing list
> > > discussions where we (the Arrow developers) can chime in please alert
> > > us to them here.
> > >
> > > - Wes
> > >
> > > On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <[email protected]> wrote:
> > > > > I also sent a note about it to the dev list a month ago. Still have a
> > > > > huge internal need and interested in helping push this along where we
> > > > > can. Unfortunately, our team is more focused around Spark and doesn't
> > > > > have much experience working with the R community.
> > > >
> > > > > On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <[email protected]> wrote:
> > > >
> > > >> Hello all,
> > > >>
> > > >> I saw the notes come through from today's call:
> > > >>
> > > >> > * R Arrow Bindings?
> > > >> >  - Find use cases within the R community, contributors needed
> > > >> >  - R Feather bindings a useful starting point
> > > >>
> > > > >> This year I've been working on parallel R on datasets in the 100+ GB
> > > > >> range, and have found that loading and saving data from text files is
> > > > >> a real bottleneck. Another consideration is breaking the data up into
> > > > >> chunks for parallel processing while maintaining metadata and overall
> > > > >> structure. So I've been watching Parquet and Arrow.
> > > >>
> > > > >> Specifically here are two use cases in R where Arrow / Parquet could
> > > > >> be helpful:
> > > >>
> > > > >> - Splitting up a large data set into pieces which fit comfortably in
> > > > >>   memory then applying normal R functions to each piece. Basically
> > > > >>   GROUP BY.
> > > > >> - Matloff's Software Alchemy, statistical averaging based on
> > > > >>   independent chunks of data. This requires rows to be randomly
> > > > >>   assigned to chunks.
> > > >>
> > > > >> Another option besides starting from the R Feather bindings is to
> > > > >> start with an automatically generated set of bindings:
> > > >> https://github.com/duncantl/RCodeGen
> > > >>
> > > >> Best,
> > > >> Clark Fitzgerald
> > > >>
> > > > --
> > > > VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> > > > <http://www.forbes.com/fintech/2016/#310668d56680>
> > > > 915 Broadway | Suite 502 | New York, NY 10010
> > > > (646)-838-2310 <(646)%20838-2310>
> > > > [email protected] | www.dv01.co
> > >
>
