+ Hadley

On Fri, Jul 21, 2017 at 2:04 PM, Bryan Cutler <cutl...@gmail.com> wrote:
> Thanks Clark.  I know that SparkR would benefit a lot from Arrow bindings
> and many people would like to see that, but to my knowledge no one has
> started working on this yet.  Please keep us updated with what you find!
>
> Bryan
>
> On Fri, Jul 21, 2017 at 9:15 AM, Clark Fitzgerald <clarkfi...@gmail.com>
> wrote:
>
>> Regarding the R Consortium, the Distributed Computing Working Group led by
>> Michael Lawrence would be interested in this. It would be nice to go to
>> them with some working examples and use cases.
>>
>> Next week I will start looking into R / Arrow bindings. A couple other
>> people at the UC Davis Data Science Initiative have expressed interest as
>> well. I'll post updates here.
>>
>> On Wed, Jul 19, 2017 at 5:01 PM, Dean Chen <d...@dv01.co> wrote:
>>
>> > Sounds good, will get a thread going there.
>> >
>> > On Wed, Jul 19, 2017 at 6:02 PM Wes McKinney <wesmck...@gmail.com>
>> > wrote:
>> >
>> > > Especially with Arrow support landing in Spark (SPARK-13534), it would
>> > > be helpful to combine efforts between Python and R on this front. I
>> > > also have a long list of improvements to the Feather format that will
>> > > be substantially simpler once library(feather) is depending on the
>> > > main Arrow libraries.
>> > >
>> > > I suggest you reach out to members of the R community directly on
>> > > public forums about development help / advice and soliciting
>> > > collaboration. There are other R venues where you can describe your
>> > > use cases, like the R Consortium and its subcommittees:
>> > > https://www.r-consortium.org/. I would go directly to the mailing
>> > > lists and see if there is anyone who would like to get involved. It's
>> > > more likely that you'll get attention on this problem in the R mailing
>> > > lists than on the Arrow mailing list due to the chicken-and-egg
>> > > aspect.
>> > >
>> > > As a side note, my opinion is that shared storage, memory formats, and
>> > > computing libraries (e.g. native C++ libraries targeting Arrow memory)
>> > > are going to be more and more important to the R / Python / Julia
>> > > communities (and beyond -- Kou has been developing Arrow interfaces
>> > > for Ruby, which has not traditionally had a large data science
>> > > community) as time passes. I would like to personally do more on the R
>> > > side but I simply don't have the bandwidth to take responsibility for
>> > > another major component, especially not in an unfamiliar software
>> > > development stack.
>> > >
>> > > Let me know how I can help, and if there are R mailing list
>> > > discussions where we (the Arrow developers) can chime in please alert
>> > > us to them here.
>> > >
>> > > - Wes
>> > >
>> > > On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <d...@dv01.co> wrote:
>> > > > I also sent a note about it to the dev list a month ago. We still have
>> > > > a huge internal need and are interested in helping push this along
>> > > > where we can. Unfortunately, our team is more focused on Spark and
>> > > > doesn't have much experience working with the R community.
>> > > >
>> > > > On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald
>> > > > <clarkfi...@gmail.com> wrote:
>> > > >
>> > > >> Hello all,
>> > > >>
>> > > >> I saw the notes come through from today's call:
>> > > >>
>> > > >> > * R Arrow Bindings?
>> > > >> >  - Find use cases within the R community, contributors needed
>> > > >> >  - R Feather bindings a useful starting point
>> > > >>
>> > > >> This year I've been working on parallel R on datasets in the 100+ GB
>> > > >> range, and have found that loading and saving data from text files is
>> > > >> a real bottleneck. Another consideration is breaking the data up into
>> > > >> chunks for parallel processing while maintaining metadata and overall
>> > > >> structure. So I've been watching Parquet and Arrow.
>> > > >>
>> > > >> Specifically here are two use cases in R where Arrow / Parquet could
>> > > >> be helpful:
>> > > >>
>> > > >> - Splitting up a large data set into pieces which fit comfortably in
>> > > >>   memory then applying normal R functions to each piece. Basically
>> > > >>   GROUP BY.
>> > > >> - Matloff's Software Alchemy, statistical averaging based on
>> > > >>   independent chunks of data. This requires rows to be randomly
>> > > >>   assigned to chunks.
>> > > >>
>> > > >> Another option besides starting from the R Feather bindings is to
>> > > >> start with an automatically generated set of bindings:
>> > > >> https://github.com/duncantl/RCodeGen
>> > > >>
>> > > >> Best,
>> > > >> Clark Fitzgerald
>> > > >>
>> > > > --
>> > > > VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
>> > > > <http://www.forbes.com/fintech/2016/#310668d56680>
>> > > > 915 Broadway | Suite 502 | New York, NY 10010
>> > > > (646)-838-2310
>> > > > d...@dv01.co | www.dv01.co
>> > >
>> >
>>
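
For anyone skimming the thread, the two use cases Clark describes above (splitting data into pieces by key, and Software Alchemy style averaging over randomly assigned chunks) can be sketched roughly as follows. This is only an illustration in plain Python using the standard library, not R and not anything built on Arrow or Parquet. The function names (split_by_key, random_chunks, chunk_average) and the toy data are hypothetical, and it assumes everything fits in memory, whereas in the real use case each chunk would presumably live in its own file and be processed in parallel.

```python
import random
from collections import defaultdict
from statistics import mean


def split_by_key(rows, key):
    """Use case 1: split rows (dicts) into pieces by the value of `key`."""
    pieces = defaultdict(list)
    for row in rows:
        pieces[row[key]].append(row)
    return dict(pieces)


def random_chunks(rows, n_chunks, seed=0):
    """Randomly assign rows to n_chunks chunks of near-equal size."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Taking every n_chunks-th row of the shuffled list yields random chunks.
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def chunk_average(rows, n_chunks, estimator, seed=0):
    """Use case 2: apply `estimator` to each random chunk, then average."""
    chunks = random_chunks(rows, n_chunks, seed)
    return mean(estimator(chunk) for chunk in chunks)


# Toy data (hypothetical): 100 rows split evenly between two groups.
rows = [{"group": "a" if i % 2 == 0 else "b", "x": float(i)} for i in range(100)]

# Use case 1: an ordinary function (here, the mean of x) applied per piece.
group_means = {
    g: mean(r["x"] for r in piece)
    for g, piece in split_by_key(rows, "group").items()
}

# Use case 2: average of per-chunk estimates over 4 random chunks.
estimate = chunk_average(rows, 4, lambda chunk: mean(r["x"] for r in chunk))
```

With equal-sized chunks and the mean as the estimator, the averaged estimate matches the full-data mean. For other estimators the chunk average is only an approximation, which is why the random assignment of rows matters.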
