Sounds good, will get a thread going there.

On Wed, Jul 19, 2017 at 6:02 PM Wes McKinney <[email protected]> wrote:

> Especially with Arrow support landing in Spark (SPARK-13534), it would
> be helpful to combine efforts between Python and R on this front. I
> also have a long list of improvements to the Feather format that will
> be substantially simpler once library(feather) is depending on the
> main Arrow libraries.
>
> I suggest you reach out to members of the R community directly on
> public forums about development help / advice and soliciting
> collaboration. There are other R venues where you can describe your
> use cases, like the R Consortium and its subcommittees:
> https://www.r-consortium.org/. I would go directly to the mailing
> lists and see if there is anyone who would like to get involved. It's
> more likely that you'll get attention on this problem in the R mailing
> lists than on the Arrow mailing list due to the chicken-and-egg
> aspect.
>
> As a side note, my opinion is that shared storage, memory formats, and
> computing libraries (e.g. native C++ libraries targeting Arrow memory)
> are going to be more and more important to the R / Python / Julia
> communities (and beyond -- Kou has been developing Arrow interfaces
> for Ruby, which has not traditionally had a large data science
> community) as time passes. I would like to personally do more on the R
> side but I simply don't have the bandwidth to take responsibility for
> another major component, especially not in an unfamiliar software
> development stack.
>
> Let me know how I can help, and if there are R mailing list
> discussions where we (the Arrow developers) can chime in please alert
> us to them here.
>
> - Wes
>
> On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <[email protected]> wrote:
> > I also sent a note about it to the dev list a month ago. Still have a huge
> > internal need and interested in helping push this along where we can.
> > Unfortunately, our team is more focused around Spark and doesn't have much
> > experience working with the R community.
> >
> > On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <[email protected]>
> > wrote:
> >
> >> Hello all,
> >>
> >> I saw the notes come through from today's call:
> >>
> >> > * R Arrow Bindings?
> >> > - Find use cases within the R community, contributors needed
> >> > - R Feather bindings a useful starting point
> >>
> >> This year I've been working on parallel R on datasets in the 100+ GB range,
> >> and have found that loading and saving data from text files is a real
> >> bottleneck. Another consideration is breaking the data up into chunks for
> >> parallel processing while maintaining metadata and overall structure. So
> >> I've been watching Parquet and Arrow.
> >>
> >> Specifically here are two use cases in R where Arrow / Parquet could be
> >> helpful:
> >>
> >> - Splitting up a large data set into pieces which fit comfortably in memory
> >> then applying normal R functions to each piece. Basically GROUP BY.
> >> - Matloff's Software Alchemy, statistical averaging based on independent
> >> chunks of data. This requires rows to be randomly assigned to chunks.
> >>
> >> Another option besides starting from the R Feather bindings is to start
> >> with an automatically generated set of bindings:
> >> https://github.com/duncantl/RCodeGen
> >>
> >> Best,
> >> Clark Fitzgerald
> >>
> > --
> > VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> > <http://www.forbes.com/fintech/2016/#310668d56680>
> > 915 Broadway | Suite 502 | New York, NY 10010
> > (646)-838-2310 <(646)%20838-2310>
> > [email protected] | www.dv01.co
>
--
VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
<http://www.forbes.com/fintech/2016/#310668d56680>
915 Broadway | Suite 502 | New York, NY 10010
(646)-838-2310
[email protected] | www.dv01.co
