hi Jonathan,

On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>
> Hi Wes and Romain,
>
> I wrote a preliminary benchmark for reading and writing different file
> types from R into Arrow, borrowing some code from Hadley. I would like
> some feedback to improve it, and then possibly to push an R/benchmarks
> folder. I am willing to dedicate most of next week to this project, as
> I am taking a vacation from work, and would like to contribute to
> Arrow and R.
>
> To Romain: What is the difference in R between using a tibble and
> reading from Arrow? Is the general advantage that you can serialize
> the data to Arrow when saving it, and then call it in Python with
> Arrow and then pandas?
Arrow has a language-independent binary protocol for data interchange
that does not require deserialization of data on read. It can be read
or written in many different ways: files, sockets, shared memory, etc.
How it gets used depends on the application.

> General Roadmap Question to Wes and Romain:
>
> My vision for the future of data science is the ability to serialize
> data securely, and to pass data and models securely, with some form
> of authentication, between IDEs over secure ports. This would develop
> with something similar to gRPC, with more security designed around
> sharing data. I noticed the Flight gRPC work.

Correct, our plan for RPC is to use gRPC for secure transport of
components of the Arrow columnar protocol. We'd love to have more
developers involved with this effort.

> Also, I was interested in whether there is any momentum in the R
> community to serialize models, similar to the work of ONNX, into a
> unified model storage system. The idea is to have a secure,
> reproducible environment for R and Python developer groups to readily
> share models and data, with the caveat that data sent also has added
> security and possibly a history associated with it. This piece of
> work is something I am passionate about seeing come to fruition, and
> I would like to explore options for making it happen.

Here we are focused on efficient handling and processing of datasets.
These tools could be used to build a model storage system if so
desired.

> The background for me is to enable healthcare teams to share medical
> data securely among different analytics teams. The security
> provisions would enable more robust cloud-based storage and
> computation in a secure fashion.

I would like to see deeper integration with cloud storage services in
2019 in the core C++ libraries, which would be made available in R,
Python, Ruby, etc.

- Wes

> Thanks,
> Jonathan
>
> Side note: building Arrow for R on Linux was a big hassle relative to
> Mac.
> Was unable to build on Linux.
>
> On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>>
>> I'll go through that Python repo and see what I can do.
>>
>> Thanks,
>> Jonathan
>>
>> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> I would suggest starting an r/benchmarks directory like we have in
>>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
>>> and documenting the process for running all the benchmarks.
>>>
>>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat> wrote:
>>> >
>>> > Right now, most of the example code is in the unit tests, but the
>>> > tests do not measure performance or stress it. Perhaps you can
>>> > start from there?
>>> >
>>> > Romain
>>> >
>>> > > On 15 Nov 2018, at 22:16, Wes McKinney <wesmck...@gmail.com> wrote:
>>> > >
>>> > > Adding dev@arrow.apache.org
>>> > >
>>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I would like to contribute to developing benchmark suites for R
>>> > >> and Arrow. What would be the best way to start?
>>> > >>
>>> > >> Thanks,
>>> > >> Jonathan
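(For the benchmark-suite discussion above, the general pattern is a small harness that times read/write for each format over several repetitions and keeps the best run. Below is a stdlib-only sketch of that pattern; the formats, row counts, and function names are invented for illustration, and a real suite would follow the conventions of python/benchmarks or its R equivalent.)

```python
# A minimal read/write benchmark harness pattern (stdlib only).
# The formats, dataset size, and names here are invented; this only
# illustrates the timing pattern a benchmarks directory would use.
import csv
import io
import pickle
import time


def bench(fn, repeat=5):
    """Return the best wall-clock time of `fn` over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# A small synthetic dataset to serialize.
rows = [(i, i * 0.5) for i in range(10_000)]


def write_csv():
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def write_pickle():
    return pickle.dumps(rows)


if __name__ == "__main__":
    for label, fn in [("csv", write_csv), ("pickle", write_pickle)]:
        print(f"{label:>6}: {bench(fn):.4f} s")
```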