hi Jonathan,

On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>
> Hi Wes and Romain,
>
> I wrote a preliminary benchmark for reading and writing different file
> types from R into Arrow, borrowing some code from Hadley. I would like
> some feedback to improve it, and then possibly to push an R/benchmarks
> folder. I am willing to dedicate most of next week to this project, as
> I am taking a vacation from work, and would like to contribute to
> Arrow and R.
>
> To Romain: What is the difference in R between using a tibble and
> reading from Arrow? Is the general advantage that you can serialize
> the data to Arrow when saving it, and then call it in Python with
> Arrow and then pandas?
Arrow has a language-independent binary protocol for data interchange
that does not require deserialization of data on read. It can be read
or written in many different ways: files, sockets, shared memory, etc.
How it gets used depends on the application.

> General Roadmap Question to Wes and Romain:
>
> My vision for the future of data science is the ability to serialize
> data securely, and to pass data and models securely, with some form
> of authentication, between IDEs over secure ports. This would develop
> with something similar to gRPC, with more security designed around
> sharing data. I noticed the Flight gRPC work.

Correct, our plan for RPC is to use gRPC for secure transport of
components of the Arrow columnar protocol. We'd love to have more
developers involved with this effort.

> Also, I was interested in whether there is any momentum in the R
> community to serialize models, similar to the work of ONNX, into a
> unified model storage system. The idea is to have a secure,
> reproducible environment for R and Python developer groups to readily
> share models and data, with the caveat that data sent also has added
> security and possibly a history associated with it. This piece of
> work is something I am passionate about seeing come to fruition, and
> I would like to explore options for making it happen.

Here we are focused on efficient handling and processing of datasets.
These tools could be used to build a model storage system if so
desired.

> The background for me is to enable healthcare teams to share medical
> data securely among different analytics teams. The security
> provisions would enable more robust cloud-based storage and
> computation in a secure fashion.

I would like to see deeper integration with cloud storage services in
2019 in the core C++ libraries, which would be made available in R,
Python, Ruby, etc.

- Wes

> Thanks,
> Jonathan
>
> Side note: building Arrow for R on Linux was a big hassle relative to
> Mac.
> Was unable to build on Linux.
>
> On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>>
>> I'll go through that Python repo and see what I can do.
>>
>> Thanks,
>> Jonathan
>>
>> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> I would suggest starting an r/benchmarks directory like we have in
>>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
>>> and documenting the process for running all the benchmarks.
>>>
>>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat> wrote:
>>> >
>>> > Right now, most of the example code is in the unit tests, but the
>>> > tests do not measure performance or stress it. Perhaps you can
>>> > start from there?
>>> >
>>> > Romain
>>> >
>>> > > On 15 Nov 2018, at 22:16, Wes McKinney <wesmck...@gmail.com> wrote:
>>> > >
>>> > > Adding dev@arrow.apache.org
>>> > >
>>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>>> > >>
>>> > >> Hi,
>>> > >>
>>> > >> I would like to contribute to developing benchmark suites for R
>>> > >> and Arrow. What would be the best way to start?
>>> > >>
>>> > >> Thanks,
>>> > >> Jonathan
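(For the benchmark-suite discussion above, the general pattern is a small harness that times read/write for each format over several repetitions and keeps the best run. Below is a stdlib-only sketch of that pattern; the formats, row counts, and function names are invented for illustration, and a real suite would follow the conventions of python/benchmarks or its R equivalent.)

```python
# A minimal read/write benchmark harness pattern (stdlib only).
# The formats, dataset size, and names here are invented; this only
# illustrates the timing pattern a benchmarks directory would use.
import csv
import io
import pickle
import time


def bench(fn, repeat=5):
    """Return the best wall-clock time of `fn` over `repeat` runs."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# A small synthetic dataset to serialize.
rows = [(i, i * 0.5) for i in range(10_000)]


def write_csv():
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def write_pickle():
    return pickle.dumps(rows)


if __name__ == "__main__":
    for label, fn in [("csv", write_csv), ("pickle", write_pickle)]:
        print(f"{label:>6}: {bench(fn):.4f} s")
```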