Arrow support would be an obvious win for BigQuery. I've spoken with people at Google Cloud about this on several occasions. With the gRPC / Flight work coming along, it might be a good opportunity to rekindle the discussion. If anyone from GCP is reading, or if you know anyone at GCP who might be able to work with us, I would be very interested.

One hurdle for BigQuery is that, as I understand it, Google has policies in place that make it more difficult to take on external library dependencies in a sensitive system like Dremel / BigQuery. So someone from Google might have to develop an in-house Arrow implementation sufficient to send Arrow datasets from BigQuery to clients. The scope of that project is small enough (requiring only Flatbuffers as a dependency) that a motivated C or C++ developer at Google ought to be able to get it done in a month or two of focused work.
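To make the scope concrete: the deliverable is just bytes in the Arrow IPC stream format, which a client can then read with no row-by-row parsing at all. A minimal sketch of the round trip from R (a sketch assuming the current arrow R package API; a temp file stands in for bytes arriving over the wire):

    library(arrow)

    # Stand-in for the server side: serialize a query result as an Arrow
    # IPC stream. In the real scenario BigQuery itself would produce
    # these bytes.
    result <- data.frame(id = 1:5, score = rnorm(5))
    tmp <- tempfile(fileext = ".arrows")
    write_ipc_stream(result, tmp)

    # Client side: the stream is read straight into a data frame. There
    # is no intermediate JSON (or other text) parsing step.
    df <- read_ipc_stream(tmp)

Compare that with the JSON path Hadley describes below, where every single value has to be parsed out of text.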
- Wes

On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>
> Hi Wes,
>
> I am currently working a lot with Google BigQuery in R and Python. Hadley Wickham listed this as a big bottleneck for his library bigrquery:
>
> "The bottleneck for loading BigQuery data is now parsing BigQuery's JSON format, which is difficult to optimise further because I'm already using the fastest C++ JSON parser, RapidJson. If this is still too slow (because you download a lot of data), see ?bq_table_download for an alternative approach."
>
> Is there any momentum for Arrow to partner with Google here?
>
> Thanks,
> Jonathan
>
> On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> hi Jonathan,
>>
>> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>> >
>> > Hi Wes and Romain,
>> >
>> > I wrote a preliminary benchmark for reading and writing different file types from R into Arrow, borrowing some code from Hadley. I would like some feedback to improve it, and then to possibly push an r/benchmarks folder. I am willing to dedicate most of next week to this project, as I am taking a vacation from work, and I would like to contribute to Arrow and R.
>> >
>> > To Romain: What is the difference in R between using a tibble and reading from Arrow? Is the general advantage that you can serialize the data to Arrow when saving it, and can then call it from Python with Arrow and then pandas?
>>
>> Arrow has a language-independent binary protocol for data interchange that does not require deserialization of data on read. It can be read or written in many different ways: files, sockets, shared memory, etc. How it gets used depends on the application.
>>
>> > General roadmap question to Wes and Romain:
>> > My vision for the future of data science is the ability to serialize data securely, and to pass data and models with some form of authentication between IDEs over secure ports. This idea would develop with something similar to gRPC, with more security designed around sharing data. I noticed the Flight gRPC work.
>>
>> Correct, our plan for RPC is to use gRPC for secure transport of components of the Arrow columnar protocol. We'd love to have more developers involved with this effort.
>>
>> > Also, I was interested in whether there is any momentum in the R community to serialize models, similar to the work of ONNX, into a unified model storage system. The idea is to have a secure, reproducible environment for R and Python developer groups to readily share models and data, with the caveat that data sent also has added security and possibly a history associated with it. This piece of work is something I am passionate about seeing come to fruition, and I would like to explore options for making it a reality.
>>
>> Here we are focused on efficient handling and processing of datasets. These tools could be used to build a model storage system if so desired.
>>
>> > The background for me is enabling healthcare teams to share medical data securely among different analytics teams. The security provisions would enable more robust cloud-based storage and computation in a secure fashion.
>>
>> I would like to see deeper integration with cloud storage services in 2019 in the core C++ libraries, which would then be made available in R, Python, Ruby, etc.
>>
>> - Wes
>>
>> > Thanks,
>> > Jonathan
>> >
>> > Side note: building Arrow for R on Linux was a big hassle relative to Mac; I was unable to build on Linux.
>> >
>> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>> >>
>> >> I'll go through that Python repo and see what I can do.
>> >>
>> >> Thanks,
>> >> Jonathan
>> >>
>> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >>>
>> >>> I would suggest starting an r/benchmarks directory like we have in Python (https://github.com/apache/arrow/tree/master/python/benchmarks) and documenting the process for running all the benchmarks.
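>> >>> As a starting point, a minimal sketch of the kind of read benchmark that could live there (assuming the current arrow R package API and the bench package for timing; both are assumptions rather than settled choices):
>> >>>
>> >>>     library(arrow)
>> >>>     library(bench)
>> >>>
>> >>>     # Write the same million-row table in two formats, then compare
>> >>>     # how long each takes to read back.
>> >>>     df <- data.frame(x = rnorm(1e6),
>> >>>                      y = sample(letters, 1e6, replace = TRUE))
>> >>>     csv_path <- tempfile(fileext = ".csv")
>> >>>     feather_path <- tempfile(fileext = ".feather")
>> >>>     write.csv(df, csv_path, row.names = FALSE)
>> >>>     write_feather(df, feather_path)
>> >>>
>> >>>     bench::mark(
>> >>>       csv     = read.csv(csv_path),
>> >>>       feather = read_feather(feather_path),
>> >>>       check = FALSE  # return classes differ (data.frame vs tibble)
>> >>>     )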
>> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat> wrote:
>> >>> >
>> >>> > Right now, most of the code examples are in the unit tests, but those do not measure performance or stress it. Perhaps you can start from there?
>> >>> >
>> >>> > Romain
>> >>> >
>> >>> > > On Nov 15, 2018, at 10:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >>> > >
>> >>> > > Adding dev@arrow.apache.org
>> >>> > >
>> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>> >>> > >>
>> >>> > >> Hi,
>> >>> > >>
>> >>> > >> I would like to contribute to developing benchmark suites for R and Arrow. What would be the best way to start?
>> >>> > >>
>> >>> > >> Thanks,
>> >>> > >> Jonathan