Thanks Micah for the update. The continued investment in Apache Avro is interesting given the low level of activity in that community. I'm optimistic that BQ will offer native Arrow export at some point in the future, perhaps after we reach a "1.0.0" release.
- Wes

On Sat, Feb 23, 2019 at 12:17 PM Jonathan Chiang <chiang...@gmail.com> wrote:
>
> Hi Micah,
>
> Yes, I filed the feature request based on your advice. I will look more into Avro for my own BigQuery use cases. Thanks for following up.
>
> Best,
> Jonathan
>
> On Feb 22, 2019, at 8:35 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Just to follow up on this thread, a new high-throughput API [1] for reading data out of BigQuery was released to public beta today. The format it streams is Avro, so it should perform better than parsing JSON (and reads can be parallelized). Implementing Avro reading was something I was going to start working on in the next week or so, and I'll probably continue on to add support to Arrow C++ for the new API (I will be creating JIRAs soon). Given my current bandwidth (I contribute to Arrow in my free time), this will take a while, so if people are interested in collaborating (or taking this over) please let me know.
>
> Also, it looks like someone took my advice and filed a feature request [2] for surfacing Apache Arrow natively.
>
> Thanks,
> Micah
>
> [1] https://cloud.google.com/bigquery/docs/reference/storage/
> [2] https://issuetracker.google.com/issues/124858094
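[Editor's note: a minimal sketch of decoding one of the Storage API's Avro row blocks in Python, assuming the layout described in the docs linked above: each block carries a count of consecutive schema-less Avro records packed into a single byte string, with the schema delivered separately when the read session is created. The function name and its arguments are hypothetical, not part of the API.]

    import io
    import json

    import fastavro

    def decode_block(avro_schema_json, serialized_binary_rows, row_count):
        """Decode one Avro row block into a list of Python dicts."""
        schema = fastavro.parse_schema(json.loads(avro_schema_json))
        buf = io.BytesIO(serialized_binary_rows)
        # The block holds `row_count` schema-less records back to back;
        # each schemaless_reader call consumes exactly one record.
        return [fastavro.schemaless_reader(buf, schema) for _ in range(row_count)]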
> On Wed, Feb 13, 2019 at 1:25 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > Would someone like to make some feature requests to Google, or engage with them in another way? I have interacted with GCP in the past; I think it would be helpful for them to hear from other Arrow users or community members, since I have been quite public as a carrier of the Arrow banner.
> >
> > On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > Disclaimer: I work for Google (not on BQ). Everything I'm going to write reflects my own opinions, not those of my company.
> > >
> > > Jonathan and Wes,
> > >
> > > One way of trying to get support for this is filing a feature request at [1] and getting broader customer support for it. Another possible way of gaining broader exposure within Google is collaborating with other open source projects that it contributes to. For instance, there was a conversation recently about the potential use of Arrow on the Apache Beam mailing list [2]. I will try to post a link to this thread internally, but I can't make any promises and likely won't be able to give updates on progress.
> > >
> > > This is also very much my own opinion, but I think that in order to expose Arrow in a public API it would be nice to reach a stable major release (i.e. 1.0.0) and to ensure Arrow supports the BigQuery data types [3] (I think it mostly does, but date/time might be an issue).
> > >
> > > [1] https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
> > > [2] https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
> > > [3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> > >
> > > On Monday, February 4, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >
> > > > Arrow support would be an obvious win for BigQuery. I've spoken with people at Google Cloud about this on several occasions.
> > > >
> > > > With the gRPC / Flight work coming along, it might be a good opportunity to rekindle the discussion. If anyone from GCP is reading, or if you know anyone at GCP who might be able to work with us, I would be very interested.
> > > >
> > > > One hurdle for BigQuery is that, as I understand it, Google has policies in place that make it difficult to take on external library dependencies in a sensitive system like Dremel / BigQuery. So someone from Google might have to develop an in-house Arrow implementation sufficient to send Arrow datasets from BigQuery to clients. The scope of that project is small enough (requiring only Flatbuffers as a dependency) that a motivated C or C++ developer at Google ought to be able to get it done in a month or two of focused work.
> > > >
> > > > - Wes
> > > >
> > > > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> > > > >
> > > > > Hi Wes,
> > > > >
> > > > > I am currently working a lot with Google BigQuery in R and Python. Hadley Wickham listed this as a big bottleneck for his library bigrquery:
> > > > >
> > > > > "The bottleneck for loading BigQuery data is now parsing BigQuery's JSON format, which is difficult to optimise further because I'm already using the fastest C++ JSON parser, RapidJson. If this is still too slow (because you download a lot of data), see ?bq_table_download for an alternative approach."
> > > > >
> > > > > Is there any momentum for Arrow to partner with Google here?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jonathan
> > > > >
> > > > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > >
> > > > > > hi Jonathan,
> > > > > >
> > > > > > On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi Wes and Romain,
> > > > > > >
> > > > > > > I wrote a preliminary benchmark for reading and writing different file types from R into Arrow, borrowing some code from Hadley. I would like some feedback to improve it, and then possibly to push an r/benchmarks folder. I am willing to dedicate most of next week to this project, as I am taking a vacation from work, and would like to contribute to Arrow and R.
> > > > > > >
> > > > > > > To Romain: What is the difference in R between using a tibble and reading from Arrow? Is the general advantage that you can serialize the data to Arrow when saving it, and then call it from Python with Arrow and then pandas?
> > > > > >
> > > > > > Arrow has a language-independent binary protocol for data interchange that does not require deserialization of data on read. It can be read or written in many different ways: files, sockets, shared memory, etc. How it gets used depends on the application.
> > > > > >
> > > > > > > General roadmap question to Wes and Romain: My vision for the future of data science is the ability to serialize data securely, and to pass data and models, with some form of authentication, between IDEs over secure ports. This would develop into something similar to gRPC, with more security designed around sharing data. I noticed Flight uses gRPC.
> > > > > >
> > > > > > Correct, our plan for RPC is to use gRPC for secure transport of components of the Arrow columnar protocol. We'd love to have more developers involved with this effort.
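[Editor's note: a minimal sketch of the binary protocol Wes describes, written against today's pyarrow IPC stream API (the module layout has changed since this thread). The same byte stream could travel over a file, a socket, shared memory, or a Flight/gRPC transport.]

    import pyarrow as pa

    # Write one record batch to the Arrow IPC stream format in memory.
    batch = pa.record_batch([pa.array([1, 2, 3])], names=["x"])
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)
    buf = sink.getvalue()

    # Reading the stream back does not deserialize the column data;
    # batches reference the underlying buffer directly.
    reader = pa.ipc.open_stream(buf)
    for received in reader:
        print(received.num_rows)  # -> 3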
> > > > > > > Also, I was interested whether there is any momentum in the R community to serialize models, similar to the work of ONNX, into a unified model storage system. The idea is to have a secure, reproducible environment for R and Python developer groups to readily share models and data, with the caveat that data sent also has added security and possibly a history associated with it. This piece of work is something I am passionate about seeing come to fruition, and I would like to explore options for making it a reality.
> > > > > >
> > > > > > Here we are focused on efficient handling and processing of datasets. These tools could be used to build a model storage system if so desired.
> > > > > >
> > > > > > > The background for me is to enable healthcare teams to share medical data securely among different analytics teams. The security provisions would enable more robust cloud-based storage and computation in a secure fashion.
> > > > > >
> > > > > > I would like to see deeper integration with cloud storage services in the core C++ libraries in 2019, which would then be made available in R, Python, Ruby, etc.
> > > > > >
> > > > > > - Wes
> > > > > >
> > > > > > > Thanks,
> > > > > > > Jonathan
> > > > > > >
> > > > > > > Side note: Building Arrow for R on Linux was a big hassle relative to Mac; I was unable to build on Linux.
> > > > > > >
> > > > > > > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > I'll go through that Python repo and see what I can do.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Jonathan
> > > > > > > >
> > > > > > > > On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > I would suggest starting an r/benchmarks directory like we have in Python (https://github.com/apache/arrow/tree/master/python/benchmarks) and documenting the process for running all the benchmarks.
> > > > > > > > >
> > > > > > > > > On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat> wrote:
> > > > > > > > > >
> > > > > > > > > > Right now, most of the code examples are in the unit tests, but those don't measure performance or stress it. Perhaps you can start from there?
> > > > > > > > > >
> > > > > > > > > > Romain
> > > > > > > > > >
> > > > > > > > > > On 15 Nov 2018, at 22:16, Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Adding dev@arrow.apache.org
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > I would like to contribute to developing benchmark suites for R and Arrow. What would be the best way to start?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Jonathan
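[Editor's note: the python/benchmarks directory Wes references uses airspeed velocity (asv) conventions: a class's setup() builds the inputs, and methods named time_* are the timed units. A minimal sketch of such a class, as a template an r/benchmarks suite could mirror; the class name and workload here are illustrative, not taken from the repository.]

    import numpy as np
    import pyarrow as pa

    class ConvertNumpyToArrow:
        """asv-style benchmark for numpy -> Arrow array conversion."""

        def setup(self):
            # Built once per benchmark run, outside the timed region.
            self.data = np.random.randn(1_000_000)

        def time_array_from_numpy(self):
            # asv times repeated calls of this method and reports the results.
            pa.array(self.data)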