Hi All,

The thread about Apache Arrow and Google Cloud support did get some traction! Thanks to Micah for his suggestions.

If everyone can star this issue, we could get more visibility. I'm guessing it would be a huge win if Wes responded to the thread.

https://issuetracker.google.com/issues/124858094

Thanks,
Jonathan

On Tue, Feb 26, 2019 at 7:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
> Thanks Micah for the update.
>
> The continued investment in Apache Avro is interesting given the
> low-activity state of that community. I'm optimistic that BQ will
> offer native Arrow export at some point in the future, perhaps after
> we reach a "1.0.0" release.
>
> - Wes
>
>
> On Sat, Feb 23, 2019 at 12:17 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >
> > Hi Micah,
> >
> > Yes, I filed the feature request on your advice. I will look more into
> > Avro for my own BigQuery use cases. Thanks for following up.
> >
> > Best,
> > Jonathan
> >
> > On Feb 22, 2019, at 8:35 PM, Micah Kornfield <emkornfi...@gmail.com> wrote:
> >
> > Just to follow up on this thread, a new high-throughput API [1] for
> > reading data out of BigQuery was released to public beta today. The
> > format it streams is Avro, so it should be higher performance than
> > parsing JSON (and reads can be parallelized). Implementing Avro reading
> > was something I was going to start working on in the next week or so,
> > and I'll probably continue on to add support to Arrow C++ for the new
> > API (I will be creating JIRAs soon). Given my current bandwidth (I
> > contribute to Arrow in my free time), this will take a while. So if
> > people are interested in collaborating (or taking this over), please
> > let me know.
> >
> > Also, it looks like someone took my advice and filed a feature request
> > [2] for surfacing Apache Arrow natively.
> >
> > Thanks,
> > Micah
> >
> > [1] https://cloud.google.com/bigquery/docs/reference/storage/
> > [2] https://issuetracker.google.com/issues/124858094
> >
> > On Wed, Feb 13, 2019 at 1:25 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> Would someone like to make some feature requests to Google or engage
> >> with them in another way? I have interacted with GCP in the past; I
> >> think it would be helpful for them to hear from other Arrow users or
> >> community members since I have been quite public as a carrier of the
> >> Arrow banner.
> >>
> >> On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> >
> >> > Disclaimer: I work for Google (not on BQ). Everything I'm going to write
> >> > reflects my own opinions, not those of my company.
> >> >
> >> > Jonathan and Wes,
> >> >
> >> > One way of trying to get support for this is filing a feature request at
> >> > [1] and getting broader customer support for it. Another possible way of
> >> > gaining broader exposure within Google is collaborating with other open
> >> > source projects that it contributes to. For instance there was a
> >> > conversation recently about the potential use of Arrow on the Apache Beam
> >> > mailing list [2]. I will try to post a link to this thread internally, but
> >> > I can't make any promises and likely not give any updates on progress.
> >> >
> >> > This is also very much my own opinion, but I think in order to expose Arrow
> >> > in a public API it would be nice to reach a stable major release (i.e.
> >> > 1.0.0) and ensure Arrow properly supports big query data-types
> >> > appropriately [3], (I think it mostly does but date/time might be an issue).
> >> >
> >> > [1] https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
> >> > [2] https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
> >> > [3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
> >> >
> >> >
> >> > On Monday, February 4, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
> >> >
> >> > > Arrow support would be an obvious win for BigQuery. I've spoken with
> >> > > people at Google Cloud about this on several occasions.
> >> > >
> >> > > With the gRPC / Flight work coming along it might be a good
> >> > > opportunity to rekindle the discussion. If anyone from GCP is reading
> >> > > or if you know anyone at GCP who might be able to work with us I would
> >> > > be very interested.
> >> > >
> >> > > One hurdle for BigQuery is that my understanding is that Google has
> >> > > policies in place that make it more difficult to take on external
> >> > > library dependencies in a sensitive system like Dremel / BigQuery. So
> >> > > someone from Google might have to develop an in-house Arrow
> >> > > implementation sufficient to send Arrow datasets from BigQuery to
> >> > > clients. The scope of that project is small enough (requiring only
> >> > > Flatbuffers as a dependency) that a motivated C or C++ developer at
> >> > > Google ought to be able to get it done in a month or two of focused
> >> > > work.
> >> > >
> >> > > - Wes
> >> > >
> >> > > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >> > > >
> >> > > > Hi Wes,
> >> > > >
> >> > > > I am currently working a lot with Google BigQuery in R and Python.
> >> > > > Hadley Wickham listed this as a big bottleneck for his library bigrquery.
> >> > > >
> >> > > > The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
> >> > > > format, which is difficult to optimise further because I’m already using
> >> > > > the fastest C++ JSON parser, RapidJson. If this is still too slow (because
> >> > > > you download a lot of data), see ?bq_table_download for an alternative
> >> > > > approach.
> >> > > >
> >> > > > Is there any momentum for Arrow to partner with Google here?
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jonathan
> >> > > >
> >> > > >
> >> > > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> > > >>
> >> > > >> hi Jonathan,
> >> > > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >> > > >> >
> >> > > >> > Hi Wes and Romain,
> >> > > >> >
> >> > > >> > I wrote a preliminary benchmark for reading and writing different
> >> > > >> > file types from R into Arrow, and borrowed some code from Hadley. I
> >> > > >> > would like some feedback to improve it and then possibly push an
> >> > > >> > R/benchmarks folder. I am willing to dedicate most of next week to
> >> > > >> > this project, as I am taking a vacation from work, but would like to
> >> > > >> > contribute to Arrow and R.
> >> > > >> >
> >> > > >> > To Romain: What is the difference in R when using tibble versus
> >> > > >> > reading from Arrow?
> >> > > >> > Is the general advantage that you can serialize the data to Arrow
> >> > > >> > when saving it? Then be able to call it in Python with Arrow then pandas?
> >> > > >>
> >> > > >> Arrow has a language-independent binary protocol for data interchange
> >> > > >> that does not require deserialization of data on read. It can be read
> >> > > >> or written in many different ways: files, sockets, shared memory, etc.
> >> > > >> How it gets used depends on the application.
> >> > > >>
> >> > > >> >
> >> > > >> > General Roadmap Question to Wes and Romain:
> >> > > >> > My vision for the future of data science is the ability to serialize
> >> > > >> > data securely and pass data and models securely with some form of
> >> > > >> > authentication between IDEs with secure ports. This idea would develop
> >> > > >> > with something similar to gRPC, with more security designed for
> >> > > >> > sharing data. I noticed Flight gRPC.
> >> > > >> >
> >> > > >>
> >> > > >> Correct, our plan for RPC is to use gRPC for secure transport of
> >> > > >> components of the Arrow columnar protocol. We'd love to have more
> >> > > >> developers involved with this effort.
> >> > > >>
> >> > > >> > Also, I was interested if there was any momentum in the R community
> >> > > >> > to serialize models similar to the work of ONNX into a unified model
> >> > > >> > storage system. The idea is to have a secure reproducible environment
> >> > > >> > for R and Python developer groups to readily share models and data,
> >> > > >> > with the caveat that data sent also has added security and possibly a
> >> > > >> > history associated with it for security. This piece of work is
> >> > > >> > something I am passionate about seeing come to fruition, and I would
> >> > > >> > like to explore options for realizing it.
> >> > > >> >
> >> > > >>
> >> > > >> Here we are focused on efficient handling and processing of datasets.
> >> > > >> These tools could be used to build a model storage system if so
> >> > > >> desired.
> >> > > >>
> >> > > >> > The background for me is to enable healthcare teams to share medical
> >> > > >> > data securely among different analytics teams. The security provisions
> >> > > >> > would enable more robust cloud-based storage and computation in a
> >> > > >> > secure fashion.
> >> > > >> >
> >> > > >>
> >> > > >> I would like to see deeper integration with cloud storage services in
> >> > > >> 2019 in the core C++ libraries, which would be made available in R,
> >> > > >> Python, Ruby, etc.
> >> > > >>
> >> > > >> - Wes
> >> > > >>
> >> > > >> > Thanks,
> >> > > >> > Jonathan
> >> > > >> >
> >> > > >> > Side Note:
> >> > > >> > Building Arrow for R on Linux was a big hassle relative to Mac. I was
> >> > > >> > unable to build on Linux.
> >> > > >> >
> >> > > >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >> > > >> >>
> >> > > >> >> I'll go through that python repo and see what I can do.
> >> > > >> >>
> >> > > >> >> Thanks,
> >> > > >> >> Jonathan
> >> > > >> >>
> >> > > >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> > > >> >>>
> >> > > >> >>> I would suggest starting an r/benchmarks directory like we have in
> >> > > >> >>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
> >> > > >> >>> and documenting the process for running all the benchmarks.
> >> > > >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat> wrote:
> >> > > >> >>> >
> >> > > >> >>> > Right now, most of the code examples are in the unit tests, but
> >> > > >> >>> > this is not measuring performance or stressing it. Perhaps you can
> >> > > >> >>> > start from there?
> >> > > >> >>> >
> >> > > >> >>> > Romain
> >> > > >> >>> >
> >> > > >> >>> > > On Nov 15, 2018, at 22:16, Wes McKinney <wesmck...@gmail.com> wrote:
> >> > > >> >>> > >
> >> > > >> >>> > > Adding dev@arrow.apache.org
> >> > > >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >> > > >> >>> > >>
> >> > > >> >>> > >> Hi,
> >> > > >> >>> > >>
> >> > > >> >>> > >> I would like to contribute to developing benchmark suites for
> >> > > >> >>> > >> R and Arrow. What would be the best way to start?
> >> > > >> >>> > >>
> >> > > >> >>> > >> Thanks,
> >> > > >> >>> > >> Jonathan