Re: Spark and Arrow Flight

2019-07-25 Thread David Li
Ah, I was just wondering what the status was, thanks for the info! I think that is a format change, so it would need to go through a vote here, though. Best, David On 7/25/19, Ryan Murray wrote: > Hey David, > > Yes I am. I have a 3/4 done patch ready to go, just got busy with a few > other thi

Re: Spark and Arrow Flight

2019-07-25 Thread Ryan Murray
Hey David, Yes I am. I have a 3/4 done patch ready to go, just got busy with a few other things. Are you hoping to use it soon? I would like to get to it this week but it's looking increasingly unlikely. Best, Ryan On Thu, Jul 25, 2019 at 7:37 PM David Li wrote: > Hey Ryan, > > To follow up on

Re: Spark and Arrow Flight

2019-07-25 Thread David Li
Hey Ryan, To follow up on this, are you planning on formally proposing the GetSchema() call in Flight? I think it'd be interesting to have beyond the Spark use case as finding the schema may or may not be expensive depending on the data stream (i.e. something computed on demand might require data t

Re: Spark and Arrow Flight

2019-07-10 Thread Wes McKinney
Of course, it might make just as much sense in Apache Spark. Probably worth bringing up with that community, too On Wed, Jul 10, 2019 at 12:37 PM Wes McKinney wrote: > > hi Ryan -- I was thinking that this might be built separately from the > main Java project. We don't have a model in the codeba

Re: Spark and Arrow Flight

2019-07-10 Thread Wes McKinney
hi Ryan -- I was thinking that this might be built separately from the main Java project. We don't have a model in the codebase yet for libraries that depend on the core libraries (this could be in an apps/ directory at the top level, so apps/spark-flight-source or something). So the development pr

Re: Spark and Arrow Flight

2019-07-10 Thread Ryan Murray
Hey Wes, Would be happy to! Jacques and I had originally thought to try and get it into Spark but perhaps Arrow might be a better home. I think the only issue is whether we want to bring Spark jars and their dependencies into Arrow. One challenge I have had so far with the connector is managing th

Re: Spark and Arrow Flight

2019-07-09 Thread Wes McKinney
Hi Ryan, have you thought about developing this inside Apache Arrow? On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler wrote: > Great, thanks Ryan! I'll take a look > > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray wrote: > > > Hi Bryan, > > > > I have an implementation of option #3 nearly ready for a PR.

Re: Spark and Arrow Flight

2019-07-09 Thread Bryan Cutler
Great, thanks Ryan! I'll take a look On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray wrote: > Hi Bryan, > > I have an implementation of option #3 nearly ready for a PR. I will mention > you when I publish it. > > The working prototype for the Spark connector is here: > https://github.com/rymurr/fligh

Re: Spark and Arrow Flight

2019-07-09 Thread Ryan Murray
Hi Bryan, I have an implementation of option #3 nearly ready for a PR. I will mention you when I publish it. The working prototype for the Spark connector is here: https://github.com/rymurr/flight-spark-source. It technically works (and is very fast!); however, the implementation is pretty dodgy an

Re: Spark and Arrow Flight

2019-07-09 Thread Bryan Cutler
I'm in favor of option #3 also, but not sure what the best thing to do with the existing FlightInfo response is. I'm definitely interested in connecting Spark with Flight, can you share more details of your work or is it planned to be open sourced? Thanks, Bryan On Tue, Jul 2, 2019 at 3:35 AM Ant

Re: Spark and Arrow Flight

2019-07-02 Thread Antoine Pitrou
Either #3 or #4 for me. If #3, the default GetSchema implementation can rely on calling GetFlightInfo. Le 01/07/2019 à 22:50, David Li a écrit : > I think I'd prefer #3 over overloading an existing call (#2). > > We've been thinking about a similar issue, where sometimes we want > just the sc
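
The fallback suggested above (a default GetSchema that relies on GetFlightInfo) might look roughly like the sketch below. It is an illustration only and does not use the real Arrow Flight API: `FlightInfo`, `FlightServiceBase`, `InMemoryService`, `CheapSchemaService`, the method names, and the example schema and endpoint strings are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FlightInfo:
    """Hypothetical stand-in for a GetFlightInfo response."""
    schema: Tuple[str, ...]      # placeholder for an Arrow schema
    endpoints: Tuple[str, ...]   # placeholder for endpoint locations


class FlightServiceBase:
    """Hypothetical service base class, not the real Flight server API."""

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        raise NotImplementedError

    def get_schema(self, descriptor: str) -> Tuple[str, ...]:
        # Default behaviour: derive the schema from the full FlightInfo.
        # Services that can produce a schema more cheaply override this.
        return self.get_flight_info(descriptor).schema


class InMemoryService(FlightServiceBase):
    """Implements only get_flight_info and inherits the default get_schema."""

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        return FlightInfo(
            schema=("id: int64", "name: utf8"),
            endpoints=("grpc://host-a:47470",),  # made-up location
        )


class CheapSchemaService(InMemoryService):
    """Overrides get_schema so that no FlightInfo has to be built."""

    def get_schema(self, descriptor: str) -> Tuple[str, ...]:
        return ("id: int64", "name: utf8")


default_schema = InMemoryService().get_schema("example")
cheap_schema = CheapSchemaService().get_schema("example")
```

With this shape, existing services keep working without changes, and only services where building the full FlightInfo is expensive need to provide their own GetSchema.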

Re: Spark and Arrow Flight

2019-07-01 Thread Wes McKinney
On Mon, Jul 1, 2019 at 3:50 PM David Li wrote: > > I think I'd prefer #3 over overloading an existing call (#2). > > We've been thinking about a similar issue, where sometimes we want > just the schema, but the service can't necessarily return the schema > without fetching data - right now we retu

Re: Spark and Arrow Flight

2019-07-01 Thread David Li
I think I'd prefer #3 over overloading an existing call (#2). We've been thinking about a similar issue, where sometimes we want just the schema, but the service can't necessarily return the schema without fetching data - right now we return a sentinel value in GetFlightInfo, but a separate RPC wo

Re: Spark and Arrow Flight

2019-07-01 Thread Wes McKinney
My inclination is either #2 or #3. #4 is an option of course, but I like the more structured solution of explicitly requesting the schema given a descriptor. In both cases, it's possible that schemas are sent twice, e.g. if you call GetSchema and then later call GetFlightInfo and so you receive th

Re: Spark and Arrow Flight

2019-06-28 Thread Jacques Nadeau
My initial inclination is towards #3 but I'd be curious what others think. In the case of #3, I wonder if it makes sense to then pull the Schema off the GetFlightInfo response... On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray wrote: > Hi All, > > I have been working on building an arrow flight sou

Spark and Arrow Flight

2019-06-28 Thread Ryan Murray
Hi All, I have been working on building an Arrow Flight source for Spark. The goal here is for Spark to be able to use a group of Arrow Flight endpoints to get a dataset pulled over to Spark in parallel. I am unsure of the best model for the Spark <-> Flight conversation and wanted to get your op