Ah, I was just wondering what the status was, thanks for the info! I think that is a format change, so it would need to go through a vote here, though.

Best,
David

On 7/25/19, Ryan Murray <rym...@dremio.com> wrote:
> Hey David,
>
> Yes I am. I have a 3/4 done patch ready to go, just got busy with a few
> other things. Are you hoping to use it soon? I would like to get to it this
> week but it's looking increasingly unlikely.
>
> Best,
> Ryan
>
> On Thu, Jul 25, 2019 at 7:37 PM David Li <li.david...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> To follow up on this, are you planning on formally proposing the
>> GetSchema() call in Flight? I think it'd be interesting to have beyond
>> the Spark use case as finding the schema may or may not be expensive
>> depending on the data stream (i.e. something computed on demand might
>> require data to be computed in order to get the schema), and
>> separating it from GetFlightInfo means that services that "don't know"
>> the schema ahead of time can still respond to that endpoint quickly.
>> (We could make the change minimal by leaving the schema in FlightInfo
>> and simply specifying it as best-effort.)
>>
>> Best,
>> David
>>
>> On 7/10/19, Ryan Murray <rym...@dremio.com> wrote:
>> > Hey Wes,
>> >
>> > Would be happy to! Jacques and I had originally thought to try and get it
>> > into Spark but perhaps Arrow might be a better home. I think the only issue
>> > is whether we want to bring Spark jars and their dependencies into Arrow.
>> > One challenge I have had so far with the connector is managing the
>> > transitive arrow dependencies from Spark, the connector only works on
>> > relatively recent versions of Spark and potentially can create circular
>> > arrow dependencies. I think this issue will be better once 1.0.0 is done
>> > and we can rely on a stable format/api.
>> >
>> > Best,
>> > Ryan
>> >
>> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> Hi Ryan, have you thought about developing this inside Apache Arrow?
>> >>
>> >> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cutl...@gmail.com> wrote:
>> >>
>> >> > Great, thanks Ryan! I'll take a look
>> >> >
>> >> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <rym...@dremio.com> wrote:
>> >> >
>> >> > > Hi Bryan,
>> >> > >
>> >> > > I have an implementation of option #3 nearly ready for a PR. I will mention
>> >> > > you when I publish it.
>> >> > >
>> >> > > The working prototype for the Spark connector is here:
>> >> > > https://github.com/rymurr/flight-spark-source. It technically works (and is
>> >> > > very fast!) however the implementation is pretty dodgy and needs to be
>> >> > > cleaned up before ready for prime time. I plan to have it ready to go for
>> >> > > the Arrow 1.0.0 release as an Apache 2.0 licensed project. Please shout if
>> >> > > you have any comments or are interested in contributing!
>> >> > >
>> >> > > Best,
>> >> > > Ryan
>> >> > >
>> >> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cutl...@gmail.com> wrote:
>> >> > >
>> >> > > > I'm in favor of option #3 also, but not sure what the best thing to do with
>> >> > > > the existing FlightInfo response is. I'm definitely interested in
>> >> > > > connecting Spark with Flight, can you share more details of your work or is
>> >> > > > it planned to be open sourced?
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Bryan
>> >> > > >
>> >> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <anto...@python.org> wrote:
>> >> > > >
>> >> > > > >
>> >> > > > > Either #3 or #4 for me. If #3, the default GetSchema implementation can
>> >> > > > > rely on calling GetFlightInfo.
>> >> > > > >
>> >> > > > >
>> >> > > > > On 01/07/2019 at 22:50, David Li wrote:
>> >> > > > > > I think I'd prefer #3 over overloading an existing call (#2).
>> >> > > > > >
>> >> > > > > > We've been thinking about a similar issue, where sometimes we want
>> >> > > > > > just the schema, but the service can't necessarily return the schema
>> >> > > > > > without fetching data - right now we return a sentinel value in
>> >> > > > > > GetFlightInfo, but a separate RPC would let us explicitly indicate an
>> >> > > > > > error.
>> >> > > > > >
>> >> > > > > > I might be missing something though - what happens between step 1 and
>> >> > > > > > 2 that makes the endpoints available? Would it make sense to use
>> >> > > > > > DoAction to cause the backend to "prepare" the endpoints, and have the
>> >> > > > > > result of that be an encoded schema? So then the flow would be
>> >> > > > > > DoAction -> GetFlightInfo -> DoGet.
>> >> > > > > >
>> >> > > > > > Best,
>> >> > > > > > David
>> >> > > > > >
>> >> > > > > > On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
>> >> > > > > >> My inclination is either #2 or #3. #4 is an option of course, but I
>> >> > > > > >> like the more structured solution of explicitly requesting the schema
>> >> > > > > >> given a descriptor.
>> >> > > > > >>
>> >> > > > > >> In both cases, it's possible that schemas are sent twice, e.g. if you
>> >> > > > > >> call GetSchema and then later call GetFlightInfo and so you receive
>> >> > > > > >> the schema again. The schema is optional, so if it became a
>> >> > > > > >> performance problem then a particular server might return the schema
>> >> > > > > >> as null from GetFlightInfo.
>> >> > > > > >>
>> >> > > > > >> I think it's valid to want to make a single GetFlightInfo RPC request
>> >> > > > > >> that returns _both_ the schema and the query plan.
>> >> > > > > >>
>> >> > > > > >> Thoughts from others?
>> >> > > > > >>
>> >> > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
>> >> > > > > >>>
>> >> > > > > >>> My initial inclination is towards #3 but I'd be curious what others
>> >> > > > > >>> think.
>> >> > > > > >>> In the case of #3, I wonder if it makes sense to then pull the Schema off
>> >> > > > > >>> the GetFlightInfo response...
>> >> > > > > >>>
>> >> > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
>> >> > > > > >>>
>> >> > > > > >>>> Hi All,
>> >> > > > > >>>>
>> >> > > > > >>>> I have been working on building an arrow flight source for spark. The goal
>> >> > > > > >>>> here is for Spark to be able to use a group of arrow flight endpoints to
>> >> > > > > >>>> get a dataset pulled over to spark in parallel.
>> >> > > > > >>>>
>> >> > > > > >>>> I am unsure of the best model for the spark <-> flight conversation and
>> >> > > > > >>>> wanted to get your opinion on the best way to go.
>> >> > > > > >>>>
>> >> > > > > >>>> I am breaking up the query to flight from spark into 3 parts:
>> >> > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do further lazy
>> >> > > > > >>>> operations in Spark
>> >> > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a different
>> >> > > > > >>>> argument. This returns the list of endpoints on the parallel flight server.
>> >> > > > > >>>> The endpoints are not available till data is ready to be fetched, which is
>> >> > > > > >>>> done after the schema but is needed before DoGet is called.
>> >> > > > > >>>> 3) call get stream on all endpoints from 2
>> >> > > > > >>>>
>> >> > > > > >>>> I think I have to do each step however I don't like having to call getInfo
>> >> > > > > >>>> twice, it doesn't seem very elegant. I see a few options:
>> >> > > > > >>>> 1) live with calling GetFlightInfo twice and with a custom bytes cmd to
>> >> > > > > >>>> differentiate the purpose of each call
>> >> > > > > >>>> 2) add an argument to GetFlightInfo to tell it it's being called only for
>> >> > > > > >>>> the schema
>> >> > > > > >>>> 3) add another rpc endpoint: i.e. GetSchema(FlightDescriptor) to return just
>> >> > > > > >>>> the Schema in question
>> >> > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
>> >> > > > > >>>>
>> >> > > > > >>>> I am aware that 4 is probably the least disruptive but I'm also not a fan
>> >> > > > > >>>> as (to me) it implies performing an action on the server side. Suggestions
>> >> > > > > >>>> 2 & 3 are larger changes and I am reluctant to do that unless there is a
>> >> > > > > >>>> consensus here. None of them are great options and I am wondering what
>> >> > > > > >>>> everyone thinks the best approach might be? Particularly as I think this is
>> >> > > > > >>>> likely to come up in more applications than just spark.
>> >> > > > > >>>>
>> >> > > > > >>>> Best,
>> >> > > > > >>>> Ryan
>> >> > > > > >>>>
>> >> > >
>> >> > > --
>> >> > >
>> >> > > Ryan Murray | Principal Consulting Engineer
>> >> > >
>> >> > > +447540852009 | rym...@dremio.com
>> >> > >
>> >> > > <https://www.dremio.com/>
>> >> > > Check out our GitHub <https://www.github.com/dremio>, join our community
>> >> > > site <https://community.dremio.com/> & Download Dremio
>> >> > > <https://www.dremio.com/download>
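
For anyone reading the archive later, here is a rough sketch of how the flow might look under option #3 from the thread (a separate GetSchema call, with the schema in FlightInfo treated as best-effort). It is a plain in-memory model and not the Arrow Flight API: `FlightServerBase`, `InMemoryServer`, `read_dataset`, the endpoint names and the sample rows are all made up for illustration. The default `get_schema` falls back to `get_flight_info`, as suggested in the thread, and raises an explicit error instead of returning a sentinel when the schema is not known.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FlightDescriptor:
    """Hypothetical stand-in for a Flight descriptor carrying an opaque command."""
    cmd: bytes


@dataclass
class FlightInfo:
    """Hypothetical stand-in: the schema is best-effort and may be None."""
    schema: Optional[List[str]]
    endpoints: List[str] = field(default_factory=list)


class FlightServerBase:
    """Illustrative server interface; method names are assumptions."""

    def get_flight_info(self, descriptor: FlightDescriptor) -> FlightInfo:
        raise NotImplementedError

    def get_schema(self, descriptor: FlightDescriptor) -> List[str]:
        # Default behaviour: derive the schema from GetFlightInfo and
        # signal an explicit error rather than returning a sentinel value.
        info = self.get_flight_info(descriptor)
        if info.schema is None:
            raise LookupError("schema not available without fetching data")
        return info.schema

    def do_get(self, endpoint: str) -> List[Tuple]:
        raise NotImplementedError


class InMemoryServer(FlightServerBase):
    """Toy server holding one dataset split across two endpoints."""

    def __init__(self) -> None:
        self._schema = ["id", "name"]
        self._parts = {"ep-0": [(1, "a")], "ep-1": [(2, "b")]}

    def get_flight_info(self, descriptor: FlightDescriptor) -> FlightInfo:
        return FlightInfo(schema=self._schema, endpoints=sorted(self._parts))

    def do_get(self, endpoint: str) -> List[Tuple]:
        return self._parts[endpoint]


def read_dataset(server: FlightServerBase, descriptor: FlightDescriptor):
    schema = server.get_schema(descriptor)      # step 1: schema only
    info = server.get_flight_info(descriptor)   # step 2: endpoints
    rows: List[Tuple] = []
    for endpoint in info.endpoints:             # step 3: one DoGet per endpoint
        rows.extend(server.do_get(endpoint))
    return schema, rows
```

A client such as the Spark source would run step 1 while planning and defer steps 2 and 3 until data is actually needed. A server that cannot know the schema ahead of time could override `get_schema` or leave `FlightInfo.schema` as `None`.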