Hey David,

Yes, I am. I have a patch that is about 3/4 done; I just got busy with a few other things. Are you hoping to use it soon? I would like to get to it this week, but it's looking increasingly unlikely.

Best,
Ryan

On Thu, Jul 25, 2019 at 7:37 PM David Li <li.david...@gmail.com> wrote:
> Hey Ryan,
>
> To follow up on this, are you planning on formally proposing the
> GetSchema() call in Flight? I think it'd be interesting to have beyond
> the Spark use case as finding the schema may or may not be expensive
> depending on the data stream (i.e. something computed on demand might
> require data to be computed in order to get the schema), and
> separating it from GetFlightInfo means that services that "don't know"
> the schema ahead of time can still respond to that endpoint quickly.
> (We could make the change minimal by leaving the schema in FlightInfo
> and simply specifying it as best-effort.)
>
> Best,
> David
>
> On 7/10/19, Ryan Murray <rym...@dremio.com> wrote:
> > Hey Wes,
> >
> > Would be happy to! Jacques and I had originally thought to try and
> > get it into Spark but perhaps Arrow might be a better home. I think
> > the only issue is whether we want to bring Spark jars and their
> > dependencies into Arrow. One challenge I have had so far with the
> > connector is managing the transitive arrow dependencies from Spark,
> > the connector only works on relatively recent versions of Spark and
> > potentially can create circular arrow dependencies. I think this
> > issue will be better once 1.0.0 is done and we can rely on a stable
> > format/api.
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> Hi Ryan, have you thought about developing this inside Apache Arrow?
> >>
> >> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cutl...@gmail.com> wrote:
> >>
> >> > Great, thanks Ryan! I'll take a look
> >> >
> >> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <rym...@dremio.com> wrote:
> >> >
> >> > > Hi Bryan,
> >> > >
> >> > > I have an implementation of option #3 nearly ready for a PR. I
> >> > > will mention you when I publish it.
> >> > >
> >> > > The working prototype for the Spark connector is here:
> >> > > https://github.com/rymurr/flight-spark-source. It technically
> >> > > works (and is very fast!) however the implementation is pretty
> >> > > dodgy and needs to be cleaned up before ready for prime time. I
> >> > > plan to have it ready to go for the Arrow 1.0.0 release as an
> >> > > apache 2.0 licensed project. Please shout if you have any
> >> > > comments or are interested in contributing!
> >> > >
> >> > > Best,
> >> > > Ryan
> >> > >
> >> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cutl...@gmail.com> wrote:
> >> > >
> >> > > > I'm in favor of option #3 also, but not sure what the best
> >> > > > thing to do with the existing FlightInfo response is. I'm
> >> > > > definitely interested in connecting Spark with Flight, can you
> >> > > > share more details of your work or is it planned to be open
> >> > > > sourced?
> >> > > >
> >> > > > Thanks,
> >> > > > Bryan
> >> > > >
> >> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <anto...@python.org> wrote:
> >> > > >
> >> > > > >
> >> > > > > Either #3 or #4 for me. If #3, the default GetSchema
> >> > > > > implementation can rely on calling GetFlightInfo.
> >> > > > >
> >> > > > >
> >> > > > > On 01/07/2019 at 22:50, David Li wrote:
> >> > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> >> > > > > >
> >> > > > > > We've been thinking about a similar issue, where sometimes
> >> > > > > > we want just the schema, but the service can't necessarily
> >> > > > > > return the schema without fetching data - right now we
> >> > > > > > return a sentinel value in GetFlightInfo, but a separate
> >> > > > > > RPC would let us explicitly indicate an error.
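
The two points above (a default GetSchema that relies on GetFlightInfo, and an explicit error instead of a sentinel value) could fit together roughly as follows. This is only a sketch in plain Python, not the Flight server interface: `BaseService`, `StaticService`, `OnDemandService`, `SchemaUnavailable` and the dict-shaped info are all hypothetical names made up for illustration.

```python
class SchemaUnavailable(Exception):
    """Hypothetical error: the schema is not known without fetching data."""


class BaseService:
    """Illustrative base class; not the real Flight server interface."""

    def get_flight_info(self, descriptor):
        raise NotImplementedError

    def get_schema(self, descriptor):
        # Default GetSchema: derive the schema from GetFlightInfo.
        info = self.get_flight_info(descriptor)
        if info.get("schema") is None:
            # Explicit error rather than a sentinel value in the response.
            raise SchemaUnavailable(descriptor)
        return info["schema"]


class StaticService(BaseService):
    """Knows its schema ahead of time, so the default is enough."""

    def get_flight_info(self, descriptor):
        return {"schema": ("id", "value"), "endpoints": ["ticket-0"]}


class OnDemandService(BaseService):
    """Computes data on demand; the schema in FlightInfo is best-effort."""

    def get_flight_info(self, descriptor):
        return {"schema": None, "endpoints": ["ticket-0"]}
```

A service that can produce the schema cheaply by some other route would override `get_schema` instead of inheriting the fallback.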
> >> > > > > >
> >> > > > > > I might be missing something though - what happens between
> >> > > > > > step 1 and 2 that makes the endpoints available? Would it
> >> > > > > > make sense to use DoAction to cause the backend to
> >> > > > > > "prepare" the endpoints, and have the result of that be an
> >> > > > > > encoded schema? So then the flow would be
> >> > > > > > DoAction -> GetFlightInfo -> DoGet.
> >> > > > > >
> >> > > > > > Best,
> >> > > > > > David
> >> > > > > >
> >> > > > > > On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
> >> > > > > >> My inclination is either #2 or #3. #4 is an option of
> >> > > > > >> course, but I like the more structured solution of
> >> > > > > >> explicitly requesting the schema given a descriptor.
> >> > > > > >>
> >> > > > > >> In both cases, it's possible that schemas are sent twice,
> >> > > > > >> e.g. if you call GetSchema and then later call
> >> > > > > >> GetFlightInfo and so you receive the schema again. The
> >> > > > > >> schema is optional, so if it became a performance problem
> >> > > > > >> then a particular server might return the schema as null
> >> > > > > >> from GetFlightInfo.
> >> > > > > >>
> >> > > > > >> I think it's valid to want to make a single GetFlightInfo
> >> > > > > >> RPC request that returns _both_ the schema and the query
> >> > > > > >> plan.
> >> > > > > >>
> >> > > > > >> Thoughts from others?
> >> > > > > >>
> >> > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >> > > > > >>>
> >> > > > > >>> My initial inclination is towards #3 but I'd be curious
> >> > > > > >>> what others think. In the case of #3, I wonder if it
> >> > > > > >>> makes sense to then pull the Schema off the GetFlightInfo
> >> > > > > >>> response...
> >> > > > > >>>
> >> > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
> >> > > > > >>>
> >> > > > > >>>> Hi All,
> >> > > > > >>>>
> >> > > > > >>>> I have been working on building an arrow flight source
> >> > > > > >>>> for spark. The goal here is for Spark to be able to use
> >> > > > > >>>> a group of arrow flight endpoints to get a dataset
> >> > > > > >>>> pulled over to spark in parallel.
> >> > > > > >>>>
> >> > > > > >>>> I am unsure of the best model for the spark <-> flight
> >> > > > > >>>> conversation and wanted to get your opinion on the best
> >> > > > > >>>> way to go.
> >> > > > > >>>>
> >> > > > > >>>> I am breaking up the query to flight from spark into 3
> >> > > > > >>>> parts:
> >> > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to
> >> > > > > >>>> do further lazy operations in Spark
> >> > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time
> >> > > > > >>>> with a different argument. This returns the list of
> >> > > > > >>>> endpoints on the parallel flight server. The endpoints
> >> > > > > >>>> are not available till data is ready to be fetched,
> >> > > > > >>>> which is done after the schema but is needed before
> >> > > > > >>>> DoGet is called.
> >> > > > > >>>> 3) call get stream on all endpoints from 2
> >> > > > > >>>>
> >> > > > > >>>> I think I have to do each step however I don't like
> >> > > > > >>>> having to call getInfo twice, it doesn't seem very
> >> > > > > >>>> elegant.
> >> > > > > >>>> I see a few options:
> >> > > > > >>>> 1) live with calling GetFlightInfo twice and with a
> >> > > > > >>>> custom bytes cmd to differentiate the purpose of each
> >> > > > > >>>> call
> >> > > > > >>>> 2) add an argument to GetFlightInfo to tell it it's
> >> > > > > >>>> being called only for the schema
> >> > > > > >>>> 3) add another rpc endpoint: i.e.
> >> > > > > >>>> GetSchema(FlightDescriptor) to return just the Schema in
> >> > > > > >>>> question
> >> > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a
> >> > > > > >>>> Result
> >> > > > > >>>>
> >> > > > > >>>> I am aware that 4 is probably the least disruptive but
> >> > > > > >>>> I'm also not a fan as (to me) it implies performing an
> >> > > > > >>>> action on the server side. Suggestions 2 & 3 are larger
> >> > > > > >>>> changes and I am reluctant to do that unless there is a
> >> > > > > >>>> consensus here. None of them are great options and I am
> >> > > > > >>>> wondering what everyone thinks the best approach might
> >> > > > > >>>> be? Particularly as I think this is likely to come up in
> >> > > > > >>>> more applications than just spark.
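
To make option 3 concrete, here is a rough sketch of the conversation it implies: one call for the schema, one GetFlightInfo call for the endpoints, then a stream fetched from every endpoint in parallel. It is plain Python against an in-memory stand-in rather than the real Flight API; `FakeFlightService`, `FlightInfo`, `get_schema`, `get_flight_info`, `do_get` and `read_dataset` are hypothetical names chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class FlightInfo:
    """Hypothetical stand-in for a FlightInfo response."""
    schema: tuple      # column names
    endpoints: tuple   # opaque tickets, one per parallel stream


class FakeFlightService:
    """In-memory stand-in for a Flight server (illustrative only)."""

    def __init__(self, columns, partitions):
        self._columns = tuple(columns)
        self._partitions = {
            f"ticket-{i}": rows for i, rows in enumerate(partitions)
        }
        self.calls = []  # RPC names, in the order they were issued

    def get_schema(self, descriptor):
        # Option 3: a separate call that returns only the schema.
        self.calls.append("GetSchema")
        return self._columns

    def get_flight_info(self, descriptor):
        # Returns the endpoints once the data is ready to be fetched.
        self.calls.append("GetFlightInfo")
        return FlightInfo(self._columns, tuple(self._partitions))

    def do_get(self, ticket):
        self.calls.append("DoGet")
        return list(self._partitions[ticket])


def read_dataset(service, descriptor):
    schema = service.get_schema(descriptor)      # step 1: schema only
    info = service.get_flight_info(descriptor)   # step 2: endpoints
    with ThreadPoolExecutor() as pool:            # step 3: parallel DoGet
        streams = list(pool.map(service.do_get, info.endpoints))
    rows = [row for stream in streams for row in stream]
    return schema, rows
```

With this shape GetFlightInfo is called only once, and the schema call can be answered before any endpoints exist.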
> >> > > > > >>>>
> >> > > > > >>>> Best,
> >> > > > > >>>> Ryan
> >> > > > > >>>>
> >> > > > > >>
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> > > --
> >> > >
> >> > > Ryan Murray | Principal Consulting Engineer
> >> > >
> >> > > +447540852009 | rym...@dremio.com
> >> > >
> >> > > <https://www.dremio.com/>
> >> > > Check out our GitHub <https://www.github.com/dremio>, join our community
> >> > > site <https://community.dremio.com/> & Download Dremio
> >> > > <https://www.dremio.com/download>
> >> > >
> >> >
> >>
> >
> > --
> >
> > Ryan Murray | Principal Consulting Engineer
> >
> > +447540852009 | rym...@dremio.com
> >
> > <https://www.dremio.com/>
> > Check out our GitHub <https://www.github.com/dremio>, join our community
> > site <https://community.dremio.com/> & Download Dremio
> > <https://www.dremio.com/download>
>

--

Ryan Murray | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

<https://www.dremio.com/>
Check out our GitHub <https://www.github.com/dremio>, join our community
site <https://community.dremio.com/> & Download Dremio
<https://www.dremio.com/download>