Either #3 or #4 for me. If #3, the default GetSchema implementation can rely on calling GetFlightInfo.
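A minimal sketch of what that default could look like: a base server class whose GetSchema simply delegates to GetFlightInfo and returns only the schema. All names here (`FlightInfo`, `FlightServerBase`, `MyServer`) are illustrative stand-ins for this discussion, not the actual Arrow Flight API.

```python
# Hypothetical sketch of option #3: GetSchema defaults to calling
# GetFlightInfo and extracting just the schema from its response.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FlightInfo:
    schema: str                              # stand-in for an encoded Arrow schema
    endpoints: List[str] = field(default_factory=list)

class FlightServerBase:
    def get_flight_info(self, descriptor):
        raise NotImplementedError

    def get_schema(self, descriptor):
        # Default implementation: delegate to GetFlightInfo and return
        # only the schema, ignoring the (possibly expensive) endpoints.
        return self.get_flight_info(descriptor).schema

class MyServer(FlightServerBase):
    def get_flight_info(self, descriptor):
        return FlightInfo(schema="id: int64, name: utf8",
                          endpoints=["grpc://worker-1", "grpc://worker-2"])

server = MyServer()
schema = server.get_schema("my-dataset")     # only the schema travels back
```

A server that can produce the schema cheaply would override `get_schema` directly; everyone else gets correct behavior for free.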
On 01/07/2019 at 22:50, David Li wrote:
> I think I'd prefer #3 over overloading an existing call (#2).
>
> We've been thinking about a similar issue, where sometimes we want
> just the schema, but the service can't necessarily return the schema
> without fetching data - right now we return a sentinel value in
> GetFlightInfo, but a separate RPC would let us explicitly indicate an
> error.
>
> I might be missing something though - what happens between step 1 and
> 2 that makes the endpoints available? Would it make sense to use
> DoAction to cause the backend to "prepare" the endpoints, and have the
> result of that be an encoded schema? So then the flow would be
> DoAction -> GetFlightInfo -> DoGet.
>
> Best,
> David
>
> On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
>> My inclination is either #2 or #3. #4 is an option of course, but I
>> like the more structured solution of explicitly requesting the schema
>> given a descriptor.
>>
>> In both cases, it's possible that schemas are sent twice, e.g. if you
>> call GetSchema and then later call GetFlightInfo and so you receive
>> the schema again. The schema is optional, so if it became a
>> performance problem then a particular server might return the schema
>> as null from GetFlightInfo.
>>
>> I think it's valid to want to make a single GetFlightInfo RPC request
>> that returns _both_ the schema and the query plan.
>>
>> Thoughts from others?
>>
>> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
>>>
>>> My initial inclination is towards #3 but I'd be curious what others
>>> think. In the case of #3, I wonder if it makes sense to then pull the
>>> Schema off the GetFlightInfo response...
>>>
>>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have been working on building an arrow flight source for spark.
>>>> The goal here is for Spark to be able to use a group of arrow
>>>> flight endpoints to get a dataset pulled over to spark in parallel.
>>>>
>>>> I am unsure of the best model for the spark <-> flight conversation
>>>> and wanted to get your opinion on the best way to go.
>>>>
>>>> I am breaking up the query to flight from spark into 3 parts:
>>>> 1) get the schema using GetFlightInfo. This is needed to do further
>>>> lazy operations in Spark
>>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
>>>> different argument. This returns the list of endpoints on the
>>>> parallel flight server. The endpoints are not available till data
>>>> is ready to be fetched, which happens after the schema is known but
>>>> before DoGet is called.
>>>> 3) call GetStream on all endpoints from 2
>>>>
>>>> I think I have to do each step; however, I don't like having to
>>>> call GetFlightInfo twice, it doesn't seem very elegant. I see a few
>>>> options:
>>>> 1) live with calling GetFlightInfo twice, with a custom bytes cmd
>>>> to differentiate the purpose of each call
>>>> 2) add an argument to GetFlightInfo to tell it it's being called
>>>> only for the schema
>>>> 3) add another rpc endpoint, i.e. GetSchema(FlightDescriptor), to
>>>> return just the Schema in question
>>>> 4) use DoAction and wrap the expected FlightInfo in a Result
>>>>
>>>> I am aware that 4 is probably the least disruptive but I'm also not
>>>> a fan, as (to me) it implies performing an action on the server
>>>> side. Suggestions 2 & 3 are larger changes and I am reluctant to
>>>> make them unless there is a consensus here. None of them are great
>>>> options and I am wondering what everyone thinks the best approach
>>>> might be? Particularly as I think this is likely to come up in more
>>>> applications than just spark.
>>>>
>>>> Best,
>>>> Ryan
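Ryan's three-step flow could be sketched as follows. The `ToyFlightClient` and its methods are hypothetical stand-ins that simulate a Flight service in memory, purely to illustrate the call sequence; they are not the real Flight or pyarrow API.

```python
# Hypothetical sketch of the Spark <-> Flight conversation:
#   step 1: fetch the schema (cheap, needed for Spark's lazy planning)
#   step 2: fetch the endpoint list (data is now prepared server-side)
#   step 3: call DoGet against every endpoint in parallel
from concurrent.futures import ThreadPoolExecutor

class ToyFlightClient:
    """In-memory stand-in for a Flight client talking to 3 workers."""

    def get_schema(self, descriptor):
        # Step 1: schema only; no data is prepared yet.
        return "id: int64, name: utf8"

    def get_flight_info(self, descriptor):
        # Step 2: triggers preparation; endpoints are now available.
        return ["grpc://worker-1", "grpc://worker-2", "grpc://worker-3"]

    def do_get(self, endpoint):
        # Step 3: fetch one partition's record batches from an endpoint.
        return f"batches from {endpoint}"

client = ToyFlightClient()
schema = client.get_schema("my-query")            # step 1
endpoints = client.get_flight_info("my-query")    # step 2
with ThreadPoolExecutor() as pool:                # step 3, in parallel
    partitions = list(pool.map(client.do_get, endpoints))
```

Under option #3 step 1 becomes its own RPC, so the two remaining GetFlightInfo-shaped calls collapse into one; under option #4 the first call would instead be a DoAction whose Result carries the encoded schema.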