Either #3 or #4 for me. If #3, the default GetSchema implementation can rely on calling GetFlightInfo.
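A minimal sketch of what that default could look like: a base server class whose GetSchema simply delegates to GetFlightInfo and returns only the schema. All names here (`FlightInfo`, `FlightServerBase`, `MyServer`) are illustrative stand-ins for this discussion, not the actual Arrow Flight API.

```python
# Hypothetical sketch of option #3: GetSchema defaults to calling
# GetFlightInfo and extracting just the schema from its response.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FlightInfo:
    schema: str                              # stand-in for an encoded Arrow schema
    endpoints: List[str] = field(default_factory=list)

class FlightServerBase:
    def get_flight_info(self, descriptor):
        raise NotImplementedError

    def get_schema(self, descriptor):
        # Default implementation: delegate to GetFlightInfo and return
        # only the schema, ignoring the (possibly expensive) endpoints.
        return self.get_flight_info(descriptor).schema

class MyServer(FlightServerBase):
    def get_flight_info(self, descriptor):
        return FlightInfo(schema="id: int64, name: utf8",
                          endpoints=["grpc://worker-1", "grpc://worker-2"])

server = MyServer()
schema = server.get_schema("my-dataset")     # only the schema travels back
```

A server that can produce the schema cheaply would override `get_schema` directly; everyone else gets correct behavior for free.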
On 01/07/2019 at 22:50, David Li wrote:
> I think I'd prefer #3 over overloading an existing call (#2).
>
> We've been thinking about a similar issue, where sometimes we want
> just the schema, but the service can't necessarily return the schema
> without fetching data - right now we return a sentinel value in
> GetFlightInfo, but a separate RPC would let us explicitly indicate an
> error.
>
> I might be missing something though - what happens between step 1 and
> 2 that makes the endpoints available? Would it make sense to use
> DoAction to cause the backend to "prepare" the endpoints, and have the
> result of that be an encoded schema? So then the flow would be
> DoAction -> GetFlightInfo -> DoGet.
>
> Best,
> David
>
> On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
>> My inclination is either #2 or #3. #4 is an option of course, but I
>> like the more structured solution of explicitly requesting the schema
>> given a descriptor.
>>
>> In both cases, it's possible that schemas are sent twice, e.g. if you
>> call GetSchema and then later call GetFlightInfo and so you receive
>> the schema again. The schema is optional, so if it became a
>> performance problem then a particular server might return the schema
>> as null from GetFlightInfo.
>>
>> I think it's valid to want to make a single GetFlightInfo RPC request
>> that returns _both_ the schema and the query plan.
>>
>> Thoughts from others?
>>
>> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
>>>
>>> My initial inclination is towards #3 but I'd be curious what others
>>> think. In the case of #3, I wonder if it makes sense to then pull the
>>> Schema off the GetFlightInfo response...
>>>
>>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I have been working on building an arrow flight source for spark.
>>>> The goal here is for Spark to be able to use a group of arrow
>>>> flight endpoints to get a dataset pulled over to spark in parallel.
>>>>
>>>> I am unsure of the best model for the spark <-> flight conversation
>>>> and wanted to get your opinion on the best way to go.
>>>>
>>>> I am breaking up the query to flight from spark into 3 parts:
>>>> 1) get the schema using GetFlightInfo. This is needed to do further
>>>> lazy operations in Spark
>>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
>>>> different argument. This returns the list of endpoints on the
>>>> parallel flight server. The endpoints are not available till data
>>>> is ready to be fetched, which happens after the schema is known but
>>>> before DoGet is called.
>>>> 3) call GetStream on all endpoints from 2
>>>>
>>>> I think I have to do each step; however, I don't like having to
>>>> call GetFlightInfo twice, it doesn't seem very elegant. I see a few
>>>> options:
>>>> 1) live with calling GetFlightInfo twice, with a custom bytes cmd
>>>> to differentiate the purpose of each call
>>>> 2) add an argument to GetFlightInfo to tell it it's being called
>>>> only for the schema
>>>> 3) add another rpc endpoint, i.e. GetSchema(FlightDescriptor), to
>>>> return just the Schema in question
>>>> 4) use DoAction and wrap the expected FlightInfo in a Result
>>>>
>>>> I am aware that 4 is probably the least disruptive but I'm also not
>>>> a fan, as (to me) it implies performing an action on the server
>>>> side. Suggestions 2 & 3 are larger changes and I am reluctant to
>>>> make them unless there is a consensus here. None of them are great
>>>> options and I am wondering what everyone thinks the best approach
>>>> might be? Particularly as I think this is likely to come up in more
>>>> applications than just spark.
>>>>
>>>> Best,
>>>> Ryan
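Ryan's three-step flow could be sketched as follows. The `ToyFlightClient` and its methods are hypothetical stand-ins that simulate a Flight service in memory, purely to illustrate the call sequence; they are not the real Flight or pyarrow API.

```python
# Hypothetical sketch of the Spark <-> Flight conversation:
#   step 1: fetch the schema (cheap, needed for Spark's lazy planning)
#   step 2: fetch the endpoint list (data is now prepared server-side)
#   step 3: call DoGet against every endpoint in parallel
from concurrent.futures import ThreadPoolExecutor

class ToyFlightClient:
    """In-memory stand-in for a Flight client talking to 3 workers."""

    def get_schema(self, descriptor):
        # Step 1: schema only; no data is prepared yet.
        return "id: int64, name: utf8"

    def get_flight_info(self, descriptor):
        # Step 2: triggers preparation; endpoints are now available.
        return ["grpc://worker-1", "grpc://worker-2", "grpc://worker-3"]

    def do_get(self, endpoint):
        # Step 3: fetch one partition's record batches from an endpoint.
        return f"batches from {endpoint}"

client = ToyFlightClient()
schema = client.get_schema("my-query")            # step 1
endpoints = client.get_flight_info("my-query")    # step 2
with ThreadPoolExecutor() as pool:                # step 3, in parallel
    partitions = list(pool.map(client.do_get, endpoints))
```

Under option #3 step 1 becomes its own RPC, so the two remaining GetFlightInfo-shaped calls collapse into one; under option #4 the first call would instead be a DoAction whose Result carries the encoded schema.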