Re: Spark and Arrow Flight

Wes McKinney Tue, 09 Jul 2019 17:09:10 -0700

Hi Ryan, have you thought about developing this inside Apache Arrow?

On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <[email protected]> wrote:


> Great, thanks Ryan! I'll take a look
>
> On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <[email protected]> wrote:
>
> > Hi Bryan,
> >
> > I have an implementation of option #3 nearly ready for a PR. I will
> > mention you when I publish it.
> >
> > The working prototype for the Spark connector is here:
> > https://github.com/rymurr/flight-spark-source. It technically works (and
> > is very fast!); however, the implementation is pretty dodgy and needs to
> > be cleaned up before it is ready for prime time. I plan to have it ready
> > to go for the Arrow 1.0.0 release as an Apache 2.0 licensed project.
> > Please shout if you have any comments or are interested in contributing!
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <[email protected]> wrote:
> >
> > > I'm in favor of option #3 also, but I'm not sure what the best thing to
> > > do with the existing FlightInfo response is. I'm definitely interested
> > > in connecting Spark with Flight. Can you share more details of your
> > > work, or is it planned to be open sourced?
> > >
> > > Thanks,
> > > Bryan
> > >
> > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > > >
> > > > Either #3 or #4 for me.  If #3, the default GetSchema implementation
> > > > can rely on calling GetFlightInfo.
> > > >
> > > >
> > > > On 1 Jul 2019 at 22:50, David Li wrote:
> > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > >
> > > > > We've been thinking about a similar issue, where sometimes we want
> > > > > just the schema, but the service can't necessarily return the
> > > > > schema without fetching data - right now we return a sentinel value
> > > > > in GetFlightInfo, but a separate RPC would let us explicitly
> > > > > indicate an error.
> > > > >
> > > > > I might be missing something though - what happens between step 1
> > > > > and 2 that makes the endpoints available? Would it make sense to use
> > > > > DoAction to cause the backend to "prepare" the endpoints, and have
> > > > > the result of that be an encoded schema? So then the flow would be
> > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > >
> > > > > Best,
> > > > > David
> > > > >
> > > > > On 7/1/19, Wes McKinney <[email protected]> wrote:
> > > > >> My inclination is either #2 or #3. #4 is an option of course, but I
> > > > >> like the more structured solution of explicitly requesting the
> > > > >> schema given a descriptor.
> > > > >>
> > > > >> In both cases, it's possible that schemas are sent twice, e.g. if
> > > > >> you call GetSchema and then later call GetFlightInfo and so you
> > > > >> receive the schema again. The schema is optional, so if it became a
> > > > >> performance problem then a particular server might return the
> > > > >> schema as null from GetFlightInfo.
> > > > >>
> > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > > >> request that returns _both_ the schema and the query plan.
> > > > >>
> > > > >> Thoughts from others?
> > > > >>
> > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <[email protected]> wrote:
> > > > >>>
> > > > >>> My initial inclination is towards #3 but I'd be curious what
> > > > >>> others think.
> > > > >>> In the case of #3, I wonder if it makes sense to then pull the
> > > > >>> Schema off the GetFlightInfo response...
> > > > >>>
> > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <[email protected]> wrote:
> > > > >>>
> > > > >>>> Hi All,
> > > > >>>>
> > > > >>>> I have been working on building an Arrow Flight source for
> > > > >>>> Spark. The goal here is for Spark to be able to use a group of
> > > > >>>> Arrow Flight endpoints to get a dataset pulled over to Spark in
> > > > >>>> parallel.
> > > > >>>>
> > > > >>>> I am unsure of the best model for the Spark <-> Flight
> > > > >>>> conversation and wanted to get your opinion on the best way to go.
> > > > >>>>
> > > > >>>> I am breaking up the query to Flight from Spark into 3 parts:
> > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do
> > > > >>>> further lazy operations in Spark
> > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > > > >>>> different argument. This returns the list of endpoints on the
> > > > >>>> parallel Flight server. The endpoints are not available until
> > > > >>>> the data is ready to be fetched, which happens after the schema
> > > > >>>> is fetched but is needed before DoGet is called.
> > > > >>>> 3) call get stream on all endpoints from 2
> > > > >>>>
> > > > >>>> I think I have to do each step; however, I don't like having to
> > > > >>>> call getInfo twice, as it doesn't seem very elegant. I see a few
> > > > >>>> options:
> > > > >>>> 1) live with calling GetFlightInfo twice, with a custom bytes
> > > > >>>> cmd to differentiate the purpose of each call
> > > > >>>> 2) add an argument to GetFlightInfo to tell it it's being called
> > > > >>>> only for the schema
> > > > >>>> 3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to
> > > > >>>> return just the Schema in question
> > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
> > > > >>>>
> > > > >>>> I am aware that 4 is probably the least disruptive, but I'm also
> > > > >>>> not a fan as (to me) it implies performing an action on the
> > > > >>>> server side. Suggestions 2 & 3 are larger changes and I am
> > > > >>>> reluctant to make them unless there is a consensus here. None of
> > > > >>>> them are great options and I am wondering what everyone thinks
> > > > >>>> the best approach might be, particularly as I think this is
> > > > >>>> likely to come up in more applications than just Spark.
> > > > >>>>
> > > > >>>> Best,
> > > > >>>> Ryan
> > > > >>>>
> > > > >>
> > > >
> > >
> >
> >
> > --
> >
> > Ryan Murray  | Principal Consulting Engineer
> >
> > +447540852009 | [email protected]
> >
>
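
For readers following the thread, the flow under option #3 (a separate GetSchema call, then GetFlightInfo for the endpoints, then DoGet per endpoint) can be sketched roughly as below. This is a toy, standard-library-only model and not the Arrow Flight API: ToyFlightService, FlightInfo, read_dataset, and the ticket format are all hypothetical names invented for illustration. It assumes the default GetSchema simply delegates to GetFlightInfo, as Antoine suggests, and raises an explicit error when no schema is available, in the spirit of David's point about avoiding a sentinel value.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical schema representation: (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


@dataclass
class FlightInfo:
    schema: Optional[Schema]  # optional, as noted in the thread
    endpoints: List[str]      # one ticket per parallel stream


class ToyFlightService:
    """Hypothetical in-memory stand-in for a Flight server (not the Arrow API)."""

    def __init__(self, schema: Optional[Schema], partitions: List[list]):
        self._schema = schema
        self._partitions = partitions

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        # One ticket per partition; a real server would plan the query here.
        tickets = [f"{descriptor}/{i}" for i in range(len(self._partitions))]
        return FlightInfo(self._schema, tickets)

    def get_schema(self, descriptor: str) -> Schema:
        # Default implementation: rely on GetFlightInfo, but signal an
        # explicit error instead of returning a sentinel value.
        info = self.get_flight_info(descriptor)
        if info.schema is None:
            raise LookupError(f"schema not available for {descriptor!r}")
        return info.schema

    def do_get(self, ticket: str) -> list:
        # The ticket's trailing index selects the partition to stream back.
        index = int(ticket.rsplit("/", 1)[1])
        return self._partitions[index]


def read_dataset(service: ToyFlightService, descriptor: str):
    schema = service.get_schema(descriptor)     # step 1: schema only
    info = service.get_flight_info(descriptor)  # step 2: endpoints
    rows = []
    for ticket in info.endpoints:               # step 3: one DoGet per endpoint
        rows.extend(service.do_get(ticket))     # sequential here; Spark would parallelize
    return schema, rows
```

In this sketch the client still makes two metadata calls, but each has a single purpose, so no custom command bytes are needed to tell them apart.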
