Re: Spark and Arrow Flight

Bryan Cutler Tue, 09 Jul 2019 15:42:39 -0700

Great, thanks Ryan! I'll take a look

On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <rym...@dremio.com> wrote:


> Hi Bryan,
>
> I have an implementation of option #3 nearly ready for a PR. I will mention
> you when I publish it.
>
> The working prototype for the Spark connector is here:
> https://github.com/rymurr/flight-spark-source. It technically works (and
> is very fast!); however, the implementation is pretty dodgy and needs to
> be cleaned up before it is ready for prime time. I plan to have it ready
> to go for the Arrow 1.0.0 release as an Apache 2.0 licensed project.
> Please shout if you have any comments or are interested in contributing!
>
> Best,
> Ryan
>
> On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cutl...@gmail.com> wrote:
>
> > I'm in favor of option #3 also, but I'm not sure what the best thing to
> > do with the existing FlightInfo response is. I'm definitely interested
> > in connecting Spark with Flight. Can you share more details of your
> > work, or is it planned to be open sourced?
> >
> > Thanks,
> > Bryan
> >
> > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > >
> > > Either #3 or #4 for me.  If #3, the default GetSchema implementation
> > > can rely on calling GetFlightInfo.
> > >
> > >
> > > On 1 Jul 2019 at 22:50, David Li wrote:
> > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > >
> > > > We've been thinking about a similar issue, where sometimes we want
> > > > just the schema, but the service can't necessarily return the schema
> > > > without fetching data - right now we return a sentinel value in
> > > > GetFlightInfo, but a separate RPC would let us explicitly indicate an
> > > > error.
> > > >
> > > > I might be missing something though - what happens between step 1 and
> > > > 2 that makes the endpoints available? Would it make sense to use
> > > > DoAction to cause the backend to "prepare" the endpoints, and have
> > > > the result of that be an encoded schema? So then the flow would be
> > > > DoAction -> GetFlightInfo -> DoGet.
> > > >
> > > > Best,
> > > > David
> > > >
> > > > On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
> > > >> My inclination is either #2 or #3. #4 is an option of course, but I
> > > >> like the more structured solution of explicitly requesting the
> > > >> schema given a descriptor.
> > > >>
> > > >> In both cases, it's possible that schemas are sent twice, e.g. if
> > > >> you call GetSchema and then later call GetFlightInfo, you receive
> > > >> the schema again. The schema is optional, so if this became a
> > > >> performance problem, a particular server could return the schema
> > > >> as null from GetFlightInfo.
> > > >>
> > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > >> request that returns _both_ the schema and the query plan.
> > > >>
> > > >> Thoughts from others?
> > > >>
> > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > >>>
> > > >>> My initial inclination is towards #3 but I'd be curious what others
> > > >>> think. In the case of #3, I wonder if it makes sense to then pull
> > > >>> the Schema off the GetFlightInfo response...
> > > >>>
> > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
> > > >>>
> > > >>>> Hi All,
> > > >>>>
> > > >>>> I have been working on building an Arrow Flight source for Spark.
> > > >>>> The goal here is for Spark to be able to use a group of Arrow
> > > >>>> Flight endpoints to pull a dataset over to Spark in parallel.
> > > >>>>
> > > >>>> I am unsure of the best model for the Spark <-> Flight conversation
> > > >>>> and wanted to get your opinion on the best way to go.
> > > >>>>
> > > >>>> I am breaking up the query to Flight from Spark into 3 parts:
> > > >>>> 1) Get the schema using GetFlightInfo. This is needed to do further
> > > >>>> lazy operations in Spark.
> > > >>>> 2) Get the endpoints by calling GetFlightInfo a 2nd time with a
> > > >>>> different argument. This returns the list of endpoints on the
> > > >>>> parallel Flight server. The endpoints are not available until the
> > > >>>> data is ready to be fetched, which happens after the schema is
> > > >>>> retrieved, but they are needed before DoGet is called.
> > > >>>> 3) Call get stream (DoGet) on all endpoints from step 2.
> > > >>>>
> > > >>>> I think I have to do each step; however, I don't like having to
> > > >>>> call getInfo twice, as it doesn't seem very elegant. I see a few
> > > >>>> options:
> > > >>>> 1) live with calling GetFlightInfo twice, with a custom bytes cmd
> > > >>>> to differentiate the purpose of each call
> > > >>>> 2) add an argument to GetFlightInfo to tell it it's being called
> > > >>>> only for the schema
> > > >>>> 3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to
> > > >>>> return just the Schema in question
> > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
> > > >>>>
> > > >>>> I am aware that 4 is probably the least disruptive, but I'm also
> > > >>>> not a fan, as (to me) it implies performing an action on the
> > > >>>> server side. Suggestions 2 & 3 are larger changes and I am
> > > >>>> reluctant to make them unless there is a consensus here. None of
> > > >>>> them are great options and I am wondering what everyone thinks the
> > > >>>> best approach might be, particularly as I think this is likely to
> > > >>>> come up in more applications than just Spark.
> > > >>>>
> > > >>>> Best,
> > > >>>> Ryan
> > > >>>>
> > > >>
> > >
> >
>
>
> --
>
> Ryan Murray  | Principal Consulting Engineer
>
> +447540852009 | rym...@dremio.com
>
>
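
The three-step flow discussed in this thread, using option #3 (a separate GetSchema call ahead of GetFlightInfo and DoGet), can be sketched roughly as follows. This is a toy in-memory model in Python, not the Arrow Flight API: ToyFlightServer, Endpoint, and read_dataset are hypothetical names introduced only for illustration, the schema is reduced to a list of column names, and a thread pool stands in for Spark's parallel readers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical stand-in for a Flight endpoint: one partition of the result."""
    ticket: str


class ToyFlightServer:
    """In-memory stand-in for a Flight service; NOT the real Arrow Flight API."""

    def __init__(self, partitions):
        # partitions: dict mapping ticket -> list of rows (each row is a dict)
        self._partitions = partitions

    def get_schema(self, descriptor):
        # Option #3: a dedicated call that returns only the column names.
        first_rows = next(iter(self._partitions.values()))
        return sorted(first_rows[0].keys())

    def get_flight_info(self, descriptor):
        # Returns the endpoints once the data is ready to be fetched.
        return [Endpoint(ticket) for ticket in sorted(self._partitions)]

    def do_get(self, endpoint):
        # Returns the rows behind one endpoint.
        return list(self._partitions[endpoint.ticket])


def read_dataset(server, descriptor):
    schema = server.get_schema(descriptor)          # step 1: schema only
    endpoints = server.get_flight_info(descriptor)  # step 2: endpoints
    with ThreadPoolExecutor(max_workers=max(1, len(endpoints))) as pool:
        # step 3: fetch every endpoint in parallel; map() keeps endpoint order
        chunks = list(pool.map(server.do_get, endpoints))
    rows = [row for chunk in chunks for row in chunk]
    return schema, rows


server = ToyFlightServer({
    "part-0": [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    "part-1": [{"id": 3, "name": "c"}],
})
schema, rows = read_dataset(server, "my-query")
print(schema)
print(len(rows))
```

The point of the sketch is only the call order: the schema is available before any endpoint exists, and GetFlightInfo is called once rather than twice.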

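Antoine's suggestion that a default GetSchema implementation can rely on GetFlightInfo, together with David's point that a separate RPC lets a service signal an explicit error instead of a sentinel value, might look something like the sketch below. It is a plain-Python illustration under assumptions, not Arrow Flight code: FlightInfo here is a trimmed-down hypothetical class, and BaseFlightService, SchemaUnavailable, EagerService, and LazyService are invented names.

```python
class SchemaUnavailable(Exception):
    """Hypothetical error: the service cannot give a schema without fetching data."""


class FlightInfo:
    """Hypothetical, trimmed-down FlightInfo: an optional schema plus endpoints."""

    def __init__(self, schema, endpoints):
        self.schema = schema        # may be None, since the schema is optional
        self.endpoints = endpoints


class BaseFlightService:
    def get_flight_info(self, descriptor):
        raise NotImplementedError

    def get_schema(self, descriptor):
        # Default implementation: reuse GetFlightInfo and keep only the schema.
        info = self.get_flight_info(descriptor)
        if info.schema is None:
            # An explicit error rather than a sentinel value.
            raise SchemaUnavailable(descriptor)
        return info.schema


class EagerService(BaseFlightService):
    """Knows its schema up front, so the default get_schema just works."""

    def get_flight_info(self, descriptor):
        return FlightInfo(schema=["id", "name"], endpoints=["part-0"])


class LazyService(BaseFlightService):
    """Leaves the schema out of FlightInfo, so get_schema raises."""

    def get_flight_info(self, descriptor):
        return FlightInfo(schema=None, endpoints=["part-0"])


print(EagerService().get_schema("my-query"))
try:
    LazyService().get_schema("my-query")
except SchemaUnavailable as exc:
    print("schema unavailable for", exc)
```

A service that can compute the schema cheaply would override get_schema directly; the default only saves existing services from having to implement a second method.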