Ah, I was just wondering what the status was, thanks for the info!

I think that is a format change, so it would need to go through a vote
here, though.

Best,
David

On 7/25/19, Ryan Murray <rym...@dremio.com> wrote:
> Hey David,
>
> Yes, I am. I have a patch that is about 3/4 done; I just got busy with a few
> other things. Are you hoping to use it soon? I would like to get to it this
> week, but it's looking increasingly unlikely.
>
> Best,
> Ryan
>
> On Thu, Jul 25, 2019 at 7:37 PM David Li <li.david...@gmail.com> wrote:
>
>> Hey Ryan,
>>
>> To follow up on this, are you planning on formally proposing the
>> GetSchema() call in Flight? I think it'd be interesting to have beyond
>> the Spark use case, as finding the schema may or may not be expensive
>> depending on the data stream (e.g. something computed on demand might
>> require data to be computed in order to get the schema), and
>> separating it from GetFlightInfo means that services that "don't know"
>> the schema ahead of time can still respond to that endpoint quickly.
>> (We could make the change minimal by leaving the schema in FlightInfo
>> and simply specifying it as best-effort.)
>>
>> Best,
>> David
>>
>> On 7/10/19, Ryan Murray <rym...@dremio.com> wrote:
>> > Hey Wes,
>> >
>> > Would be happy to! Jacques and I had originally thought to try and get it
>> > into Spark, but perhaps Arrow might be a better home. I think the only
>> > issue is whether we want to bring Spark jars and their dependencies into
>> > Arrow. One challenge I have had so far with the connector is managing the
>> > transitive Arrow dependencies from Spark: the connector only works on
>> > relatively recent versions of Spark and can potentially create circular
>> > Arrow dependencies. I think this issue will improve once 1.0.0 is done
>> > and we can rely on a stable format/API.
>> >
>> > Best,
>> > Ryan
>> >
>> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> Hi Ryan, have you thought about developing this inside Apache Arrow?
>> >>
>> >> On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cutl...@gmail.com> wrote:
>> >>
>> >> > Great, thanks Ryan! I'll take a look
>> >> >
>> >> > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <rym...@dremio.com> wrote:
>> >> >
>> >> > > Hi Bryan,
>> >> > >
>> >> > > I have an implementation of option #3 nearly ready for a PR. I will
>> >> > > mention you when I publish it.
>> >> > >
>> >> > > The working prototype for the Spark connector is here:
>> >> > > https://github.com/rymurr/flight-spark-source. It technically works
>> >> > > (and is very fast!); however, the implementation is pretty dodgy and
>> >> > > needs to be cleaned up before it is ready for prime time. I plan to
>> >> > > have it ready to go for the Arrow 1.0.0 release as an Apache 2.0
>> >> > > licensed project. Please shout if you have any comments or are
>> >> > > interested in contributing!
>> >> > >
>> >> > > Best,
>> >> > > Ryan
>> >> > >
>> >> > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cutl...@gmail.com> wrote:
>> >> > >
>> >> > > > I'm in favor of option #3 also, but I'm not sure what the best thing
>> >> > > > to do with the existing FlightInfo response is. I'm definitely
>> >> > > > interested in connecting Spark with Flight. Can you share more
>> >> > > > details of your work, or is it planned to be open sourced?
>> >> > > >
>> >> > > > Thanks,
>> >> > > > Bryan
>> >> > > >
>> >> > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <anto...@python.org>
>> >> > > > wrote:
>> >> > > >
>> >> > > > >
>> >> > > > > Either #3 or #4 for me. If #3, the default GetSchema implementation
>> >> > > > > can rely on calling GetFlightInfo.
>> >> > > > >
>> >> > > > >
>> >> > > > > On Jul 1, 2019 at 22:50, David Li wrote:
>> >> > > > > > I think I'd prefer #3 over overloading an existing call (#2).
>> >> > > > > >
>> >> > > > > > We've been thinking about a similar issue, where sometimes we want
>> >> > > > > > just the schema, but the service can't necessarily return the
>> >> > > > > > schema without fetching data - right now we return a sentinel
>> >> > > > > > value in GetFlightInfo, but a separate RPC would let us explicitly
>> >> > > > > > indicate an error.
>> >> > > > > >
>> >> > > > > > I might be missing something though - what happens between steps 1
>> >> > > > > > and 2 that makes the endpoints available? Would it make sense to
>> >> > > > > > use DoAction to cause the backend to "prepare" the endpoints, and
>> >> > > > > > have the result of that be an encoded schema? So then the flow
>> >> > > > > > would be DoAction -> GetFlightInfo -> DoGet.
>> >> > > > > >
>> >> > > > > > Best,
>> >> > > > > > David
>> >> > > > > >
>> >> > > > > > On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
>> >> > > > > >> My inclination is either #2 or #3. #4 is an option of course, but
>> >> > > > > >> I like the more structured solution of explicitly requesting the
>> >> > > > > >> schema given a descriptor.
>> >> > > > > >>
>> >> > > > > >> In both cases, it's possible that schemas are sent twice, e.g. if
>> >> > > > > >> you call GetSchema and then later call GetFlightInfo and so
>> >> > > > > >> receive the schema again. The schema is optional, so if it became
>> >> > > > > >> a performance problem then a particular server might return the
>> >> > > > > >> schema as null from GetFlightInfo.
>> >> > > > > >>
>> >> > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
>> >> > > > > >> request that returns _both_ the schema and the query plan.
>> >> > > > > >>
>> >> > > > > >> Thoughts from others?
>> >> > > > > >>
>> >> > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org>
>> >> > > > > >> wrote:
>> >> > > > > >>>
>> >> > > > > >>> My initial inclination is towards #3 but I'd be curious what
>> >> > > > > >>> others think.
>> >> > > > > >>> In the case of #3, I wonder if it makes sense to then pull the
>> >> > > > > >>> Schema off the GetFlightInfo response...
>> >> > > > > >>>
>> >> > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com>
>> >> > > > > >>> wrote:
>> >> > > > > >>>
>> >> > > > > >>>> Hi All,
>> >> > > > > >>>>
>> >> > > > > >>>> I have been working on building an Arrow Flight source for Spark.
>> >> > > > > >>>> The goal here is for Spark to be able to use a group of Arrow
>> >> > > > > >>>> Flight endpoints to get a dataset pulled over to Spark in
>> >> > > > > >>>> parallel.
>> >> > > > > >>>>
>> >> > > > > >>>> I am unsure of the best model for the Spark <-> Flight
>> >> > > > > >>>> conversation and wanted to get your opinion on the best way to go.
>> >> > > > > >>>>
>> >> > > > > >>>> I am breaking up the query to Flight from Spark into 3 parts:
>> >> > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do
>> >> > > > > >>>> further lazy operations in Spark
>> >> > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
>> >> > > > > >>>> different argument. This returns the list of endpoints on the
>> >> > > > > >>>> parallel Flight server. The endpoints are not available until
>> >> > > > > >>>> data is ready to be fetched, which is done after the schema but
>> >> > > > > >>>> is needed before DoGet is called.
>> >> > > > > >>>> 3) call get stream on all endpoints from 2
>> >> > > > > >>>>
>> >> > > > > >>>> I think I have to do each step; however, I don't like having to
>> >> > > > > >>>> call getInfo twice, as it doesn't seem very elegant. I see a few
>> >> > > > > >>>> options:
>> >> > > > > >>>> 1) live with calling GetFlightInfo twice and with a custom bytes
>> >> > > > > >>>> cmd to differentiate the purpose of each call
>> >> > > > > >>>> 2) add an argument to GetFlightInfo to tell it it's being called
>> >> > > > > >>>> only for the schema
>> >> > > > > >>>> 3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to
>> >> > > > > >>>> return just the Schema in question
>> >> > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
>> >> > > > > >>>>
>> >> > > > > >>>> I am aware that 4 is probably the least disruptive, but I'm also
>> >> > > > > >>>> not a fan as (to me) it implies performing an action on the
>> >> > > > > >>>> server side. Suggestions 2 & 3 are larger changes and I am
>> >> > > > > >>>> reluctant to do that unless there is a consensus here. None of
>> >> > > > > >>>> them are great options and I am wondering what everyone thinks
>> >> > > > > >>>> the best approach might be? Particularly as I think this is
>> >> > > > > >>>> likely to come up in more applications than just Spark.
>> >> > > > > >>>>
>> >> > > > > >>>> Best,
>> >> > > > > >>>> Ryan
>> >> > > > > >>>>
>> >> > > > > >>
>> >> > > > >
>> >> > > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > >
>> >> > > Ryan Murray  | Principal Consulting Engineer
>> >> > >
>> >> > > +447540852009 | rym...@dremio.com
>> >> > >
>> >> > > <https://www.dremio.com/>
>> >> > > Check out our GitHub <https://www.github.com/dremio>, join our
>> >> community
>> >> > > site <https://community.dremio.com/> & Download Dremio
>> >> > > <https://www.dremio.com/download>
>> >> > >
>> >> >
>> >>
>> >
>> >
>> >
>>
>
>
>
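
For readers following the thread, here is a minimal sketch of how option #3 could fit together, assuming a toy in-memory model rather than the real Arrow Flight API. The names ToyFlightService, LazyService, FlightInfo and read_all are hypothetical and exist only for illustration. It shows a default GetSchema that falls back to GetFlightInfo (as Antoine suggests), a schema in FlightInfo that is only best-effort (as David suggests), and the resulting GetSchema -> GetFlightInfo -> DoGet flow.

```python
from dataclasses import dataclass
from typing import List, Optional

# Toy stand-in: a "schema" here is just a list of column names.
Schema = List[str]


@dataclass
class FlightInfo:
    endpoints: List[str]
    schema: Optional[Schema] = None  # best-effort; a server may leave it unset


class ToyFlightService:
    """Hypothetical base service. This is NOT the Arrow Flight API."""

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        raise NotImplementedError

    def get_schema(self, descriptor: str) -> Schema:
        # Default implementation: derive the schema from GetFlightInfo and
        # signal an explicit error when the server did not provide one.
        info = self.get_flight_info(descriptor)
        if info.schema is None:
            raise LookupError(f"schema unavailable for {descriptor!r}")
        return info.schema

    def do_get(self, endpoint: str) -> List[dict]:
        raise NotImplementedError


class LazyService(ToyFlightService):
    """Knows the schema cheaply; prepares endpoints only on GetFlightInfo."""

    def __init__(self) -> None:
        self.prepared = 0  # counts how often endpoints were prepared

    def get_schema(self, descriptor: str) -> Schema:
        return ["id", "value"]

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        self.prepared += 1
        return FlightInfo(endpoints=[f"{descriptor}/part-{i}" for i in range(2)])

    def do_get(self, endpoint: str) -> List[dict]:
        return [{"id": 0, "value": endpoint}]


def read_all(service: ToyFlightService, descriptor: str):
    schema = service.get_schema(descriptor)     # step 1: schema only
    info = service.get_flight_info(descriptor)  # step 2: endpoints
    rows = [row for ep in info.endpoints for row in service.do_get(ep)]  # step 3
    return schema, rows
```

In this sketch the client calls GetFlightInfo only once, and a service that cannot produce a schema raises an error from get_schema instead of returning a sentinel value.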
