Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Nate Bauernfeind Fri, 25 Jun 2021 17:18:25 -0700

> makes it more difficult to bring schema evolution back into the
> IPC Stream format (i.e. it would live only in flight)


Gosh's proposal extends the flatbuffer structures not the protobufs. Can
you help me understand how difficult it would be to bring the `schema_id`
approach to the IPC stream format?

Side Question: Why isn't the IPC stream format a series of the flight
protobufs? It's a real shame that there is no standard way to
capture/replay a stream with app_metadata. (Obviously ignoring the
annoyances around protobuf wrapping flatbuffers.)

> In both cases there are also potentially thorny issues related to
> dictionary usages that we will need to resolve.

I hope changing how dictionaries work can be avoided. There are some nice
benefits to sharing a dictionary across schema evolution (esp. for frequent
switches between a few schemas). Don't forget, we already have the ability
to clear / overwrite existing dictionaries. It might be nice to leave the
behavior as-is.

On Fri, Jun 25, 2021 at 4:07 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Sorry for the second reply:
>
>
>>    1. In our case we do expect relatively frequent changes in the schema
>>    of the batch being sent out. I don't see that pattern changing in the mid
>>    term for a good reason. However long term maybe it will be possible to
>>    leverage separate RPC calls. I left some description in the comments 
>> thread
>>    here
>>    
>> <https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?disco=AAAAMkkJ9xQ>
>>    .
>>
>> I think we should clarify between two types of changes:
> 1.  Completely new  unique schema needed.
> 2.  Relatively frequent switches between a few known schemas.
>
> On Fri, Jun 25, 2021 at 2:58 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>
>>>    1. Re complexity of "one schema at a time" vs "schema id based": i
>>>    think they are not much different, right? In fact the second one is more 
>>> of
>>>    an optimization to the first one which is beneficial to us. Anyways even
>>>    with the first approach you need to add some logic of schema change. 
>>> Adding
>>>    the schema identification is not that complex at first glance.
>>>
>>> The second approach requires more state being maintained on the client
>> side and makes it more difficult to bring schema evolution back into the
>> IPC Stream format (i.e. it would live only in flight) without additional
>> changes.
>>
>> In both cases there are also potentially thorny issues related to
>> dictionary usages that we will need to resolve.  I haven't thought deeply
>> about it but it seems that these might be more complex on a multiplexed
>> channel.
>>
>>
>> On Fri, Jun 25, 2021 at 11:52 AM Gosh Arzumanyan <gosh...@gmail.com>
>> wrote:
>>
>>> Hi Micah,
>>>
>>> Sure, let me do it here:
>>>
>>>    1. In our case we do expect relatively frequent changes in the
>>>    schema of the batch being sent out. I don't see that pattern changing in
>>>    the mid term for a good reason. However long term maybe it will be 
>>> possible
>>>    to leverage separate RPC calls. I left some description in the comments
>>>    thread here
>>>    
>>> <https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?disco=AAAAMkkJ9xQ>
>>>    .
>>>    2. Regarding the union of structs trick: while it is a good
>>>    workaround for most of the cases as of now there are also some aspects
>>>    which deserve consideration:
>>>       1. There is a  space overhead
>>>       2. There is an additional glue code to make it work
>>>       3. Considering that this solution is bubbling up frequently in
>>>       similar contexts, seems like users might benefit if it  was "natively"
>>>       supported and they could just focus on populating the schemas they 
>>> really
>>>       need to.
>>>    3. Re complexity of "one schema at a time" vs "schema id based": i
>>>    think they are not much different, right? In fact the second one is more 
>>> of
>>>    an optimization to the first one which is beneficial to us. Anyways even
>>>    with the first approach you need to add some logic of schema change. 
>>> Adding
>>>    the schema identification is not that complex at first glance.
>>>
>>> Cheers,
>>> Gosh
>>>
>>> On Fri, Jun 25, 2021 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
>>> wrote:
>>>
>>>> >
>>>> > 1. It seems like renaming stream_id to schema_id and delegating
>>>> "logical
>>>> > stream" distinction to app_metadata mitigates the "multiplexing" point
>>>> > while at the same time it gives enough flexibility to address both
>>>> Nate's
>>>> > and our use cases.
>>>>
>>>>
>>>> I don't think this is the case.  It seems that having no additional
>>>> fields
>>>> added and then sending a new schema when necessary combined with Union
>>>> of
>>>> Structs would solve most use-cases.  The main downside could be
>>>> potential
>>>> performance implications if the schema is changing frequently.  Gosh,
>>>> could
>>>> you address why this wouldn't be sufficient (either here or on the doc).
>>>>
>>>> Thanks,
>>>> -Micah
>>>>
>>>> On Fri, Jun 25, 2021 at 5:30 AM Gosh Arzumanyan <gosh...@gmail.com>
>>>> wrote:
>>>>
>>>> > Hi guys,
>>>> >
>>>> > Thanks for sharing your insights/concerns! I also left some comments
>>>> based
>>>> > on the discussion we had. Briefly:
>>>> >
>>>> > 1. It seems like renaming stream_id to schema_id and delegating
>>>> "logical
>>>> > stream" distinction to app_metadata mitigates the "multiplexing" point
>>>> > while at the same time it gives enough flexibility to address both
>>>> Nate's
>>>> > and our use cases.
>>>> > 2. To David's point about other transports: in fact currently we are
>>>> using
>>>> > other transports(aside from gRPC) so we don't wanna depend on only
>>>> gRPC
>>>> > features.
>>>> >
>>>> > Cheers,
>>>> > Gosh
>>>> >
>>>> > On Wed, Jun 23, 2021 at 10:40 PM David Li <lidav...@apache.org>
>>>> wrote:
>>>> >
>>>> > > Thanks for chiming in - I've replied in the doc. Scoping it to just
>>>> > schema
>>>> > > evolution would be preferable, but I'm not sure if Gosh's usecase
>>>> > requires
>>>> > > more flexibility than that or not.
>>>> > >
>>>> > > Again, though, given that 1) gRPC recycles a connection, so repeated
>>>> > calls
>>>> > > aren't necessarily expensive and 2) encoding tricks like
>>>> > union-of-structs,
>>>> > > any solution needs to be weighed against those/we should make sure
>>>> to
>>>> > > document why they aren't sufficient. (For instance, 1) is hampered
>>>> by the
>>>> > > use of L7 load balancers and/or client-side load balancing policies
>>>> in
>>>> > gRPC
>>>> > > and assumes statefulness which is undesirable in general. There's
>>>> also
>>>> > the
>>>> > > eventual desire to have a transport besides gRPC someday.)
>>>> > >
>>>> > > -David
>>>> > >
>>>> > > On Wed, Jun 23, 2021, at 16:24, Nate Bauernfeind wrote:
>>>> > >
>>>> > > Thanks for writing this up! I added a few general comments, but
>>>> have a
>>>> > > question on the approach because it's not quite what I was
>>>> expecting.
>>>> > >
>>>> > > I am slightly concerned that the proposal looks more like support
>>>> for
>>>> > > "multiplexing" IPC streams into a single RPC stream rather than
>>>> support
>>>> > for
>>>> > > a changing Schema of an otherwise consistently logical stream. gRPC
>>>> > already
>>>> > > does a good job decoupling RPC streams from one another. I feel that
>>>> > > throwing that idea into the IPC stream increases client-library
>>>> > > implementation cost by quite a lot.
>>>> > >
>>>> > > Why is it not good enough to replace the Schema when we see a
>>>> duplicate?
>>>> > > This is undoubtedly less work across all client implementations.
>>>> > >
>>>> > > The benefit I see is that you might have two schemas that you swap
>>>> > between
>>>> > > frequently then you can indicate with a single integer. If that's
>>>> what
>>>> > you
>>>> > > want to support I would rather think of them as `schema_id` instead
>>>> of
>>>> > > `stream_id` and not give this impression that multiplexing is a
>>>> goal. As
>>>> > > you have proposed, it seems that the "done writing for a stream"
>>>> needs a
>>>> > > callback notifying the user receiving the stream that a logical
>>>> subset of
>>>> > > the flight is complete. Alternatively, if they aren't independent
>>>> streams
>>>> > > (to the end-user), we could tell the Arrow layer that a particular
>>>> schema
>>>> > > is no longer needed without also needing to communicate further
>>>> > downstream.
>>>> > >
>>>> > > On Wed, Jun 23, 2021 at 1:39 PM David Li <lidav...@apache.org>
>>>> wrote:
>>>> > >
>>>> > > > Ah to be clear, the API is indeed inconsistent - DoExchange was
>>>> added
>>>> > > some
>>>> > > > time later (and by its nature returning a FlightDataStream would
>>>> not
>>>> > have
>>>> > > > been possible, since it's meant to be able to interleave
>>>> > > reading/writing).
>>>> > > > But really, DoGet is indeed the odd one out in the C++ API and it
>>>> may
>>>> > be
>>>> > > > worth correcting. You could also perhaps imagine making a
>>>> > > FlightDataStream
>>>> > > > implementation that accepts a closure and provides it a fake
>>>> writer, if
>>>> > > the
>>>> > > > API mismatch is hard to work with...
>>>> > > >
>>>> > > > That said: this has some benefits, e.g. for a Python service that
>>>> > returns
>>>> > > > a Table, that means data can be fed into gRPC entirely in C++
>>>> without
>>>> > > > having to bounce into Python for each chunk.
>>>> > > >
>>>> > > > Best,
>>>> > > > David
>>>> > > >
>>>> > > > On Wed, Jun 23, 2021, at 15:33, Gosh Arzumanyan wrote:
>>>> > > > > Hi David,
>>>> > > > >
>>>> > > > > Got you. In fact I was looking at this more from the point of
>>>> view of
>>>> > > > consistency of the API in terms of "inputs" and thought
>>>> DoExchange is
>>>> > > kind
>>>> > > > of a DoGet+ so might make sense to have the same classes being
>>>> utilized
>>>> > > in
>>>> > > > both places. But again, I might be missing something and I get the
>>>> > point
>>>> > > > about breaking change.
>>>> > > > >
>>>> > > > > Cheers,
>>>> > > > > Gosh
>>>> > > > >
>>>> > > > > On Wed, Jun 23, 2021 at 2:58 PM David Li <lidav...@apache.org>
>>>> > wrote:
>>>> > > > >> __
>>>> > > > >> It's mostly a quirk of implementation (and just for
>>>> clarification,
>>>> > > > they're all nearly identical on the format/protocol level).
>>>> > > > >>
>>>> > > > >> DoGet is conceptualized as your application returning a
>>>> readable
>>>> > > stream
>>>> > > > of batches, instead of your application imperatively writing
>>>> batches to
>>>> > > the
>>>> > > > client. (This is different than how Flight is implemented in
>>>> Java.) You
>>>> > > > would normally not implement FlightDataStream - you would return a
>>>> > > > RecordBatchStream.
>>>> > > > >>
>>>> > > > >> DoGet could not have FlightMessageWriter as a return type as
>>>> that
>>>> > > > wouldn't make sense, but it could accept an instance of that as a
>>>> > > parameter
>>>> > > > instead, much like DoExchange. That would be a breaking change.
>>>> > > > >>
>>>> > > > >> Best,
>>>> > > > >> David
>>>> > > > >>
>>>> > > > >> On Wed, Jun 23, 2021, at 08:47, Gosh Arzumanyan wrote:
>>>> > > > >>> Hi David,
>>>> > > > >>>
>>>> > > > >>> Going through the ArrowFlight API: got confused a bit on
>>>> DoGet and
>>>> > > > >>> DoPut/DoExachange apis: the former one expects
>>>> FlightDataStream
>>>> > which
>>>> > > > talks
>>>> > > > >>> in already serialized message terms while the latter to
>>>> > > > >>> accept FlightMessageReader/Writer which expect the user to
>>>> pass in
>>>> > > > >>> RecordBatches etc. Is there any reason why the DoGet can't
>>>> have
>>>> > > > >>> FlightMessageWriter as a return type?
>>>> > > > >>>
>>>> > > > >>> Cheers,
>>>> > > > >>> Gosh
>>>> > > > >>>
>>>> > > > >>> On Mon, Jun 21, 2021 at 9:47 PM Gosh Arzumanyan <
>>>> gosh...@gmail.com
>>>> > >
>>>> > > > wrote:
>>>> > > > >>>
>>>> > > > >>> > Thanks David!
>>>> > > > >>> >
>>>> > > > >>> > I also responded/added more suggestions/questions to the
>>>> doc. I
>>>> > > > think it
>>>> > > > >>> > makes sense to have two sections: one purely protocol
>>>> oriented
>>>> > and
>>>> > > > second
>>>> > > > >>> > API oriented(examples in c++ or in any other language
>>>> should make
>>>> > > > the idea
>>>> > > > >>> > easier to digest).
>>>> > > > >>> >
>>>> > > > >>> > Thanks for the reference too!
>>>> > > > >>> >
>>>> > > > >>> > Cheers,
>>>> > > > >>> > Gosh
>>>> > > > >>> >
>>>> > > > >>> > On Mon, Jun 21, 2021 at 4:41 PM David Li <
>>>> lidav...@apache.org>
>>>> > > > wrote:
>>>> > > > >>> >
>>>> > > > >>> >> Thanks! I've left some initial comments/suggestions to
>>>> expand it
>>>> > > in
>>>> > > > terms
>>>> > > > >>> >> of the format definitions and not the C++ APIs.
>>>> > > > >>> >>
>>>> > > > >>> >> I'll also note something like this was proposed a long
>>>> time ago
>>>> > -
>>>> > > > there's
>>>> > > > >>> >> not very much discussion about it there but for reference:
>>>> > > > >>> >>
>>>> > > >
>>>> > >
>>>> >
>>>> https://lists.apache.org/thread.html/0e5ba78c48cdd0e357f3a4a6d8affd31767c34376b62c001910823af%40%3Cdev.arrow.apache.org%3E
>>>> > > > >>> >> (or see the thread '[Discuss][FlightRPC] Extensions to
>>>> Flight:
>>>> > > > >>> >> "DoBidirectional"' from 2019-2020). It might be good to
>>>> address
>>>> > > why
>>>> > > > the
>>>> > > > >>> >> proposed workaround there (union-of-structs) is
>>>> insufficient for
>>>> > > > the use
>>>> > > > >>> >> cases here (and in FlightSQL).
>>>> > > > >>> >>
>>>> > > > >>> >> -David
>>>> > > > >>> >>
>>>> > > > >>> >> On Mon, Jun 21, 2021, at 08:22, Gosh Arzumanyan wrote:
>>>> > > > >>> >> > Ah sorry, comments should work now.
>>>> > > > >>> >> >
>>>> > > > >>> >> > Cheers,
>>>> > > > >>> >> > Gosh
>>>> > > > >>> >> >
>>>> > > > >>> >> > On Mon., 21 Jun. 2021, 14:18 David Li, <
>>>> lidav...@apache.org
>>>> > > > <mailto:
>>>> > > > >>> >> lidavidm%40apache.org>> wrote:
>>>> > > > >>> >> >
>>>> > > > >>> >> > > Thanks! Will give it a look.
>>>> > > > >>> >> > >
>>>> > > > >>> >> > > Would you mind opening it up for comments?
>>>> > > > >>> >> > >
>>>> > > > >>> >> > > -David
>>>> > > > >>> >> > >
>>>> > > > >>> >> > > On 2021/06/21 11:56:24, Gosh Arzumanyan <
>>>> gosh...@gmail.com
>>>> > > > <mailto:
>>>> > > > >>> >> gosharz%40gmail.com>> wrote:
>>>> > > > >>> >> > > > Hi folks,
>>>> > > > >>> >> > > >
>>>> > > > >>> >> > > > Started putting some thoughts together here:
>>>> > > > >>> >> > > >
>>>> > > > >>> >> > >
>>>> > > > >>> >>
>>>> > > >
>>>> > >
>>>> >
>>>> https://docs.google.com/document/d/1dIOpKNYwsd9sdChsRBAx37BiJXl_7enpwWkH76n1tOI/edit?usp=sharing
>>>> > > > >>> >> > > > Any feedback is welcome!
>>>> > > > >>> >> > > >
>>>> > > > >>> >> > > > Cheers,
>>>> > > > >>> >> > > > Gosh
>>>> > > > >>> >> > > >
>>>> > > > >>> >> > >
>>>> > > > >>> >> >
>>>> > > > >>> >>
>>>> > > > >>> >
>>>> > > > >>>
>>>> > > > >>
>>>> > > >
>>>> > >
>>>> > >
>>>> > > --
>>>> > >
>>>> > >
>>>> > >
>>>> >
>>>>
>>>

--

Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Reply via email to