Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

David Li Tue, 13 Apr 2021 08:13:49 -0700

Thanks for the details. I'll note a few things, but adding schema
evolution to Flight is reasonable, if you'd like to put together a
proposal for discussion (possibly in coordination with the
Deephaven/Barrage team, if they're also still interested).

>    3. Assume that there is a strong reason to query A_1, ..., A_K together.

While I don't know the details here, at least with Flight/gRPC, it's
not necessarily expensive to make several requests to the same server,
as gRPC will multiplex them over the same underlying network
connection. You could issue one GetFlightInfo request for all streams
at once and get back a list of endpoints, one per individual
subquery, and then issue a separate DoGet request for each.

There's a slight mismatch there in that GetFlightInfo returns a
FlightInfo, which assumes all endpoints have the same schema. But for
a specific application, you could ignore that field (nothing in Flight
checks that schema against the actual data).

Of course, if said strong reason is that all the data is really
retrieved together despite being distinct datasets, then this would
complicate the server side implementation quite a bit. But it's one
option.

> A potential way to address this (with the existing tools) could be having a
> union schema of all fields across all entities (potentially prefixed with
> the field name, just like in SQL joins) and setting the values which do not
> belong to an entity to NA.

I had a similar use case in the past, and it was suggested to use
Arrow's Union type, which handles this directly. A Union of Struct
types essentially lets you have multiple distinct schemas all encoded
in the same overall table, with explicit information about which
schema is in use for each row. But as you point out, this isn't helpful if
you don't know all the schemas up front.

Best,
David

On 2021/04/13 11:21:20, Gosh Arzumanyan <[email protected]> wrote: 
> Hi David,
> 
> Thanks for sharing the link!
> 
> Here is what a potential use case might look like:
> 
>    1. Assume that we have a service S which accepts expressions in some
>    language X.
>    2. Assume that a typical query to this service requests entities A_1,
>    A_2, ..., A_K. Each of those entities generates a stream of record batches.
>    Record batches for a single A_i share the same schema, yet there is no
>    guarantee that schemas are equal across all streams.
>    3. Assume that there is a strong reason to query A_1, ..., A_K together.
>    4. The service generates record batches (concurrently), tags them (e.g.
>    with schema-level metadata) and sends them over.
> 
> A potential way to address this (with the existing tools) could be having a
> union schema of all fields across all entities (potentially prefixed with
> the field name, just like in SQL joins) and setting the values which do not
> belong to an entity to NA. However, this solution might not work in cases
> where we are not able to construct the unified schema before opening the
> stream (e.g. in case of changes in the schema for a specific entity upon
> realtime input feeding or an unpredictable generator expression).
> 
> Cheers,
> Gosh
> 
> 
> On Mon., 12 Apr. 2021, 13:45 David Li, <[email protected]> wrote:
> 
> > Hi Gosh,
> >
> > There was indeed a discussion where schema evolution was proposed as a
> > solution for another use case:
> >
> > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> >
> > I am curious though, what is your use case here?
> >
> > Best,
> > David
> >
> > On 2021/04/12 10:49:00, Gosh Arzumanyan <[email protected]> wrote:
> > > Hi guys, hope you are well!
> > >
> > > Judging from the Flight API
> > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > and from the documentation/examples out there, it seems like the data
> > > schema is supposed to be fixed per stream in ArrowFlight (which is also
> > > aligned with the corresponding IPC stream writers/readers).
> > > I am wondering if the community has evaluated the necessity/possibility of
> > > supporting schema changes within a single stream (I do recall seeing a
> > > discussion on this somewhere but can't find it)?
> > >
> > > Cheers,
> > > Gosh
> > >
> >
>
