Thanks for the details. I'll note a few things below, but adding schema evolution to Flight is reasonable if you'd like to put together a proposal for discussion (possibly in coordination with the Deephaven/Barrage team, if they're still interested).

> 3. Assume that there is a strong reason to query A1,..,AK together.

While I don't know the details here, at least with Flight/gRPC, it's not necessarily expensive to make several requests to the same server, as gRPC will consolidate them into the same underlying network connection. You could issue one GetFlightInfo request for all streams at once, and get back a list of endpoints for each individual subquery, which you could then issue separate DoGet requests for. There's a slight mismatch there in that GetFlightInfo returns a FlightInfo, which assumes all endpoints have the same schema. But for a specific application, you could ignore that field (nothing in Flight checks that schema against the actual data).

Of course, if said strong reason is that all the data is really retrieved together despite being distinct datasets, then this would complicate the server side implementation quite a bit. But it's one option.

> A potential way to address this(with the existing tools) could be having a
> union schema of all fields across all entities(potentially prefixed with
> the field name just like in sql joins) and setting the values to NA which
> do not belong to an entity.

I had a similar use case in the past, and it was suggested to use Arrow's Union type which handles this directly. A Union of Struct types essentially lets you have multiple distinct schemas all encoded in the same overall table, with explicit information about which schema is currently in use. But as you point out this isn't helpful if you don't know all the schemas up front.

Best,
David

On 2021/04/13 11:21:20, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> Hi David,
>
> Thanks for sharing the link!
>
> Here is how a potential use case might look like:
>
> 1. Assume that we have a service S which accepts expressions in some
> language X.
> 2. Assume that a typical query to this service requests entities A_1,
> A_2,..,A_K. Each of those entities generates a stream of record batches.
> Record batches for a single A_I share the same schema, yet there is no
> guarantee that schemas are equal across all streams.
> 3. Assume that there is a strong reason to query A1,..,AK together.
> 4. Service generates record batches(concurrently), tags those(e.g. with
> schema level metadata) and sends them over.
>
> A potential way to address this(with the existing tools) could be having a
> union schema of all fields across all entities(potentially prefixed with
> the field name just like in sql joins) and setting the values to NA which
> do not belong to an entity. However this solution might not work in cases
> where we are not able to construct the unified schema before opening the
> stream(e.g. in case of changes in the schema for a specific entity upon
> realtime input feeding or an unpredictable generator expression).
>
> Cheers,
> Gosh
>
>
> On Mon., 12 Apr. 2021, 13:45 David Li, <lidav...@apache.org> wrote:
>
> > Hi Gosh,
> >
> > There was indeed a discussion where schema evolution was proposed as a
> > solution for another use case:
> >
> > https://lists.apache.org/thread.html/re800c63f0eb08022c8cd5e1b2236fd69a2e85afdc34daf6b75e3b7b3%40%3Cdev.arrow.apache.org%3E
> >
> > I am curious though, what is your use case here?
> >
> > Best,
> > David
> >
> > On 2021/04/12 10:49:00, Gosh Arzumanyan <gosh...@gmail.com> wrote:
> > > Hi guys, hope you are well!
> > >
> > > Judging from the Flight API
> > > <https://github.com/apache/arrow/blob/5b08205f7e864ed29f53ed3d836845fed62d5d4a/cpp/src/arrow/flight/types.h#L461>
> > > and
> > > from the documentation/examples out there, it seems like data schema is
> > > supposed to be fixed per stream in ArrowFlight(which is also aligned with
> > > corresponding IPC stream writers/readers).
> > > Wondering if the community has evaluated the necessity/possibility of
> > > supporting schema changes within a single stream(I do recall seeing a
> > > discussion on this somewhere but can't find it)?
> > >
> > > Cheers,
> > > Gosh
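
P.S. To make the Union-of-Struct idea mentioned above a bit more concrete, here is a rough plain-Python model of a dense union: every row carries a type id saying which schema it uses, plus an offset into that schema's own child storage. This is only a sketch of the layout, not the Arrow or pyarrow API; the class `DenseUnionColumn` and its methods are hypothetical names made up for illustration.

```python
class DenseUnionColumn:
    """Toy model of a dense union of struct types (hypothetical, not pyarrow)."""

    def __init__(self, schemas):
        # All child schemas must be declared up front, one per type id.
        self.schemas = [tuple(s) for s in schemas]
        # One child "array" (list of rows) per schema.
        self.children = [[] for _ in self.schemas]
        self.type_ids = []  # per row: which schema is in use
        self.offsets = []   # per row: position inside that schema's child

    def append(self, type_id, row):
        # Reject rows whose field names do not match the declared schema.
        if tuple(row) != self.schemas[type_id]:
            raise ValueError(f"row does not match schema {type_id}")
        child = self.children[type_id]
        self.type_ids.append(type_id)
        self.offsets.append(len(child))
        child.append(dict(row))

    def __len__(self):
        return len(self.type_ids)

    def __getitem__(self, i):
        # Resolve a logical row through its type id and offset.
        type_id = self.type_ids[i]
        return type_id, self.children[type_id][self.offsets[i]]


# Two entities with different schemas interleaved in one logical column.
col = DenseUnionColumn([("id", "price"), ("name",)])
col.append(0, {"id": 1, "price": 9.5})
col.append(1, {"name": "a"})
col.append(0, {"id": 2, "price": 3.0})

for i in range(len(col)):
    print(col[i])
```

Note that the schemas are fixed when the column is constructed, which is exactly the limitation discussed above: a row with a schema that was not declared up front has nowhere to go.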