Thank you for doing this, left a few questions on the GH issue I would adopt this proposal as soon as it makes it into nightlies (or possibly earlier if it's just a matter of regenerating the proto definitions)
The operation flow would be like this, or what would it look like? Client ---> GetFlightInfo (query/update operation in payload) ---> Server ---> Results (non-streamed) On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <emkornfi...@gmail.com> wrote: > Some people have already left comments on > https://github.com/apache/arrow/pull/12571 More eyes on it would be > appreciated. If there aren't more comments, I'll try to start implementing > this feature in Flight next week, and hopefully have a vote after it is > supported in Java and C++/Python. > > > Thanks, > Micah > > On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > > > I put together straw-man proposal in PR [1] for the Flight changes. > > Ultimately, it seemed based on the use-cases discussed inlining the data > on > > the Ticket made the most sense. This might be overly complex (I'm not > sure > > how I feel about a enum indicating partial vs full results) but welcome > > feedback. Once we get consensus on this proposal, I can add changes to > > Flight SQL and try to provide reference implementations. > > > > [1] https://github.com/apache/arrow/pull/12571 > > > > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <emkornfi...@gmail.com> > > wrote: > > > >> Would it make sense to make this part of DoGet since it > >>> still would be returning a record batch > >> > >> I would lean against this. I think in many cases the client doesn't know > >> the size of the data that it expects. Leaving the flexibility on the > >> server side to send back inlined data when it thinks it makes sense, or > a > >> bunch of tickets when there is in fact a lot of data seems like the best > >> option here. > >> > >> For cases like previewing data, you usually just want to get a small > >>> amount > >>> of data quickly. > >> > >> This is interesting and might be an additional use case. If we did > >> decide to extend FlightInfo we might also want a way of annotating > inlined > >> data with its corresponding ticket. That way even for large results, > you > >> could still send back a small preview if desired. > >> > >> After considering it a little bit I think I'm sold that inlined data > >> should not replace a ticket. So in my mind the open question is whether > >> the client needs to actively opt-in to inlined data. The scenarios I > could > >> come with where inlined data isn't useful are: > >> 1. The client is an old client and isn't aware inline data might be > >> returned. In this case the main cost is of extra data on the wire and > >> storing it as unknown fields [1]. > >> 2. The client is a new client but still doesn't want to get inline data > >> (it might want to distribute all consumption to other processes). Same > >> cost is paid as option 1. > >> > >> Are there other scenarios? If servers choose reasonable limits on what > >> data to inline, the extra complexity of negotiating with the client in > this > >> case might not be worth the benefits. > >> > >> Cheers, > >> Micah > >> > >> > >> [1] https://developers.google.com/protocol-buffers/docs/proto3#unknowns > >> > >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> wrote: > >> > >>> I think this would be a useful feature and be nice to have in Flight > >>> core. > >>> For cases like previewing data, you usually just want to get a small > >>> amount > >>> of data quickly. Would it make sense to make this part of DoGet since > it > >>> still would be returning a record batch? Perhaps a Ticket could be made > >>> to > >>> have an optional FlightDescriptor that would serve as an all-in-one > shot? > >>> > >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> wrote: > >>> > >>> > I agree with something along Antoine's proposal, though: maybe we > >>> should > >>> > be more structured with the flags (akin to what Micah mentioned with > >>> the > >>> > Feature enum). > >>> > > >>> > Also, the flag could be embedded into the Flight SQL messages > instead. > >>> (So > >>> > in effect, Flight would only add the capability to return data with > >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide > how > >>> > they want to take advantage of that.) > >>> > > >>> > I think having a completely separate method and return type and > having > >>> to > >>> > poll for it beforehand somewhat defeats the purpose of having > it/would > >>> be > >>> > much harder of a transition. > >>> > > >>> > Also: it should be `repeated FlightInfo inline_data` right? In case > we > >>> > also need dictionary batches? > >>> > > >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote: > >>> > > Can we just add the following field to the FlightDescriptor > message: > >>> > > > >>> > > bool accept_inline_data = 4; > >>> > > > >>> > > and this one to the FlightInfo message: > >>> > > > >>> > > FlightData inline_data = 100; > >>> > > > >>> > > Then new clients can `accept_inline_data` to true (the default > being > >>> > > false if omitted) to signal servers that they can put the data if > >>> > > `inline_data` if deemed small enough. > >>> > > > >>> > > (the `accept_inline_data` field could also be used to the Criteria > >>> > > message) > >>> > > > >>> > > > >>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty > >>> > > (FlightDescriptor being used in other contexts where > >>> > > `accept_inline_data` makes no sense), we can instead define a new > >>> > > method: > >>> > > > >>> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) {} > >>> > > > >>> > > with: > >>> > > > >>> > > message GetFlightInfoRequest { > >>> > > FlightDescriptor flight_descriptor = 1; > >>> > > bool accept_inline_data = 2; > >>> > > } > >>> > > > >>> > > Regards > >>> > > > >>> > > Antoine. > >>> > > > >>> > > > >>> > > On Mon, 28 Feb 2022 11:29:12 -0800 > >>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote: > >>> > >> This seems reasonable, however we need to account for existing > >>> Flight > >>> > >> clients that were written before this. > >>> > >> > >>> > >> It seems like the server will need to still handle the ticket > >>> returned > >>> > for > >>> > >> getStream() for clients that are unaware of the small result > >>> > optimization. > >>> > >> > >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <lidav...@apache.org> > >>> wrote: > >>> > >> > >>> > >> > Ah, that makes more sense, that would be a reasonable extension > to > >>> > Flight > >>> > >> > overall. (While we're at it, I think it would help to have an > >>> > app_metadata > >>> > >> > field in FlightInfo as well.) > >>> > >> > > >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote: > >>> > >> > >> > >>> > >> > >> But it seems reasonable to add a one-shot query path using > >>> DoGet. > >>> > >> > > > >>> > >> > > > >>> > >> > > I was thinking more of adding a bytes field to FlightInfo that > >>> > could > >>> > >> > store > >>> > >> > > arrow data. That way GetFlightInfo would be the only RPC > >>> necessary > >>> > for > >>> > >> > > small results when executing a CMD. The client doesn't > >>> necessarily > >>> > know > >>> > >> > > whether a query will return large or small results. > >>> > >> > > > >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li < > lidav...@apache.org> > >>> > wrote: > >>> > >> > > > >>> > >> > >> I think the focus was on large result sets (though I don't > >>> recall > >>> > this > >>> > >> > >> being discussed before) and supporting multi-node setups > (hence > >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable > to > >>> add > >>> > a > >>> > >> > >> one-shot query path using DoGet. > >>> > >> > >> > >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote: > >>> > >> > >> > I saw the same. A small, stateless query ability would be > >>> nice > >>> > >> > >> (connection > >>> > >> > >> > open, initialization, query in one message, the resultset > in > >>> > the > >>> > >> > response > >>> > >> > >> > in one message) > >>> > >> > >> > > >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield < > >>> > emkornfi...@gmail.com> > >>> > >> > >> wrote: > >>> > >> > >> > > >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not > sure > >>> if > >>> > I'm > >>> > >> > >> missing > >>> > >> > >> >> it but is there any optimization for small results? My > >>> concern > >>> > is > >>> > >> > that > >>> > >> > >> the > >>> > >> > >> >> overhead of the RPCs for the DoGet after executing the > query > >>> > could > >>> > >> > add > >>> > >> > >> >> non-trivial latency for smaller results. > >>> > >> > >> >> > >>> > >> > >> >> Has anybody else thought about this/investigated it? Am I > >>> > >> > understanding > >>> > >> > >> >> this correctly? > >>> > >> > >> >> > >>> > >> > >> >> Thanks, > >>> > >> > >> >> Micah > >>> > >> > >> >> > >>> > >> > >> > >>> > >> > > >>> > >> > >>> > >> > >>> > > >>> > >> >