Following up here - what are the next steps? The RFC PR looks fairly complete, maybe we can help build out implementations in C++/Java/other languages in preparation for a vote?
On Wed, Mar 9, 2022, at 00:23, Micah Kornfield wrote: >> >> The operation flow would be like this, or what would it look like? >> Client ---> GetFlightInfo (query/update operation in payload) ---> Server >> ---> Results (non-streamed) > > > This is roughly the flow I was imagining if the server chooses to send back > inlined data. > > -Micah > > On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray <ray.gavi...@gmail.com> wrote: > >> Thank you for doing this, left a few questions on the GH issue >> >> I would adopt this proposal as soon as it makes it into nightlies >> (or possibly earlier if it's just a matter of regenerating the proto >> definitions) >> >> The operation flow would be like this, or what would it look like? >> >> Client ---> GetFlightInfo (query/update operation in payload) ---> Server >> ---> Results (non-streamed) >> >> >> >> >> On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> >>> Some people have already left comments on >>> https://github.com/apache/arrow/pull/12571 More eyes on it would be >>> appreciated. If there aren't more comments, I'll try to start >>> implementing >>> this feature in Flight next week, and hopefully have a vote after it is >>> supported in Java and C++/Python. >>> >>> >>> Thanks, >>> Micah >>> >>> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <emkornfi...@gmail.com> >>> wrote: >>> >>> > I put together straw-man proposal in PR [1] for the Flight changes. >>> > Ultimately, it seemed based on the use-cases discussed inlining the >>> data on >>> > the Ticket made the most sense. This might be overly complex (I'm not >>> sure >>> > how I feel about a enum indicating partial vs full results) but welcome >>> > feedback. Once we get consensus on this proposal, I can add changes to >>> > Flight SQL and try to provide reference implementations. >>> > >>> > [1] https://github.com/apache/arrow/pull/12571 >>> > >>> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield <emkornfi...@gmail.com> >>> > wrote: >>> > >>> >> Would it make sense to make this part of DoGet since it >>> >>> still would be returning a record batch >>> >> >>> >> I would lean against this. I think in many cases the client doesn't >>> know >>> >> the size of the data that it expects. Leaving the flexibility on the >>> >> server side to send back inlined data when it thinks it makes sense, >>> or a >>> >> bunch of tickets when there is in fact a lot of data seems like the >>> best >>> >> option here. >>> >> >>> >> For cases like previewing data, you usually just want to get a small >>> >>> amount >>> >>> of data quickly. >>> >> >>> >> This is interesting and might be an additional use case. If we did >>> >> decide to extend FlightInfo we might also want a way of annotating >>> inlined >>> >> data with its corresponding ticket. That way even for large results, >>> you >>> >> could still send back a small preview if desired. >>> >> >>> >> After considering it a little bit I think I'm sold that inlined data >>> >> should not replace a ticket. So in my mind the open question is >>> whether >>> >> the client needs to actively opt-in to inlined data. The scenarios I >>> could >>> >> come with where inlined data isn't useful are: >>> >> 1. The client is an old client and isn't aware inline data might be >>> >> returned. In this case the main cost is of extra data on the wire and >>> >> storing it as unknown fields [1]. >>> >> 2. The client is a new client but still doesn't want to get inline >>> data >>> >> (it might want to distribute all consumption to other processes). Same >>> >> cost is paid as option 1. >>> >> >>> >> Are there other scenarios? If servers choose reasonable limits on what >>> >> data to inline, the extra complexity of negotiating with the client in >>> this >>> >> case might not be worth the benefits. >>> >> >>> >> Cheers, >>> >> Micah >>> >> >>> >> >>> >> [1] >>> https://developers.google.com/protocol-buffers/docs/proto3#unknowns >>> >> >>> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> >>> wrote: >>> >> >>> >>> I think this would be a useful feature and be nice to have in Flight >>> >>> core. >>> >>> For cases like previewing data, you usually just want to get a small >>> >>> amount >>> >>> of data quickly. Would it make sense to make this part of DoGet since >>> it >>> >>> still would be returning a record batch? Perhaps a Ticket could be >>> made >>> >>> to >>> >>> have an optional FlightDescriptor that would serve as an all-in-one >>> shot? >>> >>> >>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> wrote: >>> >>> >>> >>> > I agree with something along Antoine's proposal, though: maybe we >>> >>> should >>> >>> > be more structured with the flags (akin to what Micah mentioned with >>> >>> the >>> >>> > Feature enum). >>> >>> > >>> >>> > Also, the flag could be embedded into the Flight SQL messages >>> instead. >>> >>> (So >>> >>> > in effect, Flight would only add the capability to return data with >>> >>> > FlightInfo, and it's up to applications, like Flight SQL, to decide >>> how >>> >>> > they want to take advantage of that.) >>> >>> > >>> >>> > I think having a completely separate method and return type and >>> having >>> >>> to >>> >>> > poll for it beforehand somewhat defeats the purpose of having >>> it/would >>> >>> be >>> >>> > much harder of a transition. >>> >>> > >>> >>> > Also: it should be `repeated FlightInfo inline_data` right? In case >>> we >>> >>> > also need dictionary batches? >>> >>> > >>> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote: >>> >>> > > Can we just add the following field to the FlightDescriptor >>> message: >>> >>> > > >>> >>> > > bool accept_inline_data = 4; >>> >>> > > >>> >>> > > and this one to the FlightInfo message: >>> >>> > > >>> >>> > > FlightData inline_data = 100; >>> >>> > > >>> >>> > > Then new clients can `accept_inline_data` to true (the default >>> being >>> >>> > > false if omitted) to signal servers that they can put the data if >>> >>> > > `inline_data` if deemed small enough. >>> >>> > > >>> >>> > > (the `accept_inline_data` field could also be used to the Criteria >>> >>> > > message) >>> >>> > > >>> >>> > > >>> >>> > > Alternatively, if the FlightDescriptor expansion looks a bit dirty >>> >>> > > (FlightDescriptor being used in other contexts where >>> >>> > > `accept_inline_data` makes no sense), we can instead define a new >>> >>> > > method: >>> >>> > > >>> >>> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns (FlightInfo) >>> {} >>> >>> > > >>> >>> > > with: >>> >>> > > >>> >>> > > message GetFlightInfoRequest { >>> >>> > > FlightDescriptor flight_descriptor = 1; >>> >>> > > bool accept_inline_data = 2; >>> >>> > > } >>> >>> > > >>> >>> > > Regards >>> >>> > > >>> >>> > > Antoine. >>> >>> > > >>> >>> > > >>> >>> > > On Mon, 28 Feb 2022 11:29:12 -0800 >>> >>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote: >>> >>> > >> This seems reasonable, however we need to account for existing >>> >>> Flight >>> >>> > >> clients that were written before this. >>> >>> > >> >>> >>> > >> It seems like the server will need to still handle the ticket >>> >>> returned >>> >>> > for >>> >>> > >> getStream() for clients that are unaware of the small result >>> >>> > optimization. >>> >>> > >> >>> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li <lidav...@apache.org> >>> >>> wrote: >>> >>> > >> >>> >>> > >> > Ah, that makes more sense, that would be a reasonable >>> extension to >>> >>> > Flight >>> >>> > >> > overall. (While we're at it, I think it would help to have an >>> >>> > app_metadata >>> >>> > >> > field in FlightInfo as well.) >>> >>> > >> > >>> >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote: >>> >>> > >> > >> >>> >>> > >> > >> But it seems reasonable to add a one-shot query path using >>> >>> DoGet. >>> >>> > >> > > >>> >>> > >> > > >>> >>> > >> > > I was thinking more of adding a bytes field to FlightInfo >>> that >>> >>> > could >>> >>> > >> > store >>> >>> > >> > > arrow data. That way GetFlightInfo would be the only RPC >>> >>> necessary >>> >>> > for >>> >>> > >> > > small results when executing a CMD. The client doesn't >>> >>> necessarily >>> >>> > know >>> >>> > >> > > whether a query will return large or small results. >>> >>> > >> > > >>> >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li < >>> lidav...@apache.org> >>> >>> > wrote: >>> >>> > >> > > >>> >>> > >> > >> I think the focus was on large result sets (though I don't >>> >>> recall >>> >>> > this >>> >>> > >> > >> being discussed before) and supporting multi-node setups >>> (hence >>> >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems reasonable >>> to >>> >>> add >>> >>> > a >>> >>> > >> > >> one-shot query path using DoGet. >>> >>> > >> > >> >>> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote: >>> >>> > >> > >> > I saw the same. A small, stateless query ability would be >>> >>> nice >>> >>> > >> > >> (connection >>> >>> > >> > >> > open, initialization, query in one message, the resultset >>> in >>> >>> > the >>> >>> > >> > response >>> >>> > >> > >> > in one message) >>> >>> > >> > >> > >>> >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield < >>> >>> > emkornfi...@gmail.com> >>> >>> > >> > >> wrote: >>> >>> > >> > >> > >>> >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not >>> sure >>> >>> if >>> >>> > I'm >>> >>> > >> > >> missing >>> >>> > >> > >> >> it but is there any optimization for small results? My >>> >>> concern >>> >>> > is >>> >>> > >> > that >>> >>> > >> > >> the >>> >>> > >> > >> >> overhead of the RPCs for the DoGet after executing the >>> query >>> >>> > could >>> >>> > >> > add >>> >>> > >> > >> >> non-trivial latency for smaller results. >>> >>> > >> > >> >> >>> >>> > >> > >> >> Has anybody else thought about this/investigated it? Am >>> I >>> >>> > >> > understanding >>> >>> > >> > >> >> this correctly? >>> >>> > >> > >> >> >>> >>> > >> > >> >> Thanks, >>> >>> > >> > >> >> Micah >>> >>> > >> > >> >> >>> >>> > >> > >> >>> >>> > >> > >>> >>> > >> >>> >>> > >> >>> >>> > >>> >>> >>> >> >>> >>