Yes, next step is implementation which I've been delayed on. I hope to have a little time this week to work on it and will post an update.
On Wed, Apr 27, 2022 at 7:48 AM David Li <lidav...@apache.org> wrote: > Following up here - what are the next steps? The RFC PR looks fairly > complete, maybe we can help build out implementations in C++/Java/other > languages in preparation for a vote? > > On Wed, Mar 9, 2022, at 00:23, Micah Kornfield wrote: > >> > >> The operation flow would be like this, or what would it look like? > >> Client ---> GetFlightInfo (query/update operation in payload) ---> > Server > >> ---> Results (non-streamed) > > > > > > This is roughly the flow I was imagining if the server chooses to send > back > > inlined data. > > > > -Micah > > > > On Tue, Mar 8, 2022 at 11:27 AM Gavin Ray <ray.gavi...@gmail.com> wrote: > > > >> Thank you for doing this, left a few questions on the GH issue > >> > >> I would adopt this proposal as soon as it makes it into nightlies > >> (or possibly earlier if it's just a matter of regenerating the proto > >> definitions) > >> > >> The operation flow would be like this, or what would it look like? > >> > >> Client ---> GetFlightInfo (query/update operation in payload) ---> > Server > >> ---> Results (non-streamed) > >> > >> > >> > >> > >> On Tue, Mar 8, 2022 at 2:04 PM Micah Kornfield <emkornfi...@gmail.com> > >> wrote: > >> > >>> Some people have already left comments on > >>> https://github.com/apache/arrow/pull/12571 More eyes on it would be > >>> appreciated. If there aren't more comments, I'll try to start > >>> implementing > >>> this feature in Flight next week, and hopefully have a vote after it is > >>> supported in Java and C++/Python. > >>> > >>> > >>> Thanks, > >>> Micah > >>> > >>> On Fri, Mar 4, 2022 at 10:54 PM Micah Kornfield <emkornfi...@gmail.com > > > >>> wrote: > >>> > >>> > I put together straw-man proposal in PR [1] for the Flight changes. > >>> > Ultimately, it seemed based on the use-cases discussed inlining the > >>> data on > >>> > the Ticket made the most sense. This might be overly complex (I'm > not > >>> sure > >>> > how I feel about a enum indicating partial vs full results) but > welcome > >>> > feedback. Once we get consensus on this proposal, I can add changes > to > >>> > Flight SQL and try to provide reference implementations. > >>> > > >>> > [1] https://github.com/apache/arrow/pull/12571 > >>> > > >>> > On Tue, Mar 1, 2022 at 10:51 PM Micah Kornfield < > emkornfi...@gmail.com> > >>> > wrote: > >>> > > >>> >> Would it make sense to make this part of DoGet since it > >>> >>> still would be returning a record batch > >>> >> > >>> >> I would lean against this. I think in many cases the client doesn't > >>> know > >>> >> the size of the data that it expects. Leaving the flexibility on > the > >>> >> server side to send back inlined data when it thinks it makes sense, > >>> or a > >>> >> bunch of tickets when there is in fact a lot of data seems like the > >>> best > >>> >> option here. > >>> >> > >>> >> For cases like previewing data, you usually just want to get a small > >>> >>> amount > >>> >>> of data quickly. > >>> >> > >>> >> This is interesting and might be an additional use case. If we did > >>> >> decide to extend FlightInfo we might also want a way of annotating > >>> inlined > >>> >> data with its corresponding ticket. That way even for large > results, > >>> you > >>> >> could still send back a small preview if desired. > >>> >> > >>> >> After considering it a little bit I think I'm sold that inlined data > >>> >> should not replace a ticket. So in my mind the open question is > >>> whether > >>> >> the client needs to actively opt-in to inlined data. The scenarios > I > >>> could > >>> >> come with where inlined data isn't useful are: > >>> >> 1. The client is an old client and isn't aware inline data might be > >>> >> returned. In this case the main cost is of extra data on the wire > and > >>> >> storing it as unknown fields [1]. > >>> >> 2. The client is a new client but still doesn't want to get inline > >>> data > >>> >> (it might want to distribute all consumption to other processes). > Same > >>> >> cost is paid as option 1. > >>> >> > >>> >> Are there other scenarios? If servers choose reasonable limits on > what > >>> >> data to inline, the extra complexity of negotiating with the client > in > >>> this > >>> >> case might not be worth the benefits. > >>> >> > >>> >> Cheers, > >>> >> Micah > >>> >> > >>> >> > >>> >> [1] > >>> https://developers.google.com/protocol-buffers/docs/proto3#unknowns > >>> >> > >>> >> On Tue, Mar 1, 2022 at 10:01 PM Bryan Cutler <cutl...@gmail.com> > >>> wrote: > >>> >> > >>> >>> I think this would be a useful feature and be nice to have in > Flight > >>> >>> core. > >>> >>> For cases like previewing data, you usually just want to get a > small > >>> >>> amount > >>> >>> of data quickly. Would it make sense to make this part of DoGet > since > >>> it > >>> >>> still would be returning a record batch? Perhaps a Ticket could be > >>> made > >>> >>> to > >>> >>> have an optional FlightDescriptor that would serve as an all-in-one > >>> shot? > >>> >>> > >>> >>> On Tue, Mar 1, 2022 at 8:44 AM David Li <lidav...@apache.org> > wrote: > >>> >>> > >>> >>> > I agree with something along Antoine's proposal, though: maybe we > >>> >>> should > >>> >>> > be more structured with the flags (akin to what Micah mentioned > with > >>> >>> the > >>> >>> > Feature enum). > >>> >>> > > >>> >>> > Also, the flag could be embedded into the Flight SQL messages > >>> instead. > >>> >>> (So > >>> >>> > in effect, Flight would only add the capability to return data > with > >>> >>> > FlightInfo, and it's up to applications, like Flight SQL, to > decide > >>> how > >>> >>> > they want to take advantage of that.) > >>> >>> > > >>> >>> > I think having a completely separate method and return type and > >>> having > >>> >>> to > >>> >>> > poll for it beforehand somewhat defeats the purpose of having > >>> it/would > >>> >>> be > >>> >>> > much harder of a transition. > >>> >>> > > >>> >>> > Also: it should be `repeated FlightInfo inline_data` right? In > case > >>> we > >>> >>> > also need dictionary batches? > >>> >>> > > >>> >>> > On Tue, Mar 1, 2022, at 11:39, Antoine Pitrou wrote: > >>> >>> > > Can we just add the following field to the FlightDescriptor > >>> message: > >>> >>> > > > >>> >>> > > bool accept_inline_data = 4; > >>> >>> > > > >>> >>> > > and this one to the FlightInfo message: > >>> >>> > > > >>> >>> > > FlightData inline_data = 100; > >>> >>> > > > >>> >>> > > Then new clients can `accept_inline_data` to true (the default > >>> being > >>> >>> > > false if omitted) to signal servers that they can put the data > if > >>> >>> > > `inline_data` if deemed small enough. > >>> >>> > > > >>> >>> > > (the `accept_inline_data` field could also be used to the > Criteria > >>> >>> > > message) > >>> >>> > > > >>> >>> > > > >>> >>> > > Alternatively, if the FlightDescriptor expansion looks a bit > dirty > >>> >>> > > (FlightDescriptor being used in other contexts where > >>> >>> > > `accept_inline_data` makes no sense), we can instead define a > new > >>> >>> > > method: > >>> >>> > > > >>> >>> > > rpc GetFlightInfoEx(GetFlightInfoRequest) returns > (FlightInfo) > >>> {} > >>> >>> > > > >>> >>> > > with: > >>> >>> > > > >>> >>> > > message GetFlightInfoRequest { > >>> >>> > > FlightDescriptor flight_descriptor = 1; > >>> >>> > > bool accept_inline_data = 2; > >>> >>> > > } > >>> >>> > > > >>> >>> > > Regards > >>> >>> > > > >>> >>> > > Antoine. > >>> >>> > > > >>> >>> > > > >>> >>> > > On Mon, 28 Feb 2022 11:29:12 -0800 > >>> >>> > > James Duong <jam...@bitquilltech.com.INVALID> wrote: > >>> >>> > >> This seems reasonable, however we need to account for existing > >>> >>> Flight > >>> >>> > >> clients that were written before this. > >>> >>> > >> > >>> >>> > >> It seems like the server will need to still handle the ticket > >>> >>> returned > >>> >>> > for > >>> >>> > >> getStream() for clients that are unaware of the small result > >>> >>> > optimization. > >>> >>> > >> > >>> >>> > >> On Mon, Feb 28, 2022 at 11:26 AM David Li < > lidav...@apache.org> > >>> >>> wrote: > >>> >>> > >> > >>> >>> > >> > Ah, that makes more sense, that would be a reasonable > >>> extension to > >>> >>> > Flight > >>> >>> > >> > overall. (While we're at it, I think it would help to have > an > >>> >>> > app_metadata > >>> >>> > >> > field in FlightInfo as well.) > >>> >>> > >> > > >>> >>> > >> > On Mon, Feb 28, 2022, at 14:24, Micah Kornfield wrote: > >>> >>> > >> > >> > >>> >>> > >> > >> But it seems reasonable to add a one-shot query path > using > >>> >>> DoGet. > >>> >>> > >> > > > >>> >>> > >> > > > >>> >>> > >> > > I was thinking more of adding a bytes field to FlightInfo > >>> that > >>> >>> > could > >>> >>> > >> > store > >>> >>> > >> > > arrow data. That way GetFlightInfo would be the only RPC > >>> >>> necessary > >>> >>> > for > >>> >>> > >> > > small results when executing a CMD. The client doesn't > >>> >>> necessarily > >>> >>> > know > >>> >>> > >> > > whether a query will return large or small results. > >>> >>> > >> > > > >>> >>> > >> > > On Mon, Feb 28, 2022 at 11:04 AM David Li < > >>> lidav...@apache.org> > >>> >>> > wrote: > >>> >>> > >> > > > >>> >>> > >> > >> I think the focus was on large result sets (though I > don't > >>> >>> recall > >>> >>> > this > >>> >>> > >> > >> being discussed before) and supporting multi-node setups > >>> (hence > >>> >>> > >> > >> GetFlightInfo/DoGet are separated). But it seems > reasonable > >>> to > >>> >>> add > >>> >>> > a > >>> >>> > >> > >> one-shot query path using DoGet. > >>> >>> > >> > >> > >>> >>> > >> > >> On Mon, Feb 28, 2022, at 13:32, Adam Lippai wrote: > >>> >>> > >> > >> > I saw the same. A small, stateless query ability would > be > >>> >>> nice > >>> >>> > >> > >> (connection > >>> >>> > >> > >> > open, initialization, query in one message, the > resultset > >>> in > >>> >>> > the > >>> >>> > >> > response > >>> >>> > >> > >> > in one message) > >>> >>> > >> > >> > > >>> >>> > >> > >> > On Mon, Feb 28, 2022, 13:12 Micah Kornfield < > >>> >>> > emkornfi...@gmail.com> > >>> >>> > >> > >> wrote: > >>> >>> > >> > >> > > >>> >>> > >> > >> >> I'm rereviewing the Flight SQL interfaces, and I'm not > >>> sure > >>> >>> if > >>> >>> > I'm > >>> >>> > >> > >> missing > >>> >>> > >> > >> >> it but is there any optimization for small results? > My > >>> >>> concern > >>> >>> > is > >>> >>> > >> > that > >>> >>> > >> > >> the > >>> >>> > >> > >> >> overhead of the RPCs for the DoGet after executing the > >>> query > >>> >>> > could > >>> >>> > >> > add > >>> >>> > >> > >> >> non-trivial latency for smaller results. > >>> >>> > >> > >> >> > >>> >>> > >> > >> >> Has anybody else thought about this/investigated it? > Am > >>> I > >>> >>> > >> > understanding > >>> >>> > >> > >> >> this correctly? > >>> >>> > >> > >> >> > >>> >>> > >> > >> >> Thanks, > >>> >>> > >> > >> >> Micah > >>> >>> > >> > >> >> > >>> >>> > >> > >> > >>> >>> > >> > > >>> >>> > >> > >>> >>> > >> > >>> >>> > > >>> >>> > >>> >> > >>> > >> >