Hi Greg,

Thanks for taking a look!

1. I believe Chris suggested the use of a query parameter above as we
already have precedents for using query parameters to configure
per-request behavior in Kafka Connect (the restart connectors API and the
get connectors API, for instance). Also, the limited choice of targeted
endpoints is intentional (see my reply to the next point).

2. I intentionally targeted just the three listed endpoints where
synchronous connector config validations come into the picture. This is
because of the legitimate cases where config validation for specific
connector plugins might exceed the default request timeout in edge-case
scenarios (outlined in the KIP's motivation section). Other Connect REST
endpoints shouldn't take longer than the default 90 second request
timeout; if they do, it would be indicative of either a bug in the
Connect framework or a cluster health issue, neither of which should be
covered up by manually setting a longer request timeout.

3. & 4. I think changing the config validation behavior would be a
backward incompatible change and I wanted to avoid that in this
particular KIP. There are multiple programmatic UIs out there which rely
on the current synchronous config validation behavior, and breaking the
existing contract would definitely require a larger discussion.

Thanks,
Yash

On Fri, Mar 3, 2023 at 12:04 AM Greg Harris <greg.har...@aiven.io.invalid> wrote:

> Hey Yash,
>
> Thanks for the KIP, and sorry for the late review.
>
> 1. Have you considered an HTTP header to provide the client-configurable
> timeout? A header might more naturally extend to all of the other
> endpoints in the future, rather than duplicating the query parameter
> across endpoints.
>
> 2. I understand that this change is targeted at just long-duration
> Connector::validate calls; is that due to voluntary scope constraints?
> Implementing configurable timeouts for all endpoints in a uniform way
> could be desirable, even if the default timeout will work for nearly all
> of the other calls.
>
> 3. Did you consider adding asynchronous validation as a user-facing
> feature? I think that relying on the synchronous validation results in a
> single HTTP request is a bit of an anti-pattern, and one that we've
> inherited from the original REST design. It seems useful when using the
> REST API by hand, but seems to be a liability when used in environments
> with an external management layer.
>
> 4. Perhaps rather than allowing synchronous calls to Connector::validate
> to increase in duration, we should provide a way for connectors to
> surface validation problems later in their lifecycle. Currently there is
> ConnectorContext::raiseError, which transitions the task to FAILED;
> perhaps a similar API could asynchronously emit re-validation results.
> We've also had a problem with long start() duration for the same reasons
> as validate().
>
> I understand if you want to keep this KIP as tightly focused as possible,
> but I'm worried that it is addressing the symptom and not the problem. I
> want to make sure that this change is impactful and isn't obsoleted by a
> later improvement.
>
> Thanks,
> Greg
>
> On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <yash.ma...@gmail.com> wrote:
>
> > Hi all,
> >
> > Thanks for all the feedback and discussion. I've renamed the KIP title
> > to "Kafka Connect REST API timeout improvements" since we're
> > introducing a couple of improvements (cancelling create / update
> > connector requests when config validations time out and avoiding
> > double config validations in distributed mode) along with making the
> > request timeouts configurable.
> > The new KIP link is
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > and I've called for a vote for the KIP here -
> > https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
> >
> > Thanks,
> > Yash
> >
> > On Sat, Nov 5, 2022 at 11:42 PM Sagar <sagarmeansoc...@gmail.com> wrote:
> >
> > > Hey Yash,
> > >
> > > Thanks for the explanation. I think it should be fine to delegate the
> > > validation directly to the leader.
> > >
> > > Thanks!
> > > Sagar.
> > >
> > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <yash.ma...@gmail.com> wrote:
> > >
> > > > Hi Sagar,
> > > >
> > > > Thanks for chiming in!
> > > >
> > > > > Having said that, why does the worker forward to the leader? I am
> > > > > thinking if the worker can perform the validation on its own, we
> > > > > could let it do the validation instead of forwarding everything
> > > > > to the leader
> > > >
> > > > Only the leader is allowed to perform writes to the config topic,
> > > > so any request that requires a config topic write ends up being
> > > > forwarded to the leader. The `POST /connectors` and `PUT
> > > > /connectors/{connector}/config` endpoints call
> > > > `Herder::putConnectorConfig`, which internally does a config
> > > > validation first before initiating another herder request to write
> > > > to the config topic (in distributed mode). If we want to allow the
> > > > first worker to do the config validation, and then forward the
> > > > request to the leader just for the write to the config topic, we'd
> > > > either need something like a skip-validations query parameter on
> > > > the endpoint like Chris talks about above or else a new
> > > > internal-only endpoint that just does a write to the config topic
> > > > without any prior config validation.
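For illustration, the create / update flow described here (a single validation on the receiving worker, then a forward to the leader for the config-topic write, with an internal-only flag to skip re-validation) could be sketched roughly as below. All names, including the flag itself, are hypothetical stand-ins and not actual Connect framework APIs:

```java
import java.util.Map;

// Hedged sketch of the flow discussed above; none of these names exist in
// the Connect codebase. The point is that the receiving worker validates
// once, and the leader's write path can trust that validation via an
// internal-only "skip validation" flag.
public final class ForwardingSketch {
    static int validations = 0;

    // Stand-in for the (possibly expensive) connector config validation.
    static boolean validate(Map<String, String> config) {
        validations++;
        return config.containsKey("connector.class");
    }

    // Stand-in for the leader-only write to the config topic.
    static void leaderWrite(Map<String, String> config, boolean skipValidation) {
        if (!skipValidation) {
            validate(config); // without the flag, the config is validated a second time
        }
        // ... append the config to the config topic here ...
    }

    // Stand-in for the worker that receives the REST request: validate
    // once, then forward to the leader with skipValidation=true.
    static String putConnectorConfig(Map<String, String> config) {
        if (!validate(config)) {
            return "400 Bad Request";
        }
        leaderWrite(config, true);
        return "201 Created";
    }

    public static void main(String[] args) {
        System.out.println(putConnectorConfig(Map.of("connector.class", "FileStreamSource")));
        System.out.println("validations performed: " + validations);
    }
}
```

With the flag, the example performs a single validation per request; passing `false` instead reproduces the double-validation behavior discussed in this thread.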
> > > > However, we agreed that this optimization doesn't really seem
> > > > necessary for now and can be done later if deemed useful. I'd be
> > > > happy to hear differing thoughts if any, however.
> > > >
> > > > > I think a bound is certainly needed but IMO it shouldn't go
> > > > > beyond 10 mins considering this is just validation
> > > >
> > > > Yeah, I agree that this seems like a fair value - I've elected to
> > > > go with a default value of 10 minutes for the proposed worker
> > > > configuration that sets an upper bound for the timeout query
> > > > parameter.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <yash.ma...@gmail.com> wrote:
> > > >
> > > > > Hi Chris,
> > > > >
> > > > > Thanks again for your feedback. I think a worker configuration
> > > > > for the upper bound makes sense - I initially thought we could
> > > > > hardcode it (just like the current request timeout is), but
> > > > > there's no reason to set another artificial bound that isn't
> > > > > user configurable, which is exactly what we're trying to change
> > > > > here in the first place. I've updated the KIP based on all our
> > > > > discussion above.
> > > > >
> > > > > Thanks,
> > > > > Yash
> > > > >
> > > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <sagarmeansoc...@gmail.com> wrote:
> > > > >
> > > > > > Hey Yash,
> > > > > >
> > > > > > Thanks for the KIP! This looks like a useful feature.
> > > > > >
> > > > > > I think the discussion thread already has some great points by
> > > > > > Chris. Just a couple of points/clarifications =>
> > > > > >
> > > > > > Regarding pt#2, I guess it might be better to forward to the
> > > > > > leader as suggested by Yash. Having said that, why does the
> > > > > > worker forward to the leader?
> > > > > > I am thinking if the worker can perform the validation on its
> > > > > > own, we could let it do the validation instead of forwarding
> > > > > > everything to the leader (even though it might be cheap to
> > > > > > forward all requests to the leader).
> > > > > >
> > > > > > Pt#3 => I think a bound is certainly needed but IMO it
> > > > > > shouldn't go beyond 10 mins considering this is just
> > > > > > validation. We shouldn't end up in a situation where a few
> > > > > > faulty connectors end up blocking a lot of request processing
> > > > > > threads, so while increasing the config is certainly helpful,
> > > > > > we shouldn't set too high a value IMO. Of course I am also open
> > > > > > to suggestions here.
> > > > > >
> > > > > > Thanks!
> > > > > > Sagar.
> > > > > >
> > > > > > On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > > > > >
> > > > > > > Hi Yash,
> > > > > > >
> > > > > > > RE 2: That's a great point about validations already being
> > > > > > > performed by the leader. For completeness's sake, I'd like to
> > > > > > > note that this only holds for valid configurations; invalid
> > > > > > > ones are caught right now before being forwarded to the
> > > > > > > leader. Still, I think it's fine to forward to the leader for
> > > > > > > now and optimize further in the future if necessary. If
> > > > > > > frequent validations are taking place they should be
> > > > > > > conducted via the `PUT
> > > > > > > /connector-plugins/{pluginName}/config/validate` endpoint,
> > > > > > > which won't do any forwarding at all.
> > > > > > >
> > > > > > > RE 3: Yes, those endpoints LGTM. And yes, bounds on the
> > > > > > > timeout also seem reasonable... maybe a low-importance worker
> > > > > > > property could work for that?
> > > > > > > Not sure what would make sense for a default; probably
> > > > > > > somewhere in the 10-60 minute range but would be interested
> > > > > > > in others' thoughts.
> > > > > > >
> > > > > > > Thanks for the clarification on the zombie fencing logic. I
> > > > > > > think we might want to have some more subtle logic around the
> > > > > > > interaction between calls to Admin::fenceProducers and a
> > > > > > > worker-level timeout property if we go that route, but we can
> > > > > > > cross that particular bridge if we get back to it.
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <yash.ma...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Chris,
> > > > > > > >
> > > > > > > > Thanks a lot for the super quick response and the great
> > > > > > > > feedback!
> > > > > > > >
> > > > > > > > 1. I think that makes a lot of sense, and I'd be happy to
> > > > > > > > update the KIP to include this change in the scope. The
> > > > > > > > current behavior where the API response indicates a timeout
> > > > > > > > but the connector is created/updated eventually anyway can
> > > > > > > > be pretty confusing and is generally not a good user
> > > > > > > > experience IMO.
> > > > > > > >
> > > > > > > > 2. Wow, thanks for pointing this out - it's a really good
> > > > > > > > catch and something I hadn't noticed was happening with the
> > > > > > > > current implementation. While I do like the idea of having
> > > > > > > > a query parameter that determines whether validations can
> > > > > > > > be skipped, I'm wondering if it might not be easier and
> > > > > > > > cleaner to just do the leader check earlier and avoid doing
> > > > > > > > the unnecessary config validation on the first worker?
> > > > > > > > Since each config validation happens on its own thread,
> > > > > > > > I'm not so sure about the concern of overloading the leader
> > > > > > > > even on larger clusters, especially since validations
> > > > > > > > aren't typically long-running operations. Furthermore, even
> > > > > > > > with the current implementation, the leader will always be
> > > > > > > > doing a config validation for connector create / update
> > > > > > > > REST API requests on any worker.
> > > > > > > >
> > > > > > > > 3. That's a good point, and this way we can also restrict
> > > > > > > > the APIs whose timeouts are configurable - I'm thinking
> > > > > > > > `PUT /connector-plugins/{pluginName}/config/validate`,
> > > > > > > > `POST /connectors` and `PUT
> > > > > > > > /connectors/{connector}/config` are the ones where such a
> > > > > > > > timeout parameter could be useful. Also, do you think we
> > > > > > > > should enforce some reasonable bounds for the timeout
> > > > > > > > config?
> > > > > > > >
> > > > > > > > On the zombie fencing point, the implication was that the
> > > > > > > > new worker property would not control the timeout used for
> > > > > > > > the call to Admin::fenceProducers. However, if we go with a
> > > > > > > > timeout query parameter approach, even the timeout for the
> > > > > > > > `PUT /connectors/{connector}/fence` endpoint will remain
> > > > > > > > unaffected.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yash
> > > > > > > >
> > > > > > > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> > > > > > > >
> > > > > > > > > Hi Yash,
> > > > > > > > >
> > > > > > > > > Thanks for the KIP. It's a nice, focused change.
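For a concrete picture of how the per-request timeout parameter and the worker-level upper bound discussed in this thread could interact (the thread later settles on a 10 minute default for the bound), here is a hedged sketch; the class and method names and the constants are illustrative assumptions, not anything specified in the KIP:

```java
// Hypothetical sketch: resolving the effective request timeout from an
// optional `timeout` query parameter, clamped to a worker-level maximum.
// The 90 second default and 10 minute cap mirror values mentioned in this
// thread; the names here are made up for illustration.
public final class TimeoutParam {
    static final long DEFAULT_MS = 90_000L;  // current fixed request timeout
    static final long MAX_MS = 600_000L;     // proposed configurable upper bound

    // Returns the default when no parameter was supplied, otherwise the
    // requested value clamped to the configured maximum.
    public static long resolveMs(Long requestedMs) {
        if (requestedMs == null) {
            return DEFAULT_MS;
        }
        if (requestedMs <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        return Math.min(requestedMs, MAX_MS);
    }

    public static void main(String[] args) {
        System.out.println(resolveMs(null));        // no parameter: default applies
        System.out.println(resolveMs(120_000L));    // within the bound
        System.out.println(resolveMs(3_600_000L));  // clamped to the cap
    }
}
```

Clamping over-limit values is just one possible policy; the server could equally reject out-of-range timeouts with a 400 response.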
> > > > > > > > > Initially I was hesitant to support cases where
> > > > > > > > > connector validation takes this long, but considering
> > > > > > > > > the alternative is that we give users a 500 error
> > > > > > > > > response but leave the request to create/modify the
> > > > > > > > > connector queued up in the herder, I think I can get
> > > > > > > > > behind the motivation here. There's also an argument to
> > > > > > > > > be made about keeping Kafka Connect available even when
> > > > > > > > > the systems that it connects to are in a degraded state.
> > > > > > > > >
> > > > > > > > > I have a few alternatives I'd be interested in your
> > > > > > > > > thoughts on:
> > > > > > > > >
> > > > > > > > > 1. Since the primary concern here seems to be that custom
> > > > > > > > > connector validation logic can take too long, do we have
> > > > > > > > > any thoughts on adding logic to check for request timeout
> > > > > > > > > after validation has completed and, if it has, aborting
> > > > > > > > > the attempt to create/modify the connector?
> > > > > > > > >
> > > > > > > > > 2. Right now it's possible that we'll perform two
> > > > > > > > > connector config validations per create/modify request;
> > > > > > > > > once on the worker that initially receives the request,
> > > > > > > > > and then again if that worker is not the leader of the
> > > > > > > > > cluster and has to forward the request to the leader. Any
> > > > > > > > > thoughts on optimizing this to only require a single
> > > > > > > > > validation per request?
> > > > > > > > > We probably wouldn't want to force all validations to
> > > > > > > > > take place on the leader (could lead to overloading it
> > > > > > > > > pretty quickly in large clusters), but we could add an
> > > > > > > > > internal-only query parameter to skip validation and then
> > > > > > > > > use that parameter when forwarding requests from
> > > > > > > > > followers to the leader.
> > > > > > > > >
> > > > > > > > > 3. A worker property is pretty coarse-grained, and
> > > > > > > > > difficult to change. We might allow per-request toggling
> > > > > > > > > of the timeout by adding a URL query parameter like
> > > > > > > > > '?timeout=90s' to the REST API to allow tweaking of the
> > > > > > > > > timeout on a more granular basis, and without having to
> > > > > > > > > perform a worker restart.
> > > > > > > > >
> > > > > > > > > I'd also like to clarify a point about the rejected
> > > > > > > > > alternative "Allow configuring producer zombie fencing
> > > > > > > > > admin request timeout" - is the implication here that the
> > > > > > > > > "rest.api.request.timeout.ms" property will not control
> > > > > > > > > the REST timeout for requests to the `PUT
> > > > > > > > > /connectors/{connector}/fence` endpoint, or just that it
> > > > > > > > > won't control the timeout that we use for the call to
> > > > > > > > > Admin::fenceProducers?
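Chris's first alternative - checking for request timeout after validation completes and aborting the create/modify attempt rather than applying it anyway - could be sketched roughly like this; `validateConfig` and the surrounding scaffolding are hypothetical stand-ins, not Connect framework code:

```java
import java.util.concurrent.*;

// Hedged sketch: run the (possibly slow) validation on a separate thread
// and abandon the create/modify attempt if it exceeds the request
// timeout, instead of applying the change after the client has already
// received an error response. All names here are illustrative.
public final class AbortOnTimeout {
    // Stand-in for a slow Connector::validate call.
    static String validateConfig(long simulatedDurationMs) throws InterruptedException {
        Thread.sleep(simulatedDurationMs);
        return "valid";
    }

    public static String createConnector(long validationMs, long requestTimeoutMs) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Future<String> validation = pool.submit(() -> validateConfig(validationMs));
            // Wait at most the request timeout for validation to finish.
            String result = validation.get(requestTimeoutMs, TimeUnit.MILLISECONDS);
            return "created (" + result + ")";
        } catch (TimeoutException e) {
            // Abort: do NOT go on to write the connector config.
            return "aborted: validation exceeded request timeout";
        } catch (InterruptedException | ExecutionException e) {
            return "failed: " + e.getMessage();
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println(createConnector(10, 1_000));  // fast validation succeeds
        System.out.println(createConnector(2_000, 100)); // slow validation is aborted
    }
}
```

The key design point is that once the timeout fires, the request is dropped entirely rather than left queued in the herder to complete later.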
> > > > > > > > > Cheers,
> > > > > > > > >
> > > > > > > > > Chris
> > > > > > > > >
> > > > > > > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <yash.ma...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > I'd like to start a discussion thread on this small KIP -
> > > > > > > > > > https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
> > > > > > > > > >
> > > > > > > > > > It proposes the addition of a new Kafka Connect worker
> > > > > > > > > > configuration to allow configuring REST API request
> > > > > > > > > > timeouts.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Yash