Hey Yash,

Thanks for the explanation. I think it should be fine to delegate the validation directly to the leader.
Thanks!
Sagar.

On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <yash.ma...@gmail.com> wrote:

> Hi Sagar,
>
> Thanks for chiming in!
>
> > Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader
>
> Only the leader is allowed to perform writes to the config topic, so any request that requires a config topic write ends up being forwarded to the leader. The `POST /connectors` and `PUT /connectors/{connector}/config` endpoints call `Herder::putConnectorConfig`, which internally does a config validation first before initiating another herder request to write to the config topic (in distributed mode). If we want to allow the first worker to do the config validation and then forward the request to the leader just for the write to the config topic, we'd either need something like a skip-validations query parameter on the endpoint, like Chris talks about above, or else a new internal-only endpoint that just does a write to the config topic without any prior config validation. However, we agreed that this optimization doesn't really seem necessary for now and can be done later if deemed useful. I'd be happy to hear differing thoughts if any, however.
>
> > I think a bound is certainly needed but IMO it shouldn't go beyond 10 mins considering this is just validation
>
> Yeah, I agree that this seems like a fair value - I've elected to go with a default value of 10 minutes for the proposed worker configuration that sets an upper bound for the timeout query parameter.
>
> Thanks,
> Yash
>
> On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <yash.ma...@gmail.com> wrote:
>
> > Hi Chris,
> >
> > Thanks again for your feedback. I think a worker configuration for the upper bound makes sense - I initially thought we could hardcode it (just like the current request timeout is), but there's no reason to set another artificial bound that isn't user configurable, which is exactly what we're trying to change here in the first place. I've updated the KIP based on all our discussion above.
> >
> > Thanks,
> > Yash
> >
> > On Thu, Nov 3, 2022 at 11:01 PM Sagar <sagarmeansoc...@gmail.com> wrote:
> >
> >> Hey Yash,
> >>
> >> Thanks for the KIP! This looks like a useful feature.
> >>
> >> I think the discussion thread already has some great points by Chris. Just a couple of points/clarifications =>
> >>
> >> Regarding pt#2, I guess it might be better to forward to the leader as suggested by Yash. Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader (even though it might be cheap to forward all requests to the leader).
> >>
> >> Pt#3 => I think a bound is certainly needed, but IMO it shouldn't go beyond 10 mins considering this is just validation. We shouldn't end up in a situation where a few faulty connectors end up blocking a lot of request processing threads, so while increasing the config is certainly helpful, we shouldn't set too high a value IMO. Of course, I am also open to suggestions here.
> >>
> >> Thanks!
> >> Sagar.
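To make the per-request timeout idea above concrete, here is a rough client-side sketch of a connector config update that asks the worker for a longer timeout on just this one request. The `timeout` query parameter name, its milliseconds format, and the clamping to the proposed 10-minute worker-level cap are assumptions drawn from this discussion, not final KIP details.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ConnectorConfigPutExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical: ask the worker for a 5-minute timeout on this one request.
        // The worker would clamp this to its configured upper bound (proposed default: 10 minutes).
        String url = "http://localhost:8083/connectors/my-connector/config?timeout=300000";

        String connectorConfig = "{"
                + "\"connector.class\": \"org.apache.kafka.connect.file.FileStreamSourceConnector\","
                + "\"topic\": \"test-topic\","
                + "\"file\": \"/tmp/test.txt\""
                + "}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                // Keep the client-side timeout at least as large as the server-side one.
                .timeout(Duration.ofMinutes(6))
                .PUT(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```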
> >>
> >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> >>
> >> > Hi Yash,
> >> >
> >> > RE 2: That's a great point about validations already being performed by the leader. For completeness's sake, I'd like to note that this only holds for valid configurations; invalid ones are caught right now before being forwarded to the leader. Still, I think it's fine to forward to the leader for now and optimize further in the future if necessary. If frequent validations are taking place, they should be conducted via the `PUT /connector-plugins/{pluginName}/config/validate` endpoint, which won't do any forwarding at all.
> >> >
> >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem reasonable... maybe a low-importance worker property could work for that? Not sure what would make sense for a default; probably somewhere in the 10-60 minute range, but would be interested in others' thoughts.
> >> >
> >> > Thanks for the clarification on the zombie fencing logic. I think we might want to have some more subtle logic around the interaction between calls to Admin::fenceProducers and a worker-level timeout property if we go that route, but we can cross that particular bridge if we get back to it.
> >> >
> >> > Cheers,
> >> >
> >> > Chris
> >> >
> >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <yash.ma...@gmail.com> wrote:
> >> >
> >> > > Hi Chris,
> >> > >
> >> > > Thanks a lot for the super quick response and the great feedback!
> >> > >
> >> > > 1. I think that makes a lot of sense, and I'd be happy to update the KIP to include this change in the scope. The current behavior where the API response indicates a timeout but the connector is created/updated eventually anyway can be pretty confusing and is generally not a good user experience IMO.
> >> > >
> >> > > 2. Wow, thanks for pointing this out - it's a really good catch and something I hadn't noticed was happening with the current implementation. While I do like the idea of having a query parameter that determines whether validations can be skipped, I'm wondering if it might not be easier and cleaner to just do the leader check earlier and avoid doing the unnecessary config validation on the first worker? Since each config validation happens on its own thread, I'm not so sure about the concern of overloading the leader even on larger clusters, especially since validations aren't typically long-running operations. Furthermore, even with the current implementation, the leader will always be doing a config validation for connector create/update REST API requests on any worker.
> >> > >
> >> > > 3. That's a good point, and this way we can also restrict the APIs whose timeouts are configurable - I'm thinking `PUT /connector-plugins/{pluginName}/config/validate`, `POST /connectors` and `PUT /connectors/{connector}/config` are the ones where such a timeout parameter could be useful. Also, do you think we should enforce some reasonable bounds for the timeout config?
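As an aside on the "low-importance worker property" idea above, here is a minimal sketch of how such a bounded property could be declared with Kafka's `ConfigDef`. The property name and the 10-minute default are assumptions taken from this thread, not the KIP's final values.

```java
import java.util.concurrent.TimeUnit;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Range;
import org.apache.kafka.common.config.ConfigDef.Type;

public class RestTimeoutConfigSketch {
    // Assumed property name for illustration only; the real name is up to the KIP.
    public static final String MAX_REQUEST_TIMEOUT_MS_CONFIG = "rest.api.request.timeout.max.ms";
    public static final long MAX_REQUEST_TIMEOUT_MS_DEFAULT = TimeUnit.MINUTES.toMillis(10);

    public static ConfigDef configDef() {
        return new ConfigDef()
                .define(MAX_REQUEST_TIMEOUT_MS_CONFIG,
                        Type.LONG,
                        MAX_REQUEST_TIMEOUT_MS_DEFAULT,
                        Range.atLeast(1),
                        Importance.LOW,
                        "Upper bound on the timeout that callers may request via the REST API's "
                                + "timeout query parameter.");
    }
}
```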
> >> > >
> >> > > On the zombie fencing point, the implication was that the new worker property would not control the timeout used for the call to Admin::fenceProducers. However, if we go with a timeout query parameter approach, even the timeout for the `PUT /connectors/{connector}/fence` endpoint will remain unaffected.
> >> > >
> >> > > Thanks,
> >> > > Yash
> >> > >
> >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> >> > >
> >> > > > Hi Yash,
> >> > > >
> >> > > > Thanks for the KIP. It's a nice, focused change. Initially I was hesitant to support cases where connector validation takes this long, but considering the alternative is that we give users a 500 error response but leave the request to create/modify the connector queued up in the herder, I think I can get behind the motivation here. There's also an argument to be made about keeping Kafka Connect available even when the systems that it connects to are in a degraded state.
> >> > > >
> >> > > > I have a few alternatives I'd be interested in your thoughts on:
> >> > > >
> >> > > > 1. Since the primary concern here seems to be that custom connector validation logic can take too long, do we have any thoughts on adding logic to check for request timeout after validation has completed and, if it has, aborting the attempt to create/modify the connector?
> >> > > >
> >> > > > 2. Right now it's possible that we'll perform two connector config validations per create/modify request; once on the worker that initially receives the request, and then again if that worker is not the leader of the cluster and has to forward the request to the leader. Any thoughts on optimizing this to only require a single validation per request? We probably wouldn't want to force all validations to take place on the leader (could lead to overloading it pretty quickly in large clusters), but we could add an internal-only query parameter to skip validation and then use that parameter when forwarding requests from followers to the leader.
> >> > > >
> >> > > > 3. A worker property is pretty coarse-grained, and difficult to change. We might allow per-request toggling of the timeout by adding a URL query parameter like '?timeout=90s' to the REST API to allow tweaking of the timeout on a more granular basis, and without having to perform a worker restart.
> >> > > >
> >> > > > I'd also like to clarify a point about the rejected alternative "Allow configuring producer zombie fencing admin request timeout"--is the implication here that the "rest.api.request.timeout.ms" property will not control the REST timeout for requests to the `PUT /connectors/{connector}/fence` endpoint, or just that it won't control the timeout that we use for the call to Admin::fenceProducers?
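To illustrate the distinction being drawn here between the two timeouts, the sketch below (not Connect's actual internal code) applies a timeout directly to the `Admin::fenceProducers` call via `FenceProducersOptions`, independently of whatever timeout the REST layer puts on the HTTP request. The bootstrap server and transactional ID are placeholders.

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.FenceProducersOptions;

public class FenceProducersTimeoutExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // This timeout governs the admin request used for zombie fencing, and is a
            // separate knob from any timeout the REST layer applies to the HTTP request.
            FenceProducersOptions options = new FenceProducersOptions()
                    .timeoutMs((int) TimeUnit.SECONDS.toMillis(30));

            // Placeholder transactional ID; Connect derives the real IDs internally.
            admin.fenceProducers(Collections.singletonList("example-transactional-id"), options)
                    .all()
                    .get();
        }
    }
}
```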
> >> > > >
> >> > > > Cheers,
> >> > > >
> >> > > > Chris
> >> > > >
> >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <yash.ma...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > I'd like to start a discussion thread on this small KIP -
> >> > > > >
> >> > > > > https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
> >> > > > >
> >> > > > > It proposes the addition of a new Kafka Connect worker configuration to allow configuring REST API request timeouts.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Yash
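Putting the pieces of the discussion together, here is a minimal sketch of how a per-request timeout could be resolved against a worker-level upper bound. The 90-second default mirrors the current fixed REST request timeout; the class and names are illustrative assumptions, not the KIP's final design.

```java
import java.util.Optional;

public class RequestTimeoutResolver {
    // Illustrative values only: 90 seconds mirrors the current fixed REST request timeout,
    // and the upper bound would come from the proposed worker configuration.
    private static final long DEFAULT_TIMEOUT_MS = 90_000L;
    private final long maxTimeoutMs;

    public RequestTimeoutResolver(long maxTimeoutMs) {
        this.maxTimeoutMs = maxTimeoutMs;
    }

    // Use the per-request value when supplied, but never exceed the worker-level cap.
    public long effectiveTimeoutMs(Optional<Long> requestedTimeoutMs) {
        return Math.min(requestedTimeoutMs.orElse(DEFAULT_TIMEOUT_MS), maxTimeoutMs);
    }
}
```

With this shape, a request that supplies no timeout parameter behaves as it does today, while an over-large value is reduced to the configured maximum; whether to clamp or reject such values is itself a design decision for the KIP.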