Hey Yash,

Thanks for the KIP! This looks like a useful feature.

I think the discussion thread already has some great points from Chris. Just
a couple of points/clarifications:

Regarding pt#2, I guess it might be better to forward to the leader as
suggested by Yash. Having said that, why does the worker need to forward to
the leader at all? If the worker can perform the validation on its own, we
could let it do the validation itself instead of forwarding everything to the
leader (even though it might be cheap to forward all requests to the leader).
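
Just to make sure we're talking about the same flow, here is a rough sketch
of what I mean by validating once on the receiving worker and only forwarding
valid configs to the leader with a skip-validation hint. The names here
(Herder, validate, forwardToLeader, skipValidation) are hypothetical and not
the actual Connect classes/methods; this is only to illustrate the idea:

    import java.util.Map;

    public class ConnectorCreateHandler {

        // Hypothetical interface standing in for the real herder; the method
        // names and signatures are made up purely to illustrate the flow.
        interface Herder {
            boolean validate(String connectorName, Map<String, String> config);
            boolean isLeader();
            void applyConfig(String connectorName, Map<String, String> config);
            void forwardToLeader(String connectorName, Map<String, String> config,
                                 boolean skipValidation);
        }

        private final Herder herder;

        public ConnectorCreateHandler(Herder herder) {
            this.herder = herder;
        }

        public void createConnector(String name, Map<String, String> config) {
            // Validate exactly once, on the worker that received the request.
            if (!herder.validate(name, config)) {
                throw new IllegalArgumentException(
                    "Connector config for " + name + " failed validation");
            }
            if (herder.isLeader()) {
                // The receiving worker is the leader; apply directly.
                herder.applyConfig(name, config);
            } else {
                // Forward only configs that passed validation, and tell the
                // leader it can skip re-validating (the internal-only query
                // parameter idea from the thread).
                herder.forwardToLeader(name, config, true);
            }
        }
    }

With something like this, the leader only ever applies configs that have
already passed validation on some worker, so each request is validated once.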

Pt#3 => I think a bound is certainly needed, but IMO it shouldn't go beyond
10 mins considering this is just validation. We shouldn't end up in a
situation where a few faulty connectors block a lot of request-processing
threads, so while being able to increase the timeout is certainly helpful, we
shouldn't allow too high a value. Of course, I am also open to suggestions
here.
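
For reference, here is roughly what a bounded worker property could look like
using Kafka's ConfigDef with a Range validator. The property name is the one
already mentioned in the thread, but the default and bounds below (90 second
default, 10 minute cap) are just placeholders to make the discussion concrete,
not values proposed in the KIP:

    import org.apache.kafka.common.config.ConfigDef;

    public class RestTimeoutConfig {

        // Property name taken from the thread; default and bounds are
        // placeholders for discussion only.
        public static final String REST_REQUEST_TIMEOUT_MS_CONFIG =
                "rest.api.request.timeout.ms";

        public static ConfigDef configDef() {
            return new ConfigDef()
                    .define(REST_REQUEST_TIMEOUT_MS_CONFIG,
                            ConfigDef.Type.LONG,
                            90_000L,                                   // e.g. 90 second default
                            ConfigDef.Range.between(1_000L, 600_000L), // capped at 10 minutes
                            ConfigDef.Importance.LOW,
                            "Maximum time in milliseconds that the worker will wait "
                                + "for REST API requests, including connector config "
                                + "validation, to complete.");
        }
    }

That way, even if someone bumps the timeout, the worker itself caps how long a
single validation request can tie up a REST thread.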

Thanks!
Sagar.

On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <chr...@aiven.io.invalid>
wrote:

> Hi Yash,
>
> RE 2: That's a great point about validations already being performed by the
> leader. For completeness's sake, I'd like to note that this only holds for
> valid configurations; invalid ones are caught right now before being
> forwarded to the leader. Still, I think it's fine to forward to the leader
> for now and optimize further in the future if necessary. If frequent
> validations are taking place they should be conducted via the `PUT
> /connector-plugins/{pluginName}/config/validate` endpoint, which won't do
> any forwarding at all.
>
> RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem
> reasonable... maybe a low-importance worker property could work for that?
> Not sure what would make sense for a default; probably somewhere in the
> 10-60 minute range but would be interested in others' thoughts.
>
> Thanks for the clarification on the zombie fencing logic. I think we might
> want to have some more subtle logic around the interaction between calls to
> Admin::fenceProducers and a worker-level timeout property if we go that
> route, but we can cross that particular bridge if we get back to it.
>
> Cheers,
>
> Chris
>
> On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <yash.ma...@gmail.com> wrote:
>
> > Hi Chris,
> >
> > Thanks a lot for the super quick response and the great feedback!
> >
> > 1. I think that makes a lot of sense, and I'd be happy to update the
> > KIP to include this change in the scope. The current behavior where
> > the API response indicates a time out but the connector is
> > created/updated eventually anyway can be pretty confusing and is
> > generally not a good user experience IMO.
> >
> > 2. Wow, thanks for pointing this out - it's a really good catch and
> > something I hadn't noticed was happening with the current
> > implementation. While I do like the idea of having a query parameter
> > that determines whether validations can be skipped, I'm wondering if
> > it might not be easier and cleaner to just do the leader check earlier
> > and avoid doing the unnecessary config validation on the first worker?
> > Since each config validation happens on its own thread, I'm not so
> > sure about the concern of overloading the leader even on larger
> > clusters, especially since validations aren't typically long running
> > operations. Furthermore, even with the current implementation, the
> > leader will always be doing a config validation for connector create /
> > update REST API requests on any worker.
> >
> > 3. That's a good point, and this way we can also restrict the APIs whose
> > timeouts are configurable - I'm thinking `PUT
> > /connector-plugins/{pluginName}/config/validate`, `POST /connectors` and
> > `PUT /connectors/{connector}/config` are the ones where such a timeout
> > parameter could be useful. Also, do you think we should enforce some
> > reasonable bounds for the timeout config?
> >
> > On the zombie fencing point, the implication was that the new worker
> > property would not control the timeout used for the call to
> > Admin::fenceProducers. However, if we go with a timeout query parameter
> > approach, even the timeout for the `PUT /connectors/{connector}/fence`
> > endpoint will remain unaffected.
> >
> > Thanks,
> > Yash
> >
> > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <chr...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Yash,
> > >
> > > Thanks for the KIP. It's a nice, focused change. Initially I was
> > > hesitant to support cases where connector validation takes this
> > > long, but considering the alternative is that we give users a 500
> > > error response but leave the request to create/modify the connector
> > > queued up in the herder, I think I can get behind the motivation
> > > here. There's also an argument to be made about keeping Kafka
> > > Connect available even when the systems that it connects to are in
> > > a degraded state.
> > >
> > > I have a few alternatives I'd be interested in your thoughts on:
> > >
> > > 1. Since the primary concern here seems to be that custom connector
> > > validation logic can take too long, do we have any thoughts on
> > > adding logic to check for request timeout after validation has
> > > completed and, if it has, aborting the attempt to create/modify the
> > > connector?
> > >
> > > 2. Right now it's possible that we'll perform two connector config
> > > validations per create/modify request; once on the worker that
> > > initially receives the request, and then again if that worker is not
> > > the leader of the cluster and has to forward the request to the
> > > leader. Any thoughts on optimizing this to only require a single
> > > validation per request? We probably wouldn't want to force all
> > > validations to take place on the leader (could lead to overloading
> > > it pretty quickly in large clusters), but we could add an
> > > internal-only query parameter to skip validation and then use that
> > > parameter when forwarding requests from followers to the leader.
> > >
> > > 3. A worker property is pretty coarse-grained, and difficult to
> > > change. We might allow per-request toggling of the timeout by adding
> > > a URL query parameter like '?timeout=90s' to the REST API to allow
> > > tweaking of the timeout on a more granular basis, and without having
> > > to perform a worker restart.
> > >
> > > I'd also like to clarify a point about the rejected alternative
> > > "Allow configuring producer zombie fencing admin request
> > > timeout"--is the implication here that the
> > > "rest.api.request.timeout.ms" property will not control the REST
> > > timeout for requests to the 'PUT /connectors/{connector}/fence'
> > > endpoint, or just that it won't control the timeout that we use for
> > > the call to Admin::fenceProducers?
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <yash.ma...@gmail.com>
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to start a discussion thread on this small KIP -
> > > >
> > > > https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
> > > >
> > > > It proposes the addition of a new Kafka Connect worker
> > > > configuration to allow configuring REST API request timeouts.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > >
> >
>
