Hey Yash,

Thanks for the explanation. I think it should be fine to delegate the validation directly to the leader.
Thanks!
Sagar.

On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <yash.ma...@gmail.com> wrote:

> Hi Sagar,
>
> Thanks for chiming in!
>
> > Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader
>
> Only the leader is allowed to perform writes to the config topic, so any request that requires a config topic write ends up being forwarded to the leader. The `POST /connectors` and `PUT /connectors/{connector}/config` endpoints call `Herder::putConnectorConfig`, which internally does a config validation first before initiating another herder request to write to the config topic (in distributed mode). If we want to allow the first worker to do the config validation and then forward the request to the leader just for the write to the config topic, we'd either need something like a skip-validations query parameter on the endpoint, like Chris talks about above, or else a new internal-only endpoint that just does a write to the config topic without any prior config validation. However, we agreed that this optimization doesn't really seem necessary for now and can be done later if deemed useful. I'd be happy to hear differing thoughts if any, however.
>
> > I think a bound is certainly needed but IMO it shouldn't go beyond 10 mins considering this is just validation
>
> Yeah, I agree that this seems like a fair value - I've elected to go with a default value of 10 minutes for the proposed worker configuration that sets an upper bound for the timeout query parameter.
>
> Thanks,
> Yash
>
> On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <yash.ma...@gmail.com> wrote:
>
> > Hi Chris,
> >
> > Thanks again for your feedback. I think a worker configuration for the upper bound makes sense - I initially thought we could hardcode it (just like the current request timeout is), but there's no reason to set another artificial bound that isn't user configurable, which is exactly what we're trying to change here in the first place. I've updated the KIP based on all our discussion above.
> >
> > Thanks,
> > Yash
> >
> > On Thu, Nov 3, 2022 at 11:01 PM Sagar <sagarmeansoc...@gmail.com> wrote:
> >
> >> Hey Yash,
> >>
> >> Thanks for the KIP! This looks like a useful feature.
> >>
> >> I think the discussion thread already has some great points by Chris. Just a couple of points/clarifications =>
> >>
> >> Regarding pt#2, I guess it might be better to forward to the leader as suggested by Yash. Having said that, why does the worker forward to the leader? I am thinking if the worker can perform the validation on its own, we could let it do the validation instead of forwarding everything to the leader (even though it might be cheap to forward all requests to the leader).
> >>
> >> Pt#3 => I think a bound is certainly needed, but IMO it shouldn't go beyond 10 mins considering this is just validation. We shouldn't end up in a situation where a few faulty connectors end up blocking a lot of request processing threads, so while increasing the config is certainly helpful, we shouldn't set too high a value IMO. Of course, I am also open to suggestions here.
> >>
> >> Thanks!
> >> Sagar.
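To make the per-request timeout idea above concrete, here is a rough client-side sketch of a connector config update that asks the worker for a longer timeout on just this one request. The `timeout` query parameter name, its milliseconds format, and the clamping to the proposed 10-minute worker-level cap are assumptions drawn from this discussion, not final KIP details.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class ConnectorConfigPutExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical: ask the worker for a 5-minute timeout on this one request.
        // The worker would clamp this to its configured upper bound (proposed default: 10 minutes).
        String url = "http://localhost:8083/connectors/my-connector/config?timeout=300000";

        String connectorConfig = "{"
                + "\"connector.class\": \"org.apache.kafka.connect.file.FileStreamSourceConnector\","
                + "\"topic\": \"test-topic\","
                + "\"file\": \"/tmp/test.txt\""
                + "}";

        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                // Keep the client-side timeout at least as large as the server-side one.
                .timeout(Duration.ofMinutes(6))
                .PUT(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```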
> >>
> >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> >>
> >> > Hi Yash,
> >> >
> >> > RE 2: That's a great point about validations already being performed by the leader. For completeness's sake, I'd like to note that this only holds for valid configurations; invalid ones are caught right now before being forwarded to the leader. Still, I think it's fine to forward to the leader for now and optimize further in the future if necessary. If frequent validations are taking place, they should be conducted via the `PUT /connector-plugins/{pluginName}/config/validate` endpoint, which won't do any forwarding at all.
> >> >
> >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also seem reasonable... maybe a low-importance worker property could work for that? Not sure what would make sense for a default; probably somewhere in the 10-60 minute range, but would be interested in others' thoughts.
> >> >
> >> > Thanks for the clarification on the zombie fencing logic. I think we might want to have some more subtle logic around the interaction between calls to Admin::fenceProducers and a worker-level timeout property if we go that route, but we can cross that particular bridge if we get back to it.
> >> >
> >> > Cheers,
> >> >
> >> > Chris
> >> >
> >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <yash.ma...@gmail.com> wrote:
> >> >
> >> > > Hi Chris,
> >> > >
> >> > > Thanks a lot for the super quick response and the great feedback!
> >> > >
> >> > > 1. I think that makes a lot of sense, and I'd be happy to update the KIP to include this change in the scope. The current behavior where the API response indicates a timeout but the connector is created/updated eventually anyway can be pretty confusing and is generally not a good user experience IMO.
> >> > >
> >> > > 2. Wow, thanks for pointing this out - it's a really good catch and something I hadn't noticed was happening with the current implementation. While I do like the idea of having a query parameter that determines whether validations can be skipped, I'm wondering if it might not be easier and cleaner to just do the leader check earlier and avoid doing the unnecessary config validation on the first worker? Since each config validation happens on its own thread, I'm not so sure about the concern of overloading the leader even on larger clusters, especially since validations aren't typically long-running operations. Furthermore, even with the current implementation, the leader will always be doing a config validation for connector create/update REST API requests on any worker.
> >> > >
> >> > > 3. That's a good point, and this way we can also restrict the APIs whose timeouts are configurable - I'm thinking `PUT /connector-plugins/{pluginName}/config/validate`, `POST /connectors` and `PUT /connectors/{connector}/config` are the ones where such a timeout parameter could be useful. Also, do you think we should enforce some reasonable bounds for the timeout config?
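As an aside on the "low-importance worker property" idea above, here is a minimal sketch of how such a bounded property could be declared with Kafka's `ConfigDef`. The property name and the 10-minute default are assumptions taken from this thread, not the KIP's final values.

```java
import java.util.concurrent.TimeUnit;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Range;
import org.apache.kafka.common.config.ConfigDef.Type;

public class RestTimeoutConfigSketch {
    // Assumed property name for illustration only; the real name is up to the KIP.
    public static final String MAX_REQUEST_TIMEOUT_MS_CONFIG = "rest.api.request.timeout.max.ms";
    public static final long MAX_REQUEST_TIMEOUT_MS_DEFAULT = TimeUnit.MINUTES.toMillis(10);

    public static ConfigDef configDef() {
        return new ConfigDef()
                .define(MAX_REQUEST_TIMEOUT_MS_CONFIG,
                        Type.LONG,
                        MAX_REQUEST_TIMEOUT_MS_DEFAULT,
                        Range.atLeast(1),
                        Importance.LOW,
                        "Upper bound on the timeout that callers may request via the REST API's "
                                + "timeout query parameter.");
    }
}
```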
> >> > >
> >> > > On the zombie fencing point, the implication was that the new worker property would not control the timeout used for the call to Admin::fenceProducers. However, if we go with a timeout query parameter approach, even the timeout for the `PUT /connectors/{connector}/fence` endpoint will remain unaffected.
> >> > >
> >> > > Thanks,
> >> > > Yash
> >> > >
> >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <chr...@aiven.io.invalid> wrote:
> >> > >
> >> > > > Hi Yash,
> >> > > >
> >> > > > Thanks for the KIP. It's a nice, focused change. Initially I was hesitant to support cases where connector validation takes this long, but considering the alternative is that we give users a 500 error response but leave the request to create/modify the connector queued up in the herder, I think I can get behind the motivation here. There's also an argument to be made about keeping Kafka Connect available even when the systems that it connects to are in a degraded state.
> >> > > >
> >> > > > I have a few alternatives I'd be interested in your thoughts on:
> >> > > >
> >> > > > 1. Since the primary concern here seems to be that custom connector validation logic can take too long, do we have any thoughts on adding logic to check for request timeout after validation has completed and, if it has, aborting the attempt to create/modify the connector?
> >> > > >
> >> > > > 2. Right now it's possible that we'll perform two connector config validations per create/modify request; once on the worker that initially receives the request, and then again if that worker is not the leader of the cluster and has to forward the request to the leader. Any thoughts on optimizing this to only require a single validation per request? We probably wouldn't want to force all validations to take place on the leader (could lead to overloading it pretty quickly in large clusters), but we could add an internal-only query parameter to skip validation and then use that parameter when forwarding requests from followers to the leader.
> >> > > >
> >> > > > 3. A worker property is pretty coarse-grained, and difficult to change. We might allow per-request toggling of the timeout by adding a URL query parameter like '?timeout=90s' to the REST API to allow tweaking of the timeout on a more granular basis, and without having to perform a worker restart.
> >> > > >
> >> > > > I'd also like to clarify a point about the rejected alternative "Allow configuring producer zombie fencing admin request timeout"--is the implication here that the "rest.api.request.timeout.ms" property will not control the REST timeout for requests to the `PUT /connectors/{connector}/fence` endpoint, or just that it won't control the timeout that we use for the call to Admin::fenceProducers?
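To illustrate the distinction being drawn here between the two timeouts, the sketch below (not Connect's actual internal code) applies a timeout directly to the `Admin::fenceProducers` call via `FenceProducersOptions`, independently of whatever timeout the REST layer puts on the HTTP request. The bootstrap server and transactional ID are placeholders.

```java
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.FenceProducersOptions;

public class FenceProducersTimeoutExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // This timeout governs the admin request used for zombie fencing, and is a
            // separate knob from any timeout the REST layer applies to the HTTP request.
            FenceProducersOptions options = new FenceProducersOptions()
                    .timeoutMs((int) TimeUnit.SECONDS.toMillis(30));

            // Placeholder transactional ID; Connect derives the real IDs internally.
            admin.fenceProducers(Collections.singletonList("example-transactional-id"), options)
                    .all()
                    .get();
        }
    }
}
```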
> >> > > >
> >> > > > Cheers,
> >> > > >
> >> > > > Chris
> >> > > >
> >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <yash.ma...@gmail.com> wrote:
> >> > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > I'd like to start a discussion thread on this small KIP -
> >> > > > >
> >> > > > > https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
> >> > > > >
> >> > > > > It proposes the addition of a new Kafka Connect worker configuration to allow configuring REST API request timeouts.
> >> > > > >
> >> > > > > Thanks,
> >> > > > > Yash
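Putting the pieces of the discussion together, here is a minimal sketch of how a per-request timeout could be resolved against a worker-level upper bound. The 90-second default mirrors the current fixed REST request timeout; the class and names are illustrative assumptions, not the KIP's final design.

```java
import java.util.Optional;

public class RequestTimeoutResolver {
    // Illustrative values only: 90 seconds mirrors the current fixed REST request timeout,
    // and the upper bound would come from the proposed worker configuration.
    private static final long DEFAULT_TIMEOUT_MS = 90_000L;
    private final long maxTimeoutMs;

    public RequestTimeoutResolver(long maxTimeoutMs) {
        this.maxTimeoutMs = maxTimeoutMs;
    }

    // Use the per-request value when supplied, but never exceed the worker-level cap.
    public long effectiveTimeoutMs(Optional<Long> requestedTimeoutMs) {
        return Math.min(requestedTimeoutMs.orElse(DEFAULT_TIMEOUT_MS), maxTimeoutMs);
    }
}
```

With this shape, a request that supplies no timeout parameter behaves as it does today, while an over-large value is reduced to the configured maximum; whether to clamp or reject such values is itself a design decision for the KIP.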