Hi Sagar,

Thanks for chiming in!

> Having said that, why does the worker forward to the
> leader? I am thinking if the worker can perform the validation on its
> own, we could let it do the validation instead of forwarding
> everything to the leader

Only the leader is allowed to write to the config topic, so any request
that requires a config topic write ends up being forwarded to the
leader. The `POST /connectors` and `PUT /connectors/{connector}/config`
endpoints call `Herder::putConnectorConfig`, which performs a config
validation before initiating another herder request to write to the
config topic (in distributed mode). If we wanted the first worker to do
the config validation and then forward the request to the leader just
for the config topic write, we'd need either a skip-validations query
parameter on the endpoint, as Chris describes above, or a new
internal-only endpoint that writes to the config topic without any
prior config validation. However, we agreed that this optimization
doesn't seem necessary for now and can be done later if it proves
useful. That said, I'd be happy to hear any differing thoughts.
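
To make that flow a bit more concrete, here's a rough sketch of the
create/update path as I understand it. This is purely illustrative - it
isn't the actual `DistributedHerder` code, and all of the helper names
below are made up:

```java
import java.util.Map;

// Purely illustrative sketch of the current create/update flow; not the
// real DistributedHerder implementation, and every method name here is
// made up.
abstract class ConnectorConfigFlowSketch {

    void putConnectorConfig(String connName, Map<String, String> config) {
        // 1. The worker that received the REST request validates the
        //    config. This is the potentially slow, plugin-defined step.
        if (!validateConnectorConfig(connName, config)) {
            throw new IllegalArgumentException("Connector config failed validation");
        }

        // 2. Only the leader may write to the config topic, so a follower
        //    forwards the request to the leader, which runs the same
        //    validation again before persisting - hence the duplicate
        //    validation discussed earlier in this thread.
        if (!isLeader()) {
            forwardRequestToLeader("PUT /connectors/" + connName + "/config", config);
            return;
        }

        // 3. The leader writes the validated config to the config topic.
        writeToConfigTopic(connName, config);
    }

    abstract boolean validateConnectorConfig(String connName, Map<String, String> config);
    abstract boolean isLeader();
    abstract void forwardRequestToLeader(String path, Map<String, String> body);
    abstract void writeToConfigTopic(String connName, Map<String, String> config);
}
```

Again, the names above are placeholders - the point is just that step 1
runs on whichever worker receives the request, and step 2 is why
anything that needs a config topic write ends up on the leader.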

> I think a bound is certainly needed but IMO it shouldn't go beyond
> 10 mins considering this is just validation

Yeah, I agree that this seems like a fair value - I've elected to go with a
default value of 10 minutes for the proposed worker configuration that sets
an upper bound for the timeout query parameter.
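
To illustrate roughly how I expect the bound to interact with the
timeout query parameter, here's a minimal sketch. The property name,
the clamping behaviour, and the class below are all placeholders of my
own rather than the exact wording in the KIP:

```java
import java.util.concurrent.TimeUnit;

// Illustrative sketch only; names, defaults, and clamping behaviour are
// placeholders rather than what the KIP finally specifies.
final class RequestTimeoutSketch {

    // The existing default REST request timeout (currently 90 seconds).
    static final long DEFAULT_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(90);

    // Hypothetical worker config (e.g. something like
    // "rest.api.request.timeout.max.ms"), defaulting to 10 minutes.
    static final long MAX_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(10);

    // Resolve the effective timeout for a request such as
    // PUT /connectors/{connector}/config?timeout=300000
    static long effectiveTimeoutMs(Long requestedTimeoutMs) {
        if (requestedTimeoutMs == null) {
            return DEFAULT_TIMEOUT_MS; // no query parameter supplied
        }
        // Never let a single request exceed the worker-level upper bound.
        return Math.min(requestedTimeoutMs, MAX_TIMEOUT_MS);
    }
}
```

(Whether an over-large value is clamped like this or rejected outright
is an implementation detail - the sketch is only meant to show how the
new worker configuration and the query parameter relate.)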

Thanks,
Yash

On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <yash.ma...@gmail.com> wrote:

> Hi Chris,
>
> Thanks again for your feedback. I think a worker configuration for the
> upper bound makes sense - I initially thought we could hardcode it (just
> like the current request timeout is), but there's no reason to set another
> artificial bound that isn't user-configurable, which is exactly what we're
> trying to change here in the first place. I've updated the KIP based on all
> our discussion above.
>
> Thanks,
> Yash
>
> On Thu, Nov 3, 2022 at 11:01 PM Sagar <sagarmeansoc...@gmail.com> wrote:
>
>> Hey Yash,
>>
>> Thanks for the KIP! This looks like a useful feature.
>>
>> I think the discussion thread already has some great points by Chris. Just
>> a couple of points/clarifications=>
>>
>> Regarding pt#2, I guess it might be better to forward to the leader as
>> suggested by Yash. Having said that, why does the worker forward to the
>> leader? I am thinking if the worker can perform the validation on its
>> own, we could let it do the validation instead of forwarding everything
>> to the leader (even though it might be cheap to forward all requests to
>> the leader).
>>
>> Pt#3 => I think a bound is certainly needed but IMO it shouldn't go beyond
>> 10 mins considering this is just validation. We shouldn't end up in a
>> situation where a few faulty connectors end up blocking a lot of request
>> processing threads, so while increasing the config is certainly helpful,
>> we
>> shouldn't set too high a value IMO. Of course I am also open to
>> suggestions
>> here.
>>
>> Thanks!
>> Sagar.
>>
>> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton <chr...@aiven.io.invalid>
>> wrote:
>>
>> > Hi Yash,
>> >
>> > RE 2: That's a great point about validations already being performed by
>> the
>> > leader. For completeness's sake, I'd like to note that this only holds
>> for
>> > valid configurations; invalid ones are caught right now before being
>> > forwarded to the leader. Still, I think it's fine to forward to the
>> leader
>> > for now and optimize further in the future if necessary. If frequent
>> > validations are taking place they should be conducted via the `PUT
>> > /connector-plugins/{pluginName}/config/validate` endpoint, which won't
>> do
>> > any forwarding at all.
>> >
>> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the timeout also
>> seem
>> > reasonable... maybe a low-importance worker property could work for
>> that?
>> > Not sure what would make sense for a default; probably somewhere in the
>> > 10-60 minute range but would be interested in others' thoughts.
>> >
>> > Thanks for the clarification on the zombie fencing logic. I think we
>> might
>> > want to have some more subtle logic around the interaction between
>> calls to
>> > Admin::fenceProducers and a worker-level timeout property if we go that
>> > route, but we can cross that particular bridge if we get back to it.
>> >
>> > Cheers,
>> >
>> > Chris
>> >
>> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <yash.ma...@gmail.com> wrote:
>> >
>> > > Hi Chris,
>> > >
>> > > Thanks a lot for the super quick response and the great feedback!
>> > >
>> > > 1. I think that makes a lot of sense, and I'd be happy to update the
>> KIP
>> > to
>> > > include this change in the scope. The current behavior where the API
>> > > response indicates a timeout but the connector is created/updated
>> > > eventually anyway can be pretty confusing and is generally not a good
>> > user
>> > > experience IMO.
>> > >
>> > > 2. Wow, thanks for pointing this out - it's a really good catch and
>> > > something I hadn't noticed was happening with the current
>> implementation.
>> > > While I do like the idea of having a query parameter that determines
>> > > whether validations can be skipped, I'm wondering if it might not be
>> > easier
>> > > and cleaner to just do the leader check earlier and avoid doing the
>> > > unnecessary config validation on the first worker? Since each config
>> > > validation happens on its own thread, I'm not so sure about the
>> concern
>> > of
>> > > overloading the leader even on larger clusters, especially since
>> > > validations aren't typically long running operations. Furthermore,
>> even
>> > > with the current implementation, the leader will always be doing a
>> config
>> > > validation for connector create / update REST API requests on any
>> worker.
>> > >
>> > > 3. That's a good point, and this way we can also restrict the APIs
>> whose
>> > > timeouts are configurable - I'm thinking `PUT
>> > > /connector-plugins/{pluginName}/config/validate`, `POST /connectors`
>> and
>> > > `PUT /connectors/{connector}/config` are the ones where such a timeout
>> > > parameter could be useful. Also, do you think we should enforce some
>> > > reasonable bounds for the timeout config?
>> > >
>> > > On the zombie fencing point, the implication was that the new worker
>> > > property would not control the timeout used for the call to
>> > > Admin::fenceProducers. However, if we go with a timeout query
>> parameter
>> > > approach, even the timeout for the `PUT /connectors/{connector}/fence'
>> > > endpoint will remain unaffected.
>> > >
>> > > Thanks,
>> > > Yash
>> > >
>> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton <chr...@aiven.io.invalid>
>> > > wrote:
>> > >
>> > > > Hi Yash,
>> > > >
>> > > > Thanks for the KIP. It's a nice, focused change. Initially I was
>> > hesitant
>> > > > to support cases where connector validation takes this long, but
>> > > > considering the alternative is that we give users a 500 error
>> response
>> > > but
>> > > > leave the request to create/modify the connector queued up in the
>> > > herder, I
>> > > > think I can get behind the motivation here. There's also an
>> argument to
>> > > be
>> > > > made about keeping Kafka Connect available even when the systems
>> that
>> > it
>> > > > connects to are in a degraded state.
>> > > >
>> > > > I have a few alternatives I'd be interested in your thoughts on:
>> > > >
>> > > > 1. Since the primary concern here seems to be that custom connector
>> > > > validation logic can take too long, do we have any thoughts on
>> adding
>> > > logic
>> > > > to check for request timeout after validation has completed and, if
>> it
>> > > has,
>> > > > aborting the attempt to create/modify the connector?
>> > > >
>> > > > 2. Right now it's possible that we'll perform two connector config
>> > > > validations per create/modify request; once on the worker that
>> > initially
>> > > > receives the request, and then again if that worker is not the
>> leader
>> > of
>> > > > the cluster and has to forward the request to the leader. Any
>> thoughts
>> > on
>> > > > optimizing this to only require a single validation per request? We
>> > > > probably wouldn't want to force all validations to take place on the
>> > > leader
>> > > > (could lead to overloading it pretty quickly in large clusters),
>> but we
>> > > > could add an internal-only query parameter to skip validation and
>> then
>> > > use
>> > > > that parameter when forwarding requests from followers to the
>> leader.
>> > > >
>> > > > 3. A worker property is pretty coarse-grained, and difficult to
>> change.
>> > > We
>> > > > might allow per-request toggling of the timeout by adding a URL
>> query
>> > > > parameter like '?timeout=90s' to the REST API to allow tweaking of
>> the
>> > > > timeout on a more granular basis, and without having to perform a
>> > worker
>> > > > restart.
>> > > >
>> > > > I'd also like to clarify a point about the rejected alternative
>> "Allow
>> > > > configuring producer zombie fencing admin request timeout"--is the
>> > > > implication here that the "rest.api.request.timeout.ms" property
>> will
>> > > not
>> > > > control the REST timeout for requests to the 'PUT
>> > > > /connectors/{connector}/fence' endpoint, or just that it won't
>> control
>> > > the
>> > > > timeout that we use for the call to Admin::fenceProducers?
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Chris
>> > > >
>> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <yash.ma...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I'd like to start a discussion thread on this small KIP -
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
>> > > > >
>> > > > > It proposes the addition of a new Kafka Connect worker
>> configuration
>> > to
>> > > > > allow configuring REST API request timeouts.
>> > > > >
>> > > > > Thanks,
>> > > > > Yash
>> > > > >
>> > > >
>> > >
>> >
>>
>
