Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Yash Mayya Wed, 08 Mar 2023 07:21:12 -0800

Hi Greg,

Thanks for the response!


1. Hm that's a fair point, and while there aren't any strict guidelines or
rules governing whether something should be a query parameter versus a
request header, I agree that it might be more idiomatic for a request
timeout value to be accepted via a custom request header. What do you
think about using a header name like "X-Request-Timeout", or maybe simply
"Request-Timeout" if we want to take into account
https://www.rfc-editor.org/rfc/rfc6648?

2. 3. 4. Adding asynchronous validations via new endpoints / new request
parameters on existing endpoints is definitely an interesting idea but I'm
not sure whether this KIP should take into consideration a hypothetical
future proposal? If there were already such a concrete proposal, I would
definitely agree that this KIP should take that into consideration - but
that isn't the case today. Furthermore, I'm not sure I understand why the
addition of request timeout configurability to the existing endpoints would
preclude the introduction of an asynchronous validation API in the future?

Thanks,
Yash

On Sat, Mar 4, 2023 at 1:13 AM Greg Harris <greg.har...@aiven.io.invalid>
wrote:

> Yash,
>
> 1.
> Currently the request parameters in the REST API are individual and pertain
> to just one endpoint.
> They also change the content of the query result, or change the action
> taken on the cluster.
> I think that a request timeout is a property of the HTTP request more than
> it is a property of the cluster query or cluster action.
> The two solutions have very similar tradeoffs, but I'm interested in
> whether one is more idiomatic and more obvious to users.
>
> 2.
> I understand that only these three endpoints are in need of increased
> timeouts at this time due to long connector validations.
> From another perspective, this change is making the API more irregular and
> someone in the future might choose to make it more regular by standardizing
> the configurable timeout functionality.
> I wouldn't (in this KIP) dismiss someone's desire to configure other
> timeouts in the future (in another KIP), and design them into a corner.
> It is acceptable to limit the scope of this change to just the three
> endpoints due to practical reasons, but I don't think that should prevent
> us from trying to ensure that this design fits in the "end goal" state of
> the Connect service.
>
> 3. 4.
> I am not suggesting an incompatible change, as the current synchronous
> behavior is still a useful API for certain situations. I think that it is
> possible to add asynchronous validations in a backwards compatible way,
> using new endpoints or other new request parameters.
> The interface could be designed such that users with connectors that exceed
> the synchronous timeouts can utilize the asynchronous API. Tooling can use
> the asynchronous API when it is available, and fall back to the synchronous
> API when it is not.
> I think that it also may be more in-line with the design of the rest of the
> REST API, where nearly every other request is asynchronous. That's why
> you're only targeting these three endpoints, they're the only ones with a
> synchronicity constraint.
> Again, I'm not necessarily saying that you must implement such an
> asynchronous validation scheme in this KIP, but we should consider if that
> is a more extensible solution. If we decided to implement configurable
> synchronous timeouts now, how would that complement an asynchronous API in
> the future?
>
> On Thu, Mar 2, 2023 at 10:00 PM Yash Mayya <yash.ma...@gmail.com> wrote:
>
> > Hi Greg,
> >
> > Thanks for taking a look!
> >
> > 1. I believe Chris suggested the use of a query parameter above as we
> > already have precedents for using query parameters to configure per
> request
> > behavior in Kafka Connect (the restart connectors API and the get
> > connectors API for instance). Also, the limited choice of endpoints
> > targeted is intentional (see my reply to the next point).
> >
> > 2. I intentionally targeted just the three listed endpoints where
> > synchronous connector config validations come into the picture. This is
> > because of the legitimate cases where config validation for specific
> > connector plugins might exceed the default request timeout in edge case
> > scenarios (outlined in the KIP's motivation section). Other Connect REST
> > endpoints shouldn't be taking longer than the default 90 second request
> > timeout; if they do so, it would either be indicative of a bug in the
> > Connect framework or a cluster health issue - neither of which should be
> > covered up by manually setting a longer request timeout.
> >
> > 3. 4. I think changing the config validation behavior would be a backward
> > incompatible change and I wanted to avoid that in this particular KIP.
> > There are multiple programmatic UIs out there which rely on the current
> > synchronous config validation behavior and breaking the existing contract
> > would definitely require a larger discussion.
> >
> > Thanks,
> > Yash
> >
> > On Fri, Mar 3, 2023 at 12:04 AM Greg Harris <greg.har...@aiven.io.invalid
> >
> > wrote:
> >
> > > Hey Yash,
> > >
> > > Thanks for the KIP, and sorry for the late review.
> > >
> > > 1. Have you considered a HTTP header to provide the client-configurable
> > > timeout? A header might more naturally extend to all of the other
> > endpoints
> > > in the future, rather than duplicating the query parameter across
> > > endpoints.
> > >
> > > 2. I understand that this change is targeted at just long duration
> > > Connector::validation calls, is that due to voluntary scope
> constraints?
> > > Implementing configurable timeouts for all endpoints in a uniform way
> > could
> > > be desirable, even if the default timeout will work for nearly all of
> the
> > > other calls.
> > >
> > > 3. Did you consider adding asynchronous validation as a user-facing
> > > feature? I think that relying on the synchronous validation results in
> a
> > > single HTTP request is a bit of an anti-pattern, and one that we've
> > > inherited from the original REST design. It seems useful when using the
> > > REST API by hand, but seems to be a liability when used in environments
> > > with an external management layer.
> > >
> > > 4. Perhaps rather than allowing synchronous calls to Connector:validate
> > to
> > > increase in duration, we should provide a way for connectors to surface
> > > validation problems later in their lifecycle. Currently there is the
> > > ConnectorContext::raiseError that transitions the task to FAILED,
> > perhaps a
> > > similar API could asynchronously emit re-validation results. We've also
> > had
> > > a problem with long start() duration for the same reasons as
> validate().
> > >
> > > I understand if you want to keep this KIP as tightly focused as
> possible,
> > > but i'm worried that it is addressing the symptom and not the problem.
> I
> > > want to make sure that this change is impactful and isn't obsoleted by
> a
> > > later improvement.
> > >
> > > Thanks,
> > > Greg
> > >
> > >
> > > On Wed, Mar 1, 2023 at 5:07 AM Yash Mayya <yash.ma...@gmail.com>
> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Thanks for all the feedback and discussion. I've renamed the KIP
> title
> > to
> > > > "Kafka Connect REST API timeout improvements" since we're
> introducing a
> > > > couple of improvements (cancelling create / update connector requests
> > > when
> > > > config validations timeout and avoiding double config validations in
> > > > distributed mode) along with making the request timeouts
> configurable.
> > > The
> > > > new KIP link is
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-882%3A+Kafka+Connect+REST+API+timeout+improvements
> > > > and I've called for a vote for the KIP here -
> > > > https://lists.apache.org/thread/mgx8lczx2f57pk7x3vh0nqk00s79grgp.
> > > >
> > > > Thanks,
> > > > Yash
> > > >
> > > > On Sat, Nov 5, 2022 at 11:42 PM Sagar <sagarmeansoc...@gmail.com>
> > wrote:
> > > >
> > > > > Hey Yash,
> > > > >
> > > > > Thanks for the explanation. I think it should be fine to delegate
> the
> > > > > validation directly to the leader.
> > > > >
> > > > > Thanks!
> > > > > Sagar.
> > > > >
> > > > > On Sat, Nov 5, 2022 at 10:42 AM Yash Mayya <yash.ma...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Sagar,
> > > > > >
> > > > > > Thanks for chiming in!
> > > > > >
> > > > > > > Having said that, why does the worker forward to the
> > > > > > > leader? I am thinking if the worker can perform the validation
> on
> > > > it's
> > > > > > own,
> > > > > > > we could let it do the validation instead of forwarding
> > everything
> > > to
> > > > > the
> > > > > > > leader
> > > > > >
> > > > > > Only the leader is allowed to perform writes to the config topic,
> > so
> > > > any
> > > > > > request that requires a config topic write ends up being
> forwarded
> > to
> > > > the
> > > > > > leader. The `POST /connectors` and `PUT
> > > /connectors/{connector}/config`
> > > > > > endpoints call `Herder::putConnectorConfig` which internally
> does a
> > > > > config
> > > > > > validation first before initiating another herder request to
> write
> > to
> > > > the
> > > > > > config topic (in distributed mode). If we want to allow the first
> > > > worker
> > > > > to
> > > > > > do the config validation, and then forward the request to the
> > leader
> > > > just
> > > > > > for the write to the config topic, we'd either need something
> like
> > a
> > > > skip
> > > > > > validations query parameter on the endpoint like Chris talks
> about
> > > > above
> > > > > or
> > > > > > else a new internal only endpoint that just does a write to the
> > > config
> > > > > > topic without any prior config validation. However, we agreed
> that
> > > this
> > > > > > optimization doesn't really seem necessary for now and can be
> done
> > > > later
> > > > > if
> > > > > > deemed useful. I'd be happy to hear differing thoughts if any,
> > > however.
> > > > > >
> > > > > > > I think a bound is certainly needed but IMO it shouldn't go
> > beyond
> > > > > > > 10 mins considering this is just validation
> > > > > >
> > > > > > Yeah, I agree that this seems like a fair value - I've elected to
> > go
> > > > > with a
> > > > > > default value of 10 minutes for the proposed worker configuration
> > > that
> > > > > sets
> > > > > > an upper bound for the timeout query parameter.
> > > > > >
> > > > > > Thanks,
> > > > > > Yash
> > > > > >
> > > > > > On Sat, Nov 5, 2022 at 10:30 AM Yash Mayya <yash.ma...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > Thanks again for your feedback. I think a worker configuration
> > for
> > > > the
> > > > > > > upper bound makes sense - I initially thought we could hardcode
> > it
> > > > > (just
> > > > > > > like the current request timeout is), but there's no reason to
> > set
> > > > > > another
> > > > > > > artificial bound that isn't user configurable which is exactly
> > what
> > > > > we're
> > > > > > > trying to change here in the first place. I've updated the KIP
> > > based
> > > > on
> > > > > > all
> > > > > > > our discussion above.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Yash
> > > > > > >
> > > > > > > On Thu, Nov 3, 2022 at 11:01 PM Sagar <
> sagarmeansoc...@gmail.com
> > >
> > > > > wrote:
> > > > > > >
> > > > > > >> Hey Yash,
> > > > > > >>
> > > > > > >> Thanks for the KIP! This looks like a useful feature.
> > > > > > >>
> > > > > > >> I think the discussion thread already has some great points by
> > > > Chris.
> > > > > > Just
> > > > > > >> a couple of points/clarifications=>
> > > > > > >>
> > > > > > >> Regarding, pt#2 , I guess it might be better to forward to the
> > > > leader
> > > > > as
> > > > > > >> suggested by Yash. Having said that, why does the worker
> forward
> > > to
> > > > > the
> > > > > > >> leader? I am thinking if the worker can perform the validation
> > on
> > > > it's
> > > > > > >> own,
> > > > > > >> we could let it do the validation instead of forwarding
> > everything
> > > > to
> > > > > > the
> > > > > > >> leader(even though it might be cheap to forward all requests
> to
> > > the
> > > > > > >> leader).
> > > > > > >>
> > > > > > >> Pt#3 => I think a bound is certainly needed but IMO it
> shouldn't
> > > go
> > > > > > beyond
> > > > > > >> 10 mins considering this is just validation. We shouldn't end
> up
> > > in
> > > > a
> > > > > > >> situation where a few faulty connectors end up blocking a lot
> of
> > > > > request
> > > > > > >> processing threads, so while increasing the config is
> certainly
> > > > > helpful,
> > > > > > >> we
> > > > > > >> shouldn't set too high a value IMO. Of course I am also open
> to
> > > > > > >> suggestions
> > > > > > >> here.
> > > > > > >>
> > > > > > >> Thanks!
> > > > > > >> Sagar.
> > > > > > >>
> > > > > > >> On Thu, Nov 3, 2022 at 9:01 PM Chris Egerton
> > > > <chr...@aiven.io.invalid
> > > > > >
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >> > Hi Yash,
> > > > > > >> >
> > > > > > >> > RE 2: That's a great point about validations already being
> > > > performed
> > > > > > by
> > > > > > >> the
> > > > > > >> > leader. For completeness's sake, I'd like to note that this
> > only
> > > > > holds
> > > > > > >> for
> > > > > > >> > valid configurations; invalid ones are caught right now
> before
> > > > being
> > > > > > >> > forwarded to the leader. Still, I think it's fine to forward
> > to
> > > > the
> > > > > > >> leader
> > > > > > >> > for now and optimize further in the future if necessary. If
> > > > frequent
> > > > > > >> > validations are taking place they should be conducted via
> the
> > > `PUT
> > > > > > >> > /connector-plugins/{pluginName}/config/validate` endpoint,
> > which
> > > > > won't
> > > > > > >> do
> > > > > > >> > any forwarding at all.
> > > > > > >> >
> > > > > > >> > RE 3: Yes, those endpoints LGTM. And yes, bounds on the
> > timeout
> > > > also
> > > > > > >> seem
> > > > > > >> > reasonable... maybe a low-importance worker property could
> > work
> > > > for
> > > > > > >> that?
> > > > > > >> > Not sure what would make sense for a default; probably
> > somewhere
> > > > in
> > > > > > the
> > > > > > >> > 10-60 minute range but would be interested in others'
> > thoughts.
> > > > > > >> >
> > > > > > >> > Thanks for the clarification on the zombie fencing logic. I
> > > think
> > > > we
> > > > > > >> might
> > > > > > >> > want to have some more subtle logic around the interaction
> > > between
> > > > > > >> calls to
> > > > > > >> > Admin::fenceProducers and a worker-level timeout property if
> > we
> > > go
> > > > > > that
> > > > > > >> > route, but we can cross that particular bridge if we get
> back
> > to
> > > > it.
> > > > > > >> >
> > > > > > >> > Cheers,
> > > > > > >> >
> > > > > > >> > Chris
> > > > > > >> >
> > > > > > >> > On Wed, Nov 2, 2022 at 1:48 PM Yash Mayya <
> > yash.ma...@gmail.com
> > > >
> > > > > > wrote:
> > > > > > >> >
> > > > > > >> > > Hi Chris,
> > > > > > >> > >
> > > > > > >> > > Thanks a lot for the super quick response and the great
> > > > feedback!
> > > > > > >> > >
> > > > > > >> > > 1. I think that makes a lot of sense, and I'd be happy to
> > > update
> > > > > the
> > > > > > >> KIP
> > > > > > >> > to
> > > > > > >> > > include this change in the scope. The current behavior
> where
> > > the
> > > > > API
> > > > > > >> > > response indicates a time out but the connector is
> > > > created/updated
> > > > > > >> > > eventually anyway can be pretty confusing and is generally
> > > not a
> > > > > > good
> > > > > > >> > user
> > > > > > >> > > experience IMO.
> > > > > > >> > >
> > > > > > >> > > 2. Wow, thanks for pointing this out - it's a really good
> > > catch
> > > > > and
> > > > > > >> > > something I hadn't noticed was happening with the current
> > > > > > >> implementation.
> > > > > > >> > > While I do like the idea of having a query parameter that
> > > > > determines
> > > > > > >> > > whether validations can be skipped, I'm wondering if it
> > might
> > > > not
> > > > > be
> > > > > > >> > easier
> > > > > > >> > > and cleaner to just do the leader check earlier and avoid
> > > doing
> > > > > the
> > > > > > >> > > unnecessary config validation on the first worker? Since
> > each
> > > > > config
> > > > > > >> > > validation happens on its own thread, I'm not so sure
> about
> > > the
> > > > > > >> concern
> > > > > > >> > of
> > > > > > >> > > overloading the leader even on larger clusters, especially
> > > since
> > > > > > >> > > validations aren't typically long running operations.
> > > > Furthermore,
> > > > > > >> even
> > > > > > >> > > with the current implementation, the leader will always be
> > > > doing a
> > > > > > >> config
> > > > > > >> > > validation for connector create / update REST API requests
> > on
> > > > any
> > > > > > >> worker.
> > > > > > >> > >
> > > > > > >> > > 3. That's a good point, and this way we can also restrict
> > the
> > > > APIs
> > > > > > >> whose
> > > > > > >> > > timeouts are configurable - I'm thinking `PUT
> > > > > > >> > > /connector-plugins/{pluginName}/config/validate`, `POST
> > > > > /connectors`
> > > > > > >> and
> > > > > > >> > > `PUT /connectors/{connector}/config` are the ones where
> > such a
> > > > > > timeout
> > > > > > >> > > parameter could be useful. Also, do you think we should
> > > enforce
> > > > > some
> > > > > > >> > > reasonable bounds for the timeout config?
> > > > > > >> > >
> > > > > > >> > > On the zombie fencing point, the implication was that the
> > new
> > > > > worker
> > > > > > >> > > property would not control the timeout used for the call
> to
> > > > > > >> > > Admin::fenceProducers. However, if we go with a timeout
> > query
> > > > > > >> parameter
> > > > > > >> > > approach, even the timeout for the `PUT
> > > > > > /connectors/{connector}/fence'
> > > > > > >> > > endpoint will remain unaffected.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Yash
> > > > > > >> > >
> > > > > > >> > > On Wed, Nov 2, 2022 at 8:13 PM Chris Egerton
> > > > > > <chr...@aiven.io.invalid
> > > > > > >> >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Hi Yash,
> > > > > > >> > > >
> > > > > > >> > > > Thanks for the KIP. It's a nice, focused change.
> > Initially I
> > > > was
> > > > > > >> > hesitant
> > > > > > >> > > > to support cases where connector validation takes this
> > long,
> > > > but
> > > > > > >> > > > considering the alternative is that we give users a 500
> > > error
> > > > > > >> response
> > > > > > >> > > but
> > > > > > >> > > > leave the request to create/modify the connector queued
> up
> > > in
> > > > > the
> > > > > > >> > > herder, I
> > > > > > >> > > > think I can get behind the motivation here. There's also
> > an
> > > > > > >> argument to
> > > > > > >> > > be
> > > > > > >> > > > made about keeping Kafka Connect available even when the
> > > > systems
> > > > > > >> that
> > > > > > >> > it
> > > > > > >> > > > connects to are in a degraded state.
> > > > > > >> > > >
> > > > > > >> > > > I have a few alternatives I'd be interested in your
> > thoughts
> > > > on:
> > > > > > >> > > >
> > > > > > >> > > > 1. Since the primary concern here seems to be that
> custom
> > > > > > connector
> > > > > > >> > > > validation logic can take too long, do we have any
> > thoughts
> > > on
> > > > > > >> adding
> > > > > > >> > > logic
> > > > > > >> > > > to check for request timeout after validation has
> > completed
> > > > and,
> > > > > > if
> > > > > > >> it
> > > > > > >> > > has,
> > > > > > >> > > > aborting the attempt to create/modify the connector?
> > > > > > >> > > >
> > > > > > >> > > > 2. Right now it's possible that we'll perform two
> > connector
> > > > > config
> > > > > > >> > > > validations per create/modify request; once on the
> worker
> > > that
> > > > > > >> > initially
> > > > > > >> > > > receives the request, and then again if that worker is
> not
> > > the
> > > > > > >> leader
> > > > > > >> > of
> > > > > > >> > > > the cluster and has to forward the request to the
> leader.
> > > Any
> > > > > > >> thoughts
> > > > > > >> > on
> > > > > > >> > > > optimizing this to only require a single validation per
> > > > request?
> > > > > > We
> > > > > > >> > > > probably wouldn't want to force all validations to take
> > > place
> > > > on
> > > > > > the
> > > > > > >> > > leader
> > > > > > >> > > > (could lead to overloading it pretty quickly in large
> > > > clusters),
> > > > > > >> but we
> > > > > > >> > > > could add an internal-only query parameter to skip
> > > validation
> > > > > and
> > > > > > >> then
> > > > > > >> > > use
> > > > > > >> > > > that parameter when forwarding requests from followers
> to
> > > the
> > > > > > >> leader.
> > > > > > >> > > >
> > > > > > >> > > > 3. A worker property is pretty coarse-grained, and
> > difficult
> > > > to
> > > > > > >> change.
> > > > > > >> > > We
> > > > > > >> > > > might allow per-request toggling of the timeout by
> adding
> > a
> > > > URL
> > > > > > >> query
> > > > > > >> > > > parameter like '?timeout=90s' to the REST API to allow
> > > > tweaking
> > > > > of
> > > > > > >> the
> > > > > > >> > > > timeout on a more granular basis, and without having to
> > > > perform
> > > > > a
> > > > > > >> > worker
> > > > > > >> > > > restart.
> > > > > > >> > > >
> > > > > > >> > > > I'd also like to clarify a point about the rejected
> > > > alternative
> > > > > > >> "Allow
> > > > > > >> > > > configuring producer zombie fencing admin request
> > > timeout"--is
> > > > > the
> > > > > > >> > > > implication here that the "rest.api.request.timeout.ms"
> > > > > property
> > > > > > >> will
> > > > > > >> > > not
> > > > > > >> > > > control the REST timeout for requests to the 'PUT
> > > > > > >> > > > /connectors/{connector}/fence' endpoint, or just that it
> > > won't
> > > > > > >> control
> > > > > > >> > > the
> > > > > > >> > > > timeout that we use for the call to
> Admin::fenceProducers?
> > > > > > >> > > >
> > > > > > >> > > > Cheers,
> > > > > > >> > > >
> > > > > > >> > > > Chris
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Nov 2, 2022 at 10:07 AM Yash Mayya <
> > > > > yash.ma...@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi all,
> > > > > > >> > > > >
> > > > > > >> > > > > I'd like to start a discussion thread on this small
> KIP
> > -
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://eu01.z.antigena.com/l/8mqVuSEhdZF05LKGk7J1Cb6esUWht6ps4-73ocd196fgyxC~lGhTLqBEt9BPJy1MBx6RzKVi7qRYYpJwCHRixlmdcayj7YMv~-0fIZJ2f8EjEhp-Qck3~O3Ga1JBTgaY4tTAlwh0VjF6217iBaGVG2PNa-2TO5z8zC5s0Nb28kcG2mUg1tcEfzsAtmWEXWRECoeD4cdUXyPvWeFtYL-ccVv-PmQydgSIYkdeR7~XItAXezft9Mfs5pzjejXSK~MhWW
> > > > > > >> > > > >
> > > > > > >> > > > > It proposes the addition of a new Kafka Connect worker
> > > > > > >> configuration
> > > > > > >> > to
> > > > > > >> > > > > allow configuring REST API request timeouts.
> > > > > > >> > > > >
> > > > > > >> > > > > Thanks,
> > > > > > >> > > > > Yash
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-882: Make Kafka Connect REST API request timeouts configurable

Reply via email to