Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

Edoardo Comar Fri, 14 Jun 2024 09:13:51 -0700

Chris, I was about to vote too but have one trivial question.

You designed this health check for a completely generic connect runtime,
regardless of which connectors it's running, right?


Because if I run a well-known to me set of connectors, i.e. mm2 :-) ,
then checking the connectors and tasks state is a good healthcheck already.

In such case, I think the value KIP-1017 is allow to find a problem in
the runtime rather than the connectors.
would you agree?

On Fri, 14 Jun 2024 at 15:44, Chris Egerton <fearthecel...@gmail.com> wrote:
>
> Thanks Viktor! Seems as good a time as any to kick off the vote thread.
>
> On Fri, Jun 14, 2024, 10:43 Viktor Somogyi-Vass
> <viktor.somo...@cloudera.com.invalid> wrote:
>
> > Hi Chris,
> >
> > Both points make sense to me.
> >
> > 1) I'm good with your original solution where it would return after 10
> > seconds for both 5xx endpoints. The second solution seems complicated and
> > maybe we shouldn't overcomplicate it.
> > 2) Unfortunately I can't commit to that hypothetical KIP but I'll make a
> > mental note :)
> >
> > I'm +1 on the KIP.
> >
> > Thanks,
> > Viktor
> >
> > On Tue, Jun 11, 2024 at 9:53 PM Chris Egerton <chr...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Viktor,
> > >
> > > Thanks for your comments!
> > >
> > > 1) I'm realizing now that there's some implicit behavior in the KIP that
> > I
> > > should spell out explicitly. I was thinking that the timeout of 10
> > seconds
> > > that I mentioned in the "Public Interfaces" section would have to elapse
> > > before any 5xx responses were issued. So, if someone pinged the health
> > > endpoint while the worker was starting up, the worker would take up to 10
> > > seconds to try to complete startup before giving up and responding to the
> > > request with a 503 status. This would obviate the need for a Retry-After
> > > header because it would apply a natural rate limiting to these requests
> > > during worker startup. Does this seem reasonable? We could diverge in
> > > behavior and only apply the 10 second timeout to the 500 case (i.e.,
> > worker
> > > has completed startup but is not live for other reasons), at which point
> > a
> > > Retry-After header for the 503 case (worker still starting up) would make
> > > sense, but I can't think of any benefits to this approach. Thoughts?
> > >
> > > 2) As of KAFKA-15563 [1], we have some great infrastructure in place to
> > > identify worker actions that might be good candidates for the contents of
> > > this "worker events" topic. But I don't think conflating the retrieval of
> > > these events with the health endpoint is a good idea--IMO it should be a
> > > separate endpoint and the health endpoint should stay lightweight and
> > > simple. I'm also not sure it's necessary to expose the contents of this
> > > kind of topic via the REST API at all; we could instruct users to consume
> > > directly from the topic if they'd like to know the history of the worker.
> > > Overall it seems like a decent idea and I'd be happy to review a KIP for
> > > it, but like you mention, it seems like a pretty drastic change in scope
> > > and I don't think it needs to be included in this proposal.
> > >
> > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563
> > >
> > > Cheers,
> > >
> > > Chris
> > >
> > > On Tue, Jun 11, 2024 at 11:42 AM Viktor Somogyi-Vass
> > > <viktor.somo...@cloudera.com.invalid> wrote:
> > >
> > > > Hi Chris,
> > > >
> > > > I also have 2 other comments:
> > > >
> > > > 1. One more thing I came across is that should we provide the
> > Retry-After
> > > > header in the response in case of 503 response? Although I'm not sure
> > how
> > > > many clients honor this, we could add it just in case some does and if
> > > you
> > > > also find it useful. (We could default it to retry.backoff.ms.)
> > > >
> > > > 2. Adding to Adrian's comments, storing timestamped worker statuses in
> > an
> > > > internal topic and then reading them from there would add valuable info
> > > > about what the worker does. For instance GET
> > > /health?startTime=45345323346
> > > > could fetch events from the given timestamp additionally to your
> > proposed
> > > > behavior. Also, if the rest server isn't available, it would serve in
> > > > itself as a log about the workers' behavior. I understand if you think
> > > it's
> > > > such a scope change that it should be an improvement KIP, but would
> > like
> > > to
> > > > gauge what you think about this.
> > > >
> > > > Regards,
> > > > Viktor
> > > >
> > > > On Tue, Jun 11, 2024 at 4:34 PM Chris Egerton <chr...@aiven.io.invalid
> > >
> > > > wrote:
> > > >
> > > > > Hi Adrian,
> > > > >
> > > > > Thanks for your comments/questions! The responses to them are related
> > > so
> > > > > I'll try to address both at once.
> > > > >
> > > > > The most recent update I made to the KIP should help provide insight
> > > into
> > > > > what's going wrong if a non-200 response is returned. I don't plan on
> > > > > adding any structured data such as error codes or something like a
> > > > "phase"
> > > > > field with values like READING_CONFIG_TOPIC quite yet, but there is
> > > room
> > > > > for us to add human-readable information on the causes of failure in
> > > the
> > > > > "message" field (see KAFKA-15563 [1] and its PR [2] for an example of
> > > > what
> > > > > kind of information we might provide to users). Part of the problem
> > is
> > > > that
> > > > > while I've heard plenty of (justified!) complaints about the Kafka
> > > > Connect
> > > > > REST API becoming unavailable and the difficulties users face with
> > > > > debugging their workers when that happens, I still don't feel we
> > have a
> > > > > strong-enough grasp on the common causes for this scenario to commit
> > > to a
> > > > > response format that could be more machine-readable, and it can be
> > > > > surprisingly difficult to get to a root cause in some cases.
> > > > >
> > > > > I'm anticipating that users will rely on the endpoint primarily for
> > two
> > > > > things:
> > > > > 1) Ensuring that a worker has completed startup successfully during a
> > > > > rolling upgrade (if you don't get a 200 after long enough, abort the
> > > > > upgrade, check the error message, and start investigating)
> > > > > 2) Ensuring that a worker remains healthy after it has joined the
> > > cluster
> > > > > (if you don't get a 200 for a sustained period of time, check the
> > error
> > > > > message, and then decide whether to restart the process or issue a
> > > page)
> > > > >
> > > > > It's primarily designed to be easy to incorporate with automated
> > > tooling
> > > > > that has support for REST-based process health monitoring, while also
> > > > > providing some human-readable information (when possible) if the
> > worker
> > > > > isn't healthy. This human-readable information should hopefully help
> > > > people
> > > > > gauge how to respond to non-200 responses, and we can try to improve
> > > > > wording and granularity over time based on user feedback. You and
> > other
> > > > > users may develop automated responses based on the content of the
> > error
> > > > > messages, but beware that the wording may change across releases.
> > > > >
> > > > > Does that seem reasonable for V1 of this feature? I can definitely
> > see
> > > > room
> > > > > for expansion of the response format in the future, but would like to
> > > > hold
> > > > > off on that for now.
> > > > >
> > > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563
> > > > > [2] - https://github.com/apache/kafka/pull/14562
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Chris
> > > > >
> > > > > On Tue, Jun 11, 2024 at 3:37 AM Adrian Preston <prest...@uk.ibm.com>
> > > > > wrote:
> > > > >
> > > > > > Hi Chris,
> > > > > >
> > > > > > Good KIP – I think it will be very helpful in alerting and
> > automating
> > > > the
> > > > > > resolution of common Connect problems.
> > > > > >
> > > > > > I have a couple of questions / suggestions:
> > > > > >
> > > > > > 1. What are you planning on documenting as guidance for using this
> > > new
> > > > > > endpoint? My guess would be that if Connect doesn’t return a status
> > > of
> > > > > 200
> > > > > > after some period I would either page someone, or restart the
> > > process?
> > > > > But
> > > > > > I’m missing the nuance of distinguishing between readiness and
> > > > liveness,
> > > > > is
> > > > > > this for maintaining overall availability when rolling out updates
> > to
> > > > > > several Connect processes?
> > > > > >
> > > > > > 2. Would you consider providing a way to discover details about
> > > exactly
> > > > > > which condition (or conditions) is/are failing? Perhaps by
> > providing
> > > a
> > > > > > richer JSON response? Something like this would help us adopt the
> > > > health
> > > > > > check, as we could start by paging someone for all failures, then
> > > over
> > > > > time
> > > > > > (as we gained confidence) transition more of the conditions over to
> > > > being
> > > > > > handled by automation.
> > > > > >
> > > > > > Regards,
> > > > > > - Adrian
> > > > > >
> > > > > >
> > > > > > From: Chris Egerton <chr...@aiven.io.INVALID>
> > > > > > Date: Monday, 10 June 2024 at 15:26
> > > > > > To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check endpoint
> > > for
> > > > > > Kafka Connect
> > > > > > Hi all,
> > > > > >
> > > > > > Thanks for the positive feedback!
> > > > > >
> > > > > > I've made one small addition to the KIP since there's been a change
> > > to
> > > > > our
> > > > > > REST timeout error messages that's worth incorporating here.
> > Quoting
> > > > the
> > > > > > added section directly:
> > > > > >
> > > > > > > Note that the HTTP status codes and "status" fields in the JSON
> > > > > response
> > > > > > will match the exact examples above. However, the "message" field
> > may
> > > > be
> > > > > > augmented to include, among other things, more information about
> > the
> > > > > > operation(s) the worker could be blocked on (such as was added in
> > > REST
> > > > > > timeout error messages in KAFKA-15563 [1]).
> > > > > >
> > > > > > Assuming this still looks okay to everyone, I'll kick off a vote
> > > thread
> > > > > > sometime this week or the next.
> > > > > >
> > > > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Fri, Jun 7, 2024 at 12:01 PM Andrew Schofield <
> > > > > > andrew_schofi...@live.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Chris,
> > > > > > > This KIP looks good to me. I particularly like the explanation of
> > > how
> > > > > the
> > > > > > > result will specifically
> > > > > > > check the worker health in ways that have previously caused
> > > trouble.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Andrew
> > > > > > >
> > > > > > > > On 7 Jun 2024, at 16:18, Mickael Maison <
> > > mickael.mai...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Chris,
> > > > > > > >
> > > > > > > > Happy Friday! The KIP looks good to me. +1
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Mickael
> > > > > > > >
> > > > > > > > On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton
> > > > > <chr...@aiven.io.invalid
> > > > > > >
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> Hi all,
> > > > > > > >>
> > > > > > > >> Happy Friday! I'd like to kick off discussion for KIP-1017,
> > > which
> > > > > (as
> > > > > > > the
> > > > > > > >> title suggests) proposes adding a health check endpoint for
> > > Kafka
> > > > > > > Connect:
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect
> > > > > > > >>
> > > > > > > >> This is one of the longest-standing issues with Kafka Connect
> > > and
> > > > > I'm
> > > > > > > >> hoping we can finally put it in the ground soon. Looking
> > forward
> > > > to
> > > > > > > hearing
> > > > > > > >> people's thoughts!
> > > > > > > >>
> > > > > > > >> Cheers,
> > > > > > > >>
> > > > > > > >> Chris
> > > > > > >
> > > > > > >
> > > > > >
> > > > > > Unless otherwise stated above:
> > > > > >
> > > > > > IBM United Kingdom Limited
> > > > > > Registered in England and Wales with number 741598
> > > > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6
> > > 3AU
> > > > > >
> > > > >
> > > >
> > >
> >

Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

Reply via email to