Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

Chris Egerton Sat, 15 Jun 2024 10:03:58 -0700

Hi Edoardo,

Yes, you're completely right that this is a health check for the runtime,
not for specific connectors/tasks. The gap that this is meant to address
(which is not covered by the existing status endpoints for
connectors/tasks) is that endpoints related to the
creation/reconfiguration/deletion/etc. of connectors can become unavailable
on a worker, but in a way that doesn't prevent the status endpoints from
being queried, or connectors/tasks that were already assigned to the worker
from running. If this health check endpoint reports that the worker is
healthy, it means that those endpoints should also be available.


Cheers,

Chris

On Fri, Jun 14, 2024 at 12:13 PM Edoardo Comar <edoardli...@gmail.com>
wrote:

> Chris, I was about to vote too but have one trivial question.
>
> You designed this health check for a completely generic connect runtime,
> regardless of which connectors it's running, right?
>
> Because if I run a well-known to me set of connectors, i.e. mm2 :-) ,
> then checking the connectors and tasks state is a good healthcheck already.
>
> In such case, I think the value KIP-1017 is allow to find a problem in
> the runtime rather than the connectors.
> would you agree?
>
> On Fri, 14 Jun 2024 at 15:44, Chris Egerton <fearthecel...@gmail.com>
> wrote:
> >
> > Thanks Viktor! Seems as good a time as any to kick off the vote thread.
> >
> > On Fri, Jun 14, 2024, 10:43 Viktor Somogyi-Vass
> > <viktor.somo...@cloudera.com.invalid> wrote:
> >
> > > Hi Chris,
> > >
> > > Both points make sense to me.
> > >
> > > 1) I'm good with your original solution where it would return after 10
> > > seconds for both 5xx endpoints. The second solution seems complicated
> and
> > > maybe we shouldn't overcomplicate it.
> > > 2) Unfortunately I can't commit to that hypothetical KIP but I'll make
> a
> > > mental note :)
> > >
> > > I'm +1 on the KIP.
> > >
> > > Thanks,
> > > Viktor
> > >
> > > On Tue, Jun 11, 2024 at 9:53 PM Chris Egerton <chr...@aiven.io.invalid
> >
> > > wrote:
> > >
> > > > Hi Viktor,
> > > >
> > > > Thanks for your comments!
> > > >
> > > > 1) I'm realizing now that there's some implicit behavior in the KIP
> that
> > > I
> > > > should spell out explicitly. I was thinking that the timeout of 10
> > > seconds
> > > > that I mentioned in the "Public Interfaces" section would have to
> elapse
> > > > before any 5xx responses were issued. So, if someone pinged the
> health
> > > > endpoint while the worker was starting up, the worker would take up
> to 10
> > > > seconds to try to complete startup before giving up and responding
> to the
> > > > request with a 503 status. This would obviate the need for a
> Retry-After
> > > > header because it would apply a natural rate limiting to these
> requests
> > > > during worker startup. Does this seem reasonable? We could diverge in
> > > > behavior and only apply the 10 second timeout to the 500 case (i.e.,
> > > worker
> > > > has completed startup but is not live for other reasons), at which
> point
> > > a
> > > > Retry-After header for the 503 case (worker still starting up) would
> make
> > > > sense, but I can't think of any benefits to this approach. Thoughts?
> > > >
> > > > 2) As of KAFKA-15563 [1], we have some great infrastructure in place
> to
> > > > identify worker actions that might be good candidates for the
> contents of
> > > > this "worker events" topic. But I don't think conflating the
> retrieval of
> > > > these events with the health endpoint is a good idea--IMO it should
> be a
> > > > separate endpoint and the health endpoint should stay lightweight and
> > > > simple. I'm also not sure it's necessary to expose the contents of
> this
> > > > kind of topic via the REST API at all; we could instruct users to
> consume
> > > > directly from the topic if they'd like to know the history of the
> worker.
> > > > Overall it seems like a decent idea and I'd be happy to review a KIP
> for
> > > > it, but like you mention, it seems like a pretty drastic change in
> scope
> > > > and I don't think it needs to be included in this proposal.
> > > >
> > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563
> > > >
> > > > Cheers,
> > > >
> > > > Chris
> > > >
> > > > On Tue, Jun 11, 2024 at 11:42 AM Viktor Somogyi-Vass
> > > > <viktor.somo...@cloudera.com.invalid> wrote:
> > > >
> > > > > Hi Chris,
> > > > >
> > > > > I also have 2 other comments:
> > > > >
> > > > > 1. One more thing I came across is that should we provide the
> > > Retry-After
> > > > > header in the response in case of 503 response? Although I'm not
> sure
> > > how
> > > > > many clients honor this, we could add it just in case some does
> and if
> > > > you
> > > > > also find it useful. (We could default it to retry.backoff.ms.)
> > > > >
> > > > > 2. Adding to Adrian's comments, storing timestamped worker
> statuses in
> > > an
> > > > > internal topic and then reading them from there would add valuable
> info
> > > > > about what the worker does. For instance GET
> > > > /health?startTime=45345323346
> > > > > could fetch events from the given timestamp additionally to your
> > > proposed
> > > > > behavior. Also, if the rest server isn't available, it would serve
> in
> > > > > itself as a log about the workers' behavior. I understand if you
> think
> > > > it's
> > > > > such a scope change that it should be an improvement KIP, but would
> > > like
> > > > to
> > > > > gauge what you think about this.
> > > > >
> > > > > Regards,
> > > > > Viktor
> > > > >
> > > > > On Tue, Jun 11, 2024 at 4:34 PM Chris Egerton
> <chr...@aiven.io.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > Hi Adrian,
> > > > > >
> > > > > > Thanks for your comments/questions! The responses to them are
> related
> > > > so
> > > > > > I'll try to address both at once.
> > > > > >
> > > > > > The most recent update I made to the KIP should help provide
> insight
> > > > into
> > > > > > what's going wrong if a non-200 response is returned. I don't
> plan on
> > > > > > adding any structured data such as error codes or something like
> a
> > > > > "phase"
> > > > > > field with values like READING_CONFIG_TOPIC quite yet, but there
> is
> > > > room
> > > > > > for us to add human-readable information on the causes of
> failure in
> > > > the
> > > > > > "message" field (see KAFKA-15563 [1] and its PR [2] for an
> example of
> > > > > what
> > > > > > kind of information we might provide to users). Part of the
> problem
> > > is
> > > > > that
> > > > > > while I've heard plenty of (justified!) complaints about the
> Kafka
> > > > > Connect
> > > > > > REST API becoming unavailable and the difficulties users face
> with
> > > > > > debugging their workers when that happens, I still don't feel we
> > > have a
> > > > > > strong-enough grasp on the common causes for this scenario to
> commit
> > > > to a
> > > > > > response format that could be more machine-readable, and it can
> be
> > > > > > surprisingly difficult to get to a root cause in some cases.
> > > > > >
> > > > > > I'm anticipating that users will rely on the endpoint primarily
> for
> > > two
> > > > > > things:
> > > > > > 1) Ensuring that a worker has completed startup successfully
> during a
> > > > > > rolling upgrade (if you don't get a 200 after long enough, abort
> the
> > > > > > upgrade, check the error message, and start investigating)
> > > > > > 2) Ensuring that a worker remains healthy after it has joined the
> > > > cluster
> > > > > > (if you don't get a 200 for a sustained period of time, check the
> > > error
> > > > > > message, and then decide whether to restart the process or issue
> a
> > > > page)
> > > > > >
> > > > > > It's primarily designed to be easy to incorporate with automated
> > > > tooling
> > > > > > that has support for REST-based process health monitoring, while
> also
> > > > > > providing some human-readable information (when possible) if the
> > > worker
> > > > > > isn't healthy. This human-readable information should hopefully
> help
> > > > > people
> > > > > > gauge how to respond to non-200 responses, and we can try to
> improve
> > > > > > wording and granularity over time based on user feedback. You and
> > > other
> > > > > > users may develop automated responses based on the content of the
> > > error
> > > > > > messages, but beware that the wording may change across releases.
> > > > > >
> > > > > > Does that seem reasonable for V1 of this feature? I can
> definitely
> > > see
> > > > > room
> > > > > > for expansion of the response format in the future, but would
> like to
> > > > > hold
> > > > > > off on that for now.
> > > > > >
> > > > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563
> > > > > > [2] - https://github.com/apache/kafka/pull/14562
> > > > > >
> > > > > > Cheers,
> > > > > >
> > > > > > Chris
> > > > > >
> > > > > > On Tue, Jun 11, 2024 at 3:37 AM Adrian Preston <
> prest...@uk.ibm.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Chris,
> > > > > > >
> > > > > > > Good KIP – I think it will be very helpful in alerting and
> > > automating
> > > > > the
> > > > > > > resolution of common Connect problems.
> > > > > > >
> > > > > > > I have a couple of questions / suggestions:
> > > > > > >
> > > > > > > 1. What are you planning on documenting as guidance for using
> this
> > > > new
> > > > > > > endpoint? My guess would be that if Connect doesn’t return a
> status
> > > > of
> > > > > > 200
> > > > > > > after some period I would either page someone, or restart the
> > > > process?
> > > > > > But
> > > > > > > I’m missing the nuance of distinguishing between readiness and
> > > > > liveness,
> > > > > > is
> > > > > > > this for maintaining overall availability when rolling out
> updates
> > > to
> > > > > > > several Connect processes?
> > > > > > >
> > > > > > > 2. Would you consider providing a way to discover details about
> > > > exactly
> > > > > > > which condition (or conditions) is/are failing? Perhaps by
> > > providing
> > > > a
> > > > > > > richer JSON response? Something like this would help us adopt
> the
> > > > > health
> > > > > > > check, as we could start by paging someone for all failures,
> then
> > > > over
> > > > > > time
> > > > > > > (as we gained confidence) transition more of the conditions
> over to
> > > > > being
> > > > > > > handled by automation.
> > > > > > >
> > > > > > > Regards,
> > > > > > > - Adrian
> > > > > > >
> > > > > > >
> > > > > > > From: Chris Egerton <chr...@aiven.io.INVALID>
> > > > > > > Date: Monday, 10 June 2024 at 15:26
> > > > > > > To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > > Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check
> endpoint
> > > > for
> > > > > > > Kafka Connect
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Thanks for the positive feedback!
> > > > > > >
> > > > > > > I've made one small addition to the KIP since there's been a
> change
> > > > to
> > > > > > our
> > > > > > > REST timeout error messages that's worth incorporating here.
> > > Quoting
> > > > > the
> > > > > > > added section directly:
> > > > > > >
> > > > > > > > Note that the HTTP status codes and "status" fields in the
> JSON
> > > > > > response
> > > > > > > will match the exact examples above. However, the "message"
> field
> > > may
> > > > > be
> > > > > > > augmented to include, among other things, more information
> about
> > > the
> > > > > > > operation(s) the worker could be blocked on (such as was added
> in
> > > > REST
> > > > > > > timeout error messages in KAFKA-15563 [1]).
> > > > > > >
> > > > > > > Assuming this still looks okay to everyone, I'll kick off a
> vote
> > > > thread
> > > > > > > sometime this week or the next.
> > > > > > >
> > > > > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563
> > > > > > >
> > > > > > > Cheers,
> > > > > > >
> > > > > > > Chris
> > > > > > >
> > > > > > > On Fri, Jun 7, 2024 at 12:01 PM Andrew Schofield <
> > > > > > > andrew_schofi...@live.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi Chris,
> > > > > > > > This KIP looks good to me. I particularly like the
> explanation of
> > > > how
> > > > > > the
> > > > > > > > result will specifically
> > > > > > > > check the worker health in ways that have previously caused
> > > > trouble.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Andrew
> > > > > > > >
> > > > > > > > > On 7 Jun 2024, at 16:18, Mickael Maison <
> > > > mickael.mai...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Chris,
> > > > > > > > >
> > > > > > > > > Happy Friday! The KIP looks good to me. +1
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Mickael
> > > > > > > > >
> > > > > > > > > On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton
> > > > > > <chr...@aiven.io.invalid
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >> Hi all,
> > > > > > > > >>
> > > > > > > > >> Happy Friday! I'd like to kick off discussion for
> KIP-1017,
> > > > which
> > > > > > (as
> > > > > > > > the
> > > > > > > > >> title suggests) proposes adding a health check endpoint
> for
> > > > Kafka
> > > > > > > > Connect:
> > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect
> > > > > > > > >>
> > > > > > > > >> This is one of the longest-standing issues with Kafka
> Connect
> > > > and
> > > > > > I'm
> > > > > > > > >> hoping we can finally put it in the ground soon. Looking
> > > forward
> > > > > to
> > > > > > > > hearing
> > > > > > > > >> people's thoughts!
> > > > > > > > >>
> > > > > > > > >> Cheers,
> > > > > > > > >>
> > > > > > > > >> Chris
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > > Unless otherwise stated above:
> > > > > > >
> > > > > > > IBM United Kingdom Limited
> > > > > > > Registered in England and Wales with number 741598
> > > > > > > Registered office: PO Box 41, North Harbour, Portsmouth,
> Hants. PO6
> > > > 3AU
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
>

Re: [DISCUSS] KIP-1017: A health check endpoint for Kafka Connect

Reply via email to