Chris, I was about to vote too but have one trivial question. You designed this health check for a completely generic connect runtime, regardless of which connectors it's running, right?
Because if I run a well-known to me set of connectors, i.e. mm2 :-) , then checking the connectors and tasks state is a good healthcheck already. In such case, I think the value KIP-1017 is allow to find a problem in the runtime rather than the connectors. would you agree? On Fri, 14 Jun 2024 at 15:44, Chris Egerton <fearthecel...@gmail.com> wrote: > > Thanks Viktor! Seems as good a time as any to kick off the vote thread. > > On Fri, Jun 14, 2024, 10:43 Viktor Somogyi-Vass > <viktor.somo...@cloudera.com.invalid> wrote: > > > Hi Chris, > > > > Both points make sense to me. > > > > 1) I'm good with your original solution where it would return after 10 > > seconds for both 5xx endpoints. The second solution seems complicated and > > maybe we shouldn't overcomplicate it. > > 2) Unfortunately I can't commit to that hypothetical KIP but I'll make a > > mental note :) > > > > I'm +1 on the KIP. > > > > Thanks, > > Viktor > > > > On Tue, Jun 11, 2024 at 9:53 PM Chris Egerton <chr...@aiven.io.invalid> > > wrote: > > > > > Hi Viktor, > > > > > > Thanks for your comments! > > > > > > 1) I'm realizing now that there's some implicit behavior in the KIP that > > I > > > should spell out explicitly. I was thinking that the timeout of 10 > > seconds > > > that I mentioned in the "Public Interfaces" section would have to elapse > > > before any 5xx responses were issued. So, if someone pinged the health > > > endpoint while the worker was starting up, the worker would take up to 10 > > > seconds to try to complete startup before giving up and responding to the > > > request with a 503 status. This would obviate the need for a Retry-After > > > header because it would apply a natural rate limiting to these requests > > > during worker startup. Does this seem reasonable? We could diverge in > > > behavior and only apply the 10 second timeout to the 500 case (i.e., > > worker > > > has completed startup but is not live for other reasons), at which point > > a > > > Retry-After header for the 503 case (worker still starting up) would make > > > sense, but I can't think of any benefits to this approach. Thoughts? > > > > > > 2) As of KAFKA-15563 [1], we have some great infrastructure in place to > > > identify worker actions that might be good candidates for the contents of > > > this "worker events" topic. But I don't think conflating the retrieval of > > > these events with the health endpoint is a good idea--IMO it should be a > > > separate endpoint and the health endpoint should stay lightweight and > > > simple. I'm also not sure it's necessary to expose the contents of this > > > kind of topic via the REST API at all; we could instruct users to consume > > > directly from the topic if they'd like to know the history of the worker. > > > Overall it seems like a decent idea and I'd be happy to review a KIP for > > > it, but like you mention, it seems like a pretty drastic change in scope > > > and I don't think it needs to be included in this proposal. > > > > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563 > > > > > > Cheers, > > > > > > Chris > > > > > > On Tue, Jun 11, 2024 at 11:42 AM Viktor Somogyi-Vass > > > <viktor.somo...@cloudera.com.invalid> wrote: > > > > > > > Hi Chris, > > > > > > > > I also have 2 other comments: > > > > > > > > 1. One more thing I came across is that should we provide the > > Retry-After > > > > header in the response in case of 503 response? Although I'm not sure > > how > > > > many clients honor this, we could add it just in case some does and if > > > you > > > > also find it useful. (We could default it to retry.backoff.ms.) > > > > > > > > 2. Adding to Adrian's comments, storing timestamped worker statuses in > > an > > > > internal topic and then reading them from there would add valuable info > > > > about what the worker does. For instance GET > > > /health?startTime=45345323346 > > > > could fetch events from the given timestamp additionally to your > > proposed > > > > behavior. Also, if the rest server isn't available, it would serve in > > > > itself as a log about the workers' behavior. I understand if you think > > > it's > > > > such a scope change that it should be an improvement KIP, but would > > like > > > to > > > > gauge what you think about this. > > > > > > > > Regards, > > > > Viktor > > > > > > > > On Tue, Jun 11, 2024 at 4:34 PM Chris Egerton <chr...@aiven.io.invalid > > > > > > > wrote: > > > > > > > > > Hi Adrian, > > > > > > > > > > Thanks for your comments/questions! The responses to them are related > > > so > > > > > I'll try to address both at once. > > > > > > > > > > The most recent update I made to the KIP should help provide insight > > > into > > > > > what's going wrong if a non-200 response is returned. I don't plan on > > > > > adding any structured data such as error codes or something like a > > > > "phase" > > > > > field with values like READING_CONFIG_TOPIC quite yet, but there is > > > room > > > > > for us to add human-readable information on the causes of failure in > > > the > > > > > "message" field (see KAFKA-15563 [1] and its PR [2] for an example of > > > > what > > > > > kind of information we might provide to users). Part of the problem > > is > > > > that > > > > > while I've heard plenty of (justified!) complaints about the Kafka > > > > Connect > > > > > REST API becoming unavailable and the difficulties users face with > > > > > debugging their workers when that happens, I still don't feel we > > have a > > > > > strong-enough grasp on the common causes for this scenario to commit > > > to a > > > > > response format that could be more machine-readable, and it can be > > > > > surprisingly difficult to get to a root cause in some cases. > > > > > > > > > > I'm anticipating that users will rely on the endpoint primarily for > > two > > > > > things: > > > > > 1) Ensuring that a worker has completed startup successfully during a > > > > > rolling upgrade (if you don't get a 200 after long enough, abort the > > > > > upgrade, check the error message, and start investigating) > > > > > 2) Ensuring that a worker remains healthy after it has joined the > > > cluster > > > > > (if you don't get a 200 for a sustained period of time, check the > > error > > > > > message, and then decide whether to restart the process or issue a > > > page) > > > > > > > > > > It's primarily designed to be easy to incorporate with automated > > > tooling > > > > > that has support for REST-based process health monitoring, while also > > > > > providing some human-readable information (when possible) if the > > worker > > > > > isn't healthy. This human-readable information should hopefully help > > > > people > > > > > gauge how to respond to non-200 responses, and we can try to improve > > > > > wording and granularity over time based on user feedback. You and > > other > > > > > users may develop automated responses based on the content of the > > error > > > > > messages, but beware that the wording may change across releases. > > > > > > > > > > Does that seem reasonable for V1 of this feature? I can definitely > > see > > > > room > > > > > for expansion of the response format in the future, but would like to > > > > hold > > > > > off on that for now. > > > > > > > > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563 > > > > > [2] - https://github.com/apache/kafka/pull/14562 > > > > > > > > > > Cheers, > > > > > > > > > > Chris > > > > > > > > > > On Tue, Jun 11, 2024 at 3:37 AM Adrian Preston <prest...@uk.ibm.com> > > > > > wrote: > > > > > > > > > > > Hi Chris, > > > > > > > > > > > > Good KIP – I think it will be very helpful in alerting and > > automating > > > > the > > > > > > resolution of common Connect problems. > > > > > > > > > > > > I have a couple of questions / suggestions: > > > > > > > > > > > > 1. What are you planning on documenting as guidance for using this > > > new > > > > > > endpoint? My guess would be that if Connect doesn’t return a status > > > of > > > > > 200 > > > > > > after some period I would either page someone, or restart the > > > process? > > > > > But > > > > > > I’m missing the nuance of distinguishing between readiness and > > > > liveness, > > > > > is > > > > > > this for maintaining overall availability when rolling out updates > > to > > > > > > several Connect processes? > > > > > > > > > > > > 2. Would you consider providing a way to discover details about > > > exactly > > > > > > which condition (or conditions) is/are failing? Perhaps by > > providing > > > a > > > > > > richer JSON response? Something like this would help us adopt the > > > > health > > > > > > check, as we could start by paging someone for all failures, then > > > over > > > > > time > > > > > > (as we gained confidence) transition more of the conditions over to > > > > being > > > > > > handled by automation. > > > > > > > > > > > > Regards, > > > > > > - Adrian > > > > > > > > > > > > > > > > > > From: Chris Egerton <chr...@aiven.io.INVALID> > > > > > > Date: Monday, 10 June 2024 at 15:26 > > > > > > To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > > Subject: [EXTERNAL] Re: [DISCUSS] KIP-1017: A health check endpoint > > > for > > > > > > Kafka Connect > > > > > > Hi all, > > > > > > > > > > > > Thanks for the positive feedback! > > > > > > > > > > > > I've made one small addition to the KIP since there's been a change > > > to > > > > > our > > > > > > REST timeout error messages that's worth incorporating here. > > Quoting > > > > the > > > > > > added section directly: > > > > > > > > > > > > > Note that the HTTP status codes and "status" fields in the JSON > > > > > response > > > > > > will match the exact examples above. However, the "message" field > > may > > > > be > > > > > > augmented to include, among other things, more information about > > the > > > > > > operation(s) the worker could be blocked on (such as was added in > > > REST > > > > > > timeout error messages in KAFKA-15563 [1]). > > > > > > > > > > > > Assuming this still looks okay to everyone, I'll kick off a vote > > > thread > > > > > > sometime this week or the next. > > > > > > > > > > > > [1] - https://issues.apache.org/jira/browse/KAFKA-15563 > > > > > > > > > > > > Cheers, > > > > > > > > > > > > Chris > > > > > > > > > > > > On Fri, Jun 7, 2024 at 12:01 PM Andrew Schofield < > > > > > > andrew_schofi...@live.com> > > > > > > wrote: > > > > > > > > > > > > > Hi Chris, > > > > > > > This KIP looks good to me. I particularly like the explanation of > > > how > > > > > the > > > > > > > result will specifically > > > > > > > check the worker health in ways that have previously caused > > > trouble. > > > > > > > > > > > > > > Thanks, > > > > > > > Andrew > > > > > > > > > > > > > > > On 7 Jun 2024, at 16:18, Mickael Maison < > > > mickael.mai...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > > Hi Chris, > > > > > > > > > > > > > > > > Happy Friday! The KIP looks good to me. +1 > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Mickael > > > > > > > > > > > > > > > > On Fri, Jan 26, 2024 at 8:41 PM Chris Egerton > > > > > <chr...@aiven.io.invalid > > > > > > > > > > > > > > wrote: > > > > > > > >> > > > > > > > >> Hi all, > > > > > > > >> > > > > > > > >> Happy Friday! I'd like to kick off discussion for KIP-1017, > > > which > > > > > (as > > > > > > > the > > > > > > > >> title suggests) proposes adding a health check endpoint for > > > Kafka > > > > > > > Connect: > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1017%3A+Health+check+endpoint+for+Kafka+Connect > > > > > > > >> > > > > > > > >> This is one of the longest-standing issues with Kafka Connect > > > and > > > > > I'm > > > > > > > >> hoping we can finally put it in the ground soon. Looking > > forward > > > > to > > > > > > > hearing > > > > > > > >> people's thoughts! > > > > > > > >> > > > > > > > >> Cheers, > > > > > > > >> > > > > > > > >> Chris > > > > > > > > > > > > > > > > > > > > > > > > > > Unless otherwise stated above: > > > > > > > > > > > > IBM United Kingdom Limited > > > > > > Registered in England and Wales with number 741598 > > > > > > Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 > > > 3AU > > > > > > > > > > > > > > > > > > > >