Thanks for your response Cong Zhao. > I think don't need to handle it, because the client API should be consistent > with the `HEALTH_CHECK` command result, and users can retry it if they need.
I think I may be coming at this from a different perspective. A client application does not need to know about the health of an individual broker, so I don't think we should expose it explicitly through the client API. A client's primary concern is whether it can send/receive its messages, and a client's notion of cluster health should only be based upon whether it can send/receive its messages. > Also, for the main problem that this proposal wants to solve (how to check a > new cluster is healthy). Do you have another better idea? What are the use cases where a client needs to know the cluster health other than for auto-failover? The question of cluster health is complicated, given that Pulsar is a distributed system, and I do not view the broker health check as a reliable proxy for cluster health. I view it as an indicator of a single broker's health at a single point in time. Thanks, Michael On Fri, Jun 24, 2022 at 4:43 AM Cong Zhao <zhaoc...@apache.org> wrote: > > Hi Michael, > > > I think the current PIP might need some clarification on how errors > > are handled. For example, if a single broker fails to respond because > > it was being restarted, how would the client handle that kind of > > failure with this feature? > > I think don't need to handle it, because the client API should be consistent > with the `HEALTH_CHECK` command result, and users can retry it if they need. > > > I wasn't suggesting that the client would need to ask the broker for > > each of the producers/consumers, but rather that the client would > > monitor producers/consumers locally and make decisions about cluster > > health. For example, if a producer cannot connect to its target topic > > after some amount of time or some number of retries, or if a producer > > can connect but cannot publish a message successfully within some > > amount of time, then the client could consider the cluster to be > > unhealthy. > > > > > This proposal mainly provides a means to check whether there is available > > > topic in the cluster, and I think this is meaningful in most cases. > > > > The client will discover if one of its targeted topics is unavailable, > > so instead of monitoring the broker's health check topic, I think the > > client should monitor/failover when a targeted topic is "unavailable" > > for some configured length of time. > > > > I support making the auto-failover logic more robust, but I don't > > think the broker health check is the right signal to use for overall > > cluster health. In my view, the broker's health check is meant to > > signal to orchestrators (like Kubernetes) when a broker ought to be > > restarted. > > For the currently connected cluster, we really can't think the current topic > is unavailable just because the `HEALTH_CHECK` command result is unhealthy, > the current means for auto-failover are relatively rude. I think we improve > it by adding extra measures such as mentioned above to you, but this doesn't > fall within the scope of the proposal. > > Also, for the main problem that this proposal wants to solve (how to check a > new cluster is healthy). Do you have another better idea? > > Thanks, > Cong Zhao > > On 2022/06/24 05:25:46 Michael Marshall wrote: > > Thanks for your replies Cong Zhao. > > > > I think the current PIP might need some clarification on how errors > > are handled. For example, if a single broker fails to respond because > > it was being restarted, how would the client handle that kind of > > failure with this feature? > > > > > This is a good definition of cluster health, but we can't check all > > > topics that would add a lot of load on cleint and broker. > > > > I wasn't suggesting that the client would need to ask the broker for > > each of the producers/consumers, but rather that the client would > > monitor producers/consumers locally and make decisions about cluster > > health. For example, if a producer cannot connect to its target topic > > after some amount of time or some number of retries, or if a producer > > can connect but cannot publish a message successfully within some > > amount of time, then the client could consider the cluster to be > > unhealthy. > > > > > This proposal mainly provides a means to check whether there is available > > > topic in the cluster, and I think this is meaningful in most cases. > > > > The client will discover if one of its targeted topics is unavailable, > > so instead of monitoring the broker's health check topic, I think the > > client should monitor/failover when a targeted topic is "unavailable" > > for some configured length of time. > > > > I support making the auto-failover logic more robust, but I don't > > think the broker health check is the right signal to use for overall > > cluster health. In my view, the broker's health check is meant to > > signal to orchestrators (like Kubernetes) when a broker ought to be > > restarted. > > > > Thanks, > > Michael > > > > > > On Thu, Jun 23, 2022 at 12:35 AM Cong Zhao <zhaoc...@apache.org> wrote: > > > > > > Hi Michael, > > > > > > Thanks for your feedback. > > > > > > > I define a client's primary cluster as "healthy" when it is "healthy" > > > for all of its producers and consumers. I define a healthy producer as > > > one that can connect to a topic and publish messages within certain > > > latency and throughput thresholds (configured by the user), and I > > > define a healthy consumer as one that can connect to a topic and > > > consume messages when there are messages to be consumed (possibly > > > within a certain latency?). > > > > > > This is a good definition of cluster health, but we can't check all > > > topics that would add a lot of load on cleint and broker. > > > > > > > By the above definitions, I don't think the broker's health check will > > > give us the right notion of "healthy" because that health check > > > monitors producing/consuming to/from the health check topic, not the > > > client's target topics. One primary difference is that a health check > > > topic could have a different persistence policy, which means the > > > client could incorrectly classify the broker as healthy when there > > > aren't enough available bookies for a producer's target topic. > > > > > > This proposal mainly provides a means to check whether there is available > > > topic in the cluster, and I think this is meaningful in most cases. > > > > > > I think if the client implementation doesn't meet the user's needs, they > > > can also override the `healthCheck` method based on the `HEALTH_CHECK` > > > command. > > > > > > Thanks, > > > Cong Zhao > > > > > > On 2022/06/22 19:06:25 Michael Marshall wrote: > > > > I'd like to clarify the motivation for this PIP. My understanding is > > > > that the primary motivation is to give clients a robust way to > > > > classify a cluster as "healthy". The initial beneficiary of this > > > > feature is the auto failover use case. I think the feature makes > > > > sense, but before using the broker's concept of "healthy" as defined > > > > in the broker health check, I think we should define what constitutes > > > > a "healthy cluster" from the client's perspective. > > > > > > > > I define a client's primary cluster as "healthy" when it is "healthy" > > > > for all of its producers and consumers. I define a healthy producer as > > > > one that can connect to a topic and publish messages within certain > > > > latency and throughput thresholds (configured by the user), and I > > > > define a healthy consumer as one that can connect to a topic and > > > > consume messages when there are messages to be consumed (possibly > > > > within a certain latency?). > > > > > > > > By the above definitions, I don't think the broker's health check will > > > > give us the right notion of "healthy" because that health check > > > > monitors producing/consuming to/from the health check topic, not the > > > > client's target topics. One primary difference is that a health check > > > > topic could have a different persistence policy, which means the > > > > client could incorrectly classify the broker as healthy when there > > > > aren't enough available bookies for a producer's target topic. > > > > > > > > The broker health check also includes checks that we probably don't > > > > want to use to classify whole clusters as "unhealthy". For example, if > > > > the broker is deadlocked, it will be considered unhealthy. In > > > > Kubernetes, that broker will be restarted "soon", and the topics will > > > > be scheduled to another broker. I probably wouldn't consider a > > > > whole cluster as "unhealthy" because a single broker was deadlocked. > > > > Instead, I'd consider a cluster unhealthy when latency/throughput are > > > > not meeting expectations, which could happen because a broker is > > > > deadlocked. Further, there is a chance that the deadlock in the broker > > > > didn't affect the client's producers and consumers, which is yet > > > > another reason not to failover to another cluster based on a failed > > > > broker health check. > > > > > > > > I look forward to hearing your definitions of client health. > > > > > > > > Thanks, > > > > Michael > > > > > > > > > > > > > > > > On Wed, Jun 22, 2022 at 8:30 AM Cong Zhao <zhaoc...@apache.org> wrote: > > > > > > > > > > Yes, there may have multiple clients request the HC at the same time > > > > > in the AutoFailover case, so we should add some cache to reduce > > > > > broker load. > > > > > > > > > > On 2022/06/22 12:55:49 Enrico Olivelli wrote: > > > > > > Il giorno mer 22 giu 2022 alle ore 14:45 Cong Zhao > > > > > > <zhaoc...@apache.org> ha scritto: > > > > > > > > > > > > > > Hi Enrico, > > > > > > > > > > > > > > > Also, I would like to understand in which usecase you can use > > > > > > > > the > > > > > > > > binary endpoint and not the HTTP endpoint. > > > > > > > > > > > > > > We can't use the HTTP endpoint when the client did not have the > > > > > > > admin auth to do a health check. but we need it in some cases > > > > > > > such as auto failover on the client-side > > > > > > > (https://github.com/apache/pulsar/pull/13316#discussion_r773313991) > > > > > > AutoFailover is a valid use case for me. > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > Health Check is good for scripts and for probes, I don't expect > > > > > > > > a > > > > > > > > "client application" to run the HC > > > > > > > > > > > > > > Adding a health check API to the client-side just to make it > > > > > > > easier to use this feature, this check still works on broker. > > > > > > > > > > > > makes sense, but usually the HC, like in k8s or in other > > > > > > environments > > > > > > is run every X seconds and usually not concurrently > > > > > > > > > > > > if you have multiple (tens? hundreds?) of Pulsar clients that > > > > > > require > > > > > > the HC, this will be a big problem, > > > > > > is this the reason why you want to add some cache to the response > > > > > > of the HC ? > > > > > > > > > > > > Enrico > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 2022/06/22 10:19:52 Enrico Olivelli wrote: > > > > > > > > I believe that this proposal is too broad. > > > > > > > > > > > > > > > > the PIP reads about: > > > > > > > > - adding HEALTHCHECK to the binary protocol > > > > > > > > - add a HEALTHCHECK cache on the broker > > > > > > > > > > > > > > > > Also, I would like to understand in which usecase you can use > > > > > > > > the > > > > > > > > binary endpoint and not the HTTP endpoint. > > > > > > > > > > > > > > > > Health Check is good for scripts and for probes, I don't expect > > > > > > > > a > > > > > > > > "client application" to run the HC > > > > > > > > > > > > > > > > Can you please illustrate some practical use cases? > > > > > > > > > > > > > > > > Enric > > > > > > > > > > > > > > > > Il giorno mer 8 giu 2022 alle ore 05:22 zhaocong > > > > > > > > <zhaoc...@apache.org> > > > > > > > > ha scritto: > > > > > > > > > > > > > > > > > > Hello Pulsar Community, > > > > > > > > > > > > > > > > > > > > > > > > > > > Here is a PIP to introduce the HEALTH_CHECK command in the > > > > > > > > > binary protocol. > > > > > > > > > I look forward to your feedback. > > > > > > > > > > > > > > > > > > > > > > > > > > > PIP: https://github.com/apache/pulsar/issues/15859 > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > Cong Zhao > > > > > > > > > > > > > > > > > > > >