I'd like to clarify the motivation for this PIP. My understanding is that the primary motivation is to give clients a robust way to classify a cluster as "healthy". The initial beneficiary of this feature is the auto failover use case. I think the feature makes sense, but before using the broker's concept of "healthy" as defined in the broker health check, I think we should define what constitutes a "healthy cluster" from the client's perspective.
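To make the client-side view concrete, here is a rough Python sketch of the definitions given in the next paragraph. It is only an illustration under stated assumptions: `ProducerStats`, `ConsumerStats`, `Thresholds`, and the three functions are hypothetical names, not Pulsar client APIs, and the stats are assumed to be collected by the application itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProducerStats:
    """Hypothetical per-producer measurements gathered by the application."""
    connected: bool
    publish_latency_ms: float   # observed publish latency
    throughput_msgs_s: float    # observed publish rate


@dataclass
class ConsumerStats:
    """Hypothetical per-consumer measurements gathered by the application."""
    connected: bool
    backlog: int                # messages waiting to be consumed
    receive_rate_msgs_s: float  # observed consume rate


@dataclass
class Thresholds:
    """User-configured limits (assumed, not an existing Pulsar setting)."""
    max_publish_latency_ms: float
    min_throughput_msgs_s: float


def producer_healthy(p: ProducerStats, t: Thresholds) -> bool:
    # Healthy: connected and publishing within the latency/throughput thresholds.
    return (
        p.connected
        and p.publish_latency_ms <= t.max_publish_latency_ms
        and p.throughput_msgs_s >= t.min_throughput_msgs_s
    )


def consumer_healthy(c: ConsumerStats) -> bool:
    # Healthy: connected and, if there is a backlog, actually receiving messages.
    return c.connected and (c.backlog == 0 or c.receive_rate_msgs_s > 0)


def cluster_healthy(
    producers: Iterable[ProducerStats],
    consumers: Iterable[ConsumerStats],
    t: Thresholds,
) -> bool:
    # The cluster is healthy only if it is healthy for every producer and consumer.
    return all(producer_healthy(p, t) for p in producers) and all(
        consumer_healthy(c) for c in consumers
    )
```

Note that nothing in this sketch asks the broker for its own health status; the decision is based purely on what the client's producers and consumers observe on their target topics.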
I define a client's primary cluster as "healthy" when it is "healthy" for all of its producers and consumers. I define a healthy producer as one that can connect to a topic and publish messages within certain latency and throughput thresholds (configured by the user), and I define a healthy consumer as one that can connect to a topic and consume messages when there are messages to be consumed (possibly within a certain latency?).

By the above definitions, I don't think the broker's health check will give us the right notion of "healthy", because that health check monitors producing to and consuming from the health check topic, not the client's target topics. One primary difference is that a health check topic could have a different persistence policy, which means the client could incorrectly classify the broker as healthy when there aren't enough available bookies for a producer's target topic.

The broker health check also includes checks that we probably don't want to use to classify whole clusters as "unhealthy". For example, if the broker is deadlocked, it will be considered unhealthy. In Kubernetes, that broker will be restarted "soon", and the topics will be scheduled to another broker. I probably wouldn't consider a whole cluster "unhealthy" because a single broker was deadlocked. Instead, I'd consider a cluster unhealthy when latency or throughput is not meeting expectations, which could happen because a broker is deadlocked. Further, there is a chance that the deadlock in the broker didn't affect the client's producers and consumers, which is yet another reason not to fail over to another cluster based on a failed broker health check.

I look forward to hearing your definitions of client health.

Thanks,
Michael

On Wed, Jun 22, 2022 at 8:30 AM Cong Zhao <zhaoc...@apache.org> wrote:
>
> Yes, there may be multiple clients requesting the HC at the same time in the
> AutoFailover case, so we should add some cache to reduce broker load.
>
> On 2022/06/22 12:55:49 Enrico Olivelli wrote:
> > On Wed, Jun 22, 2022 at 14:45, Cong Zhao
> > <zhaoc...@apache.org> wrote:
> > >
> > > Hi Enrico,
> > >
> > > > Also, I would like to understand in which usecase you can use the
> > > > binary endpoint and not the HTTP endpoint.
> > >
> > > We can't use the HTTP endpoint when the client did not have the admin
> > > auth to do a health check. but we need it in some cases such as auto
> > > failover on the client-side
> > > (https://github.com/apache/pulsar/pull/13316#discussion_r773313991)
> >
> > AutoFailover is a valid use case for me.
> > Thanks
> >
> > >
> > > > Health Check is good for scripts and for probes, I don't expect a
> > > > "client application" to run the HC
> > >
> > > Adding a health check API to the client-side just to make it easier to
> > > use this feature, this check still works on broker.
> >
> > makes sense, but usually the HC, like in k8s or in other environments
> > is run every X seconds and usually not concurrently
> >
> > if you have multiple (tens? hundreds?) of Pulsar clients that require
> > the HC, this will be a big problem,
> > is this the reason why you want to add some cache to the response of the HC
> > ?
> >
> > Enrico
> >
> > >
> > >
> > > On 2022/06/22 10:19:52 Enrico Olivelli wrote:
> > > > I believe that this proposal is too broad.
> > > >
> > > > the PIP reads about:
> > > > - adding HEALTHCHECK to the binary protocol
> > > > - add a HEALTHCHECK cache on the broker
> > > >
> > > > Also, I would like to understand in which usecase you can use the
> > > > binary endpoint and not the HTTP endpoint.
> > > >
> > > > Health Check is good for scripts and for probes, I don't expect a
> > > > "client application" to run the HC
> > > >
> > > > Can you please illustrate some practical use cases?
> > > >
> > > > Enrico
> > > >
> > > > On Wed, Jun 8, 2022 at 05:22, zhaocong <zhaoc...@apache.org>
> > > > wrote:
> > > > >
> > > > > Hello Pulsar Community,
> > > > >
> > > > >
> > > > > Here is a PIP to introduce the HEALTH_CHECK command in the binary
> > > > > protocol.
> > > > > I look forward to your feedback.
> > > > >
> > > > >
> > > > > PIP: https://github.com/apache/pulsar/issues/15859
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Cong Zhao
> > > >
> >