Hi Michael, Thanks for your feedback.
> I define a client's primary cluster as "healthy" when it is "healthy" for all of its producers and consumers. I define a healthy producer as one that can connect to a topic and publish messages within certain latency and throughput thresholds (configured by the user), and I define a healthy consumer as one that can connect to a topic and consume messages when there are messages to be consumed (possibly within a certain latency?).

This is a good definition of cluster health, but we can't check every topic, because that would add a lot of load on both the client and the broker.

> By the above definitions, I don't think the broker's health check will give us the right notion of "healthy" because that health check monitors producing/consuming to/from the health check topic, not the client's target topics. One primary difference is that a health check topic could have a different persistence policy, which means the client could incorrectly classify the broker as healthy when there aren't enough available bookies for a producer's target topic.

This proposal mainly provides a means to check whether there is an available topic in the cluster, and I think this is meaningful in most cases. If the client implementation doesn't meet the user's needs, they can also override the `healthCheck` method based on the `HEALTH_CHECK` command.

Thanks,
Cong Zhao

On 2022/06/22 19:06:25 Michael Marshall wrote:
> I'd like to clarify the motivation for this PIP. My understanding is
> that the primary motivation is to give clients a robust way to
> classify a cluster as "healthy". The initial beneficiary of this
> feature is the auto failover use case. I think the feature makes
> sense, but before using the broker's concept of "healthy" as defined
> in the broker health check, I think we should define what constitutes
> a "healthy cluster" from the client's perspective.
>
> I define a client's primary cluster as "healthy" when it is "healthy"
> for all of its producers and consumers. I define a healthy producer as
> one that can connect to a topic and publish messages within certain
> latency and throughput thresholds (configured by the user), and I
> define a healthy consumer as one that can connect to a topic and
> consume messages when there are messages to be consumed (possibly
> within a certain latency?).
>
> By the above definitions, I don't think the broker's health check will
> give us the right notion of "healthy" because that health check
> monitors producing/consuming to/from the health check topic, not the
> client's target topics. One primary difference is that a health check
> topic could have a different persistence policy, which means the
> client could incorrectly classify the broker as healthy when there
> aren't enough available bookies for a producer's target topic.
>
> The broker health check also includes checks that we probably don't
> want to use to classify whole clusters as "unhealthy". For example, if
> the broker is deadlocked, it will be considered unhealthy. In
> Kubernetes, that broker will be restarted "soon", and the topics will
> be scheduled to another broker. I probably wouldn't consider a
> whole cluster as "unhealthy" because a single broker was deadlocked.
> Instead, I'd consider a cluster unhealthy when latency/throughput are
> not meeting expectations, which could happen because a broker is
> deadlocked. Further, there is a chance that the deadlock in the broker
> didn't affect the client's producers and consumers, which is yet
> another reason not to failover to another cluster based on a failed
> broker health check.
>
> I look forward to hearing your definitions of client health.
>
> Thanks,
> Michael
>
>
> On Wed, Jun 22, 2022 at 8:30 AM Cong Zhao <zhaoc...@apache.org> wrote:
> >
> > Yes, there may be multiple clients requesting the HC at the same time in the
> > AutoFailover case, so we should add some cache to reduce broker load.
> >
> > On 2022/06/22 12:55:49 Enrico Olivelli wrote:
> > > On Wed, Jun 22, 2022 at 14:45, Cong Zhao
> > > <zhaoc...@apache.org> wrote:
> > > >
> > > > Hi Enrico,
> > > >
> > > > > Also, I would like to understand in which use case you can use the
> > > > > binary endpoint and not the HTTP endpoint.
> > > >
> > > > We can't use the HTTP endpoint when the client does not have the admin
> > > > auth to do a health check, but we need it in some cases, such as auto
> > > > failover on the client side
> > > > (https://github.com/apache/pulsar/pull/13316#discussion_r773313991)
> > > AutoFailover is a valid use case for me.
> > > Thanks
> > >
> > > >
> > > > > Health Check is good for scripts and for probes, I don't expect a
> > > > > "client application" to run the HC
> > > >
> > > > Adding a health check API to the client side is just to make it easier to
> > > > use this feature; this check still works on the broker.
> > >
> > > makes sense, but usually the HC, like in k8s or in other environments,
> > > is run every X seconds and usually not concurrently
> > >
> > > if you have multiple (tens? hundreds?) Pulsar clients that require
> > > the HC, this will be a big problem,
> > > is this the reason why you want to add some cache to the response of the
> > > HC?
> > >
> > > Enrico
> > >
> > > >
> > > > On 2022/06/22 10:19:52 Enrico Olivelli wrote:
> > > > > I believe that this proposal is too broad.
> > > > >
> > > > > the PIP reads about:
> > > > > - adding HEALTHCHECK to the binary protocol
> > > > > - adding a HEALTHCHECK cache on the broker
> > > > >
> > > > > Also, I would like to understand in which use case you can use the
> > > > > binary endpoint and not the HTTP endpoint.
> > > > >
> > > > > Health Check is good for scripts and for probes, I don't expect a
> > > > > "client application" to run the HC
> > > > >
> > > > > Can you please illustrate some practical use cases?
> > > > >
> > > > > Enrico
> > > > >
> > > > > On Wed, Jun 8, 2022 at 05:22, zhaocong <zhaoc...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > Hello Pulsar Community,
> > > > > >
> > > > > >
> > > > > > Here is a PIP to introduce the HEALTH_CHECK command in the binary
> > > > > > protocol.
> > > > > > I look forward to your feedback.
> > > > > >
> > > > > >
> > > > > > PIP: https://github.com/apache/pulsar/issues/15859
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Cong Zhao
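
The thread above suggests caching the health check result on the broker so that many clients polling at once do not each trigger a full check. A minimal sketch of that idea follows, under stated assumptions: the class name `CachedHealthCheck`, the TTL parameter, and the injected clock are hypothetical and are not part of Pulsar or of the PIP. It only illustrates time-based caching of a boolean probe result.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

/**
 * Hypothetical helper (not a Pulsar API): remembers the result of an
 * expensive health probe for ttlMillis so that repeated callers within
 * that window reuse the cached answer instead of re-running the probe.
 */
final class CachedHealthCheck {
    private final BooleanSupplier probe;  // the expensive check, e.g. produce/consume on a health check topic
    private final long ttlMillis;         // how long a result stays valid
    private final LongSupplier clock;     // injected time source in milliseconds, for testability

    private boolean hasResult = false;
    private boolean lastResult;
    private long lastCheckedAt;

    CachedHealthCheck(BooleanSupplier probe, long ttlMillis, LongSupplier clock) {
        this.probe = probe;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /**
     * Returns the cached result if it is younger than the TTL; otherwise
     * runs the probe once and caches the new result. The method is
     * synchronized, so concurrent callers in one window share one probe run.
     */
    synchronized boolean isHealthy() {
        long now = clock.getAsLong();
        if (!hasResult || now - lastCheckedAt >= ttlMillis) {
            lastResult = probe.getAsBoolean();
            lastCheckedAt = now;
            hasResult = true;
        }
        return lastResult;
    }
}
```

In this sketch a broker would construct one instance with something like `System::currentTimeMillis` as the clock and answer every incoming health check request from `isHealthy()`. The trade-off is that a client may see a result up to one TTL old, so the TTL would need to be shorter than the failover detection interval the client expects.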