Re: [DISCUSS] PIP-172: Introduce the HEALTH_CHECK command in the binary protocol

Cong Zhao Wed, 22 Jun 2022 22:35:46 -0700

Hi Michael,

Thanks for your feedback.


> I define a client's primary cluster as "healthy" when it is "healthy"
> for all of its producers and consumers. I define a healthy producer as
> one that can connect to a topic and publish messages within certain
> latency and throughput thresholds (configured by the user), and I
> define a healthy consumer as one that can connect to a topic and
> consume messages when there are messages to be consumed (possibly
> within a certain latency?).

This is a good definition of cluster health, but we can't check every topic: 
that would add a lot of load on both the client and the broker.

> By the above definitions, I don't think the broker's health check will
> give us the right notion of "healthy" because that health check
> monitors producing/consuming to/from the health check topic, not the
> client's target topics. One primary difference is that a health check
> topic could have a different persistence policy, which means the
> client could incorrectly classify the broker as healthy when there
> aren't enough available bookies for a producer's target topic.

This proposal mainly provides a means to check whether there is an available 
topic in the cluster, and I think this is meaningful in most cases.

I think if the client implementation doesn't meet the user's needs, they can 
also override the `healthCheck` method based on the `HEALTH_CHECK` command.
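
A rough sketch of the override idea, under loud assumptions: 
`ClusterHealthChecker`, `TargetTopicHealthChecker`, 
`send_health_check_command` and `probe` are hypothetical names for 
illustration, not Pulsar client APIs, and Python is used only for brevity. 
The base class trusts the broker's answer to `HEALTH_CHECK`; the override 
additionally requires that a user-supplied probe against the target topic 
finishes within a latency threshold.

```python
import time
from typing import Callable


class ClusterHealthChecker:
    """Hypothetical default: healthy if the broker answers HEALTH_CHECK with success."""

    def __init__(self, send_health_check_command: Callable[[], bool]):
        # Stand-in for sending the HEALTH_CHECK command over the binary protocol.
        self._send = send_health_check_command

    def health_check(self) -> bool:
        try:
            return self._send()
        except Exception:
            # Any failure to obtain an answer is treated as unhealthy.
            return False


class TargetTopicHealthChecker(ClusterHealthChecker):
    """Hypothetical override: also require a timely probe of the target topic."""

    def __init__(self, send_health_check_command: Callable[[], bool],
                 probe: Callable[[], None], max_latency_seconds: float):
        super().__init__(send_health_check_command)
        self._probe = probe  # e.g. publish one message to the target topic
        self._max_latency = max_latency_seconds

    def health_check(self) -> bool:
        # Start from the broker-level check, then apply the user's own criteria.
        if not super().health_check():
            return False
        start = time.monotonic()
        try:
            self._probe()
        except Exception:
            return False
        return time.monotonic() - start <= self._max_latency
```

The probe is passed in as a callable so that each user can decide what 
"healthy" means for their own producers and consumers.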

Thanks,
Cong Zhao

On 2022/06/22 19:06:25 Michael Marshall wrote:
> I'd like to clarify the motivation for this PIP. My understanding is
> that the primary motivation is to give clients a robust way to
> classify a cluster as "healthy". The initial beneficiary of this
> feature is the auto failover use case. I think the feature makes
> sense, but before using the broker's concept of "healthy" as defined
> in the broker health check, I think we should define what constitutes
> a "healthy cluster" from the client's perspective.
> 
> I define a client's primary cluster as "healthy" when it is "healthy"
> for all of its producers and consumers. I define a healthy producer as
> one that can connect to a topic and publish messages within certain
> latency and throughput thresholds (configured by the user), and I
> define a healthy consumer as one that can connect to a topic and
> consume messages when there are messages to be consumed (possibly
> within a certain latency?).
> 
> By the above definitions, I don't think the broker's health check will
> give us the right notion of "healthy" because that health check
> monitors producing/consuming to/from the health check topic, not the
> client's target topics. One primary difference is that a health check
> topic could have a different persistence policy, which means the
> client could incorrectly classify the broker as healthy when there
> aren't enough available bookies for a producer's target topic.
> 
> The broker health check also includes checks that we probably don't
> want to use to classify whole clusters as "unhealthy". For example, if
> the broker is deadlocked, it will be considered unhealthy. In
> Kubernetes, that broker will be restarted "soon", and the topics will
> be scheduled to another broker. I probably wouldn't consider a
> whole cluster as "unhealthy" because a single broker was deadlocked.
> Instead, I'd consider a cluster unhealthy when latency/throughput are
> not meeting expectations, which could happen because a broker is
> deadlocked. Further, there is a chance that the deadlock in the broker
> didn't affect the client's producers and consumers, which is yet
> another reason not to failover to another cluster based on a failed
> broker health check.
> 
> I look forward to hearing your definitions of client health.
> 
> Thanks,
> Michael
> 
> 
> 
> On Wed, Jun 22, 2022 at 8:30 AM Cong Zhao <zhaoc...@apache.org> wrote:
> >
> > Yes, multiple clients may request the HC at the same time in the 
> > AutoFailover case, so we should add a cache to reduce broker load.
> >
> > On 2022/06/22 12:55:49 Enrico Olivelli wrote:
> > > Il giorno mer 22 giu 2022 alle ore 14:45 Cong Zhao
> > > <zhaoc...@apache.org> ha scritto:
> > > >
> > > > Hi Enrico,
> > > >
> > > > > Also, I would like to understand in which use case you can use the
> > > > > binary endpoint and not the HTTP endpoint.
> > > >
> > > > We can't use the HTTP endpoint when the client does not have the admin 
> > > > auth to do a health check, but we need a health check in some cases, 
> > > > such as auto failover on the client side 
> > > > (https://github.com/apache/pulsar/pull/13316#discussion_r773313991)
> > > AutoFailover is a valid use case for me.
> > > Thanks
> > >
> > > >
> > > > > Health Check is good for scripts and for probes, I don't expect a
> > > > > "client application" to run the HC
> > > >
> > > > The health check API is added to the client side just to make this 
> > > > feature easier to use; the check itself still runs on the broker.
> > >
> > > makes sense, but usually the HC, like in k8s or in other environments
> > > is run every X seconds and usually not concurrently
> > >
> > > if you have multiple (tens? hundreds?) of Pulsar clients that require
> > > the HC, this will be a big problem,
> > > is this the reason why you want to add some cache to the response of the 
> > > HC ?
> > >
> > > Enrico
> > >
> > >
> > > >
> > > >
> > > > On 2022/06/22 10:19:52 Enrico Olivelli wrote:
> > > > > I believe that this proposal is too broad.
> > > > >
> > > > > the PIP reads about:
> > > > > - adding HEALTHCHECK to the binary protocol
> > > > > - add a HEALTHCHECK cache on the broker
> > > > >
> > > > > > Also, I would like to understand in which use case you can use the
> > > > > > binary endpoint and not the HTTP endpoint.
> > > > >
> > > > > Health Check is good for scripts and for probes, I don't expect a
> > > > > "client application" to run the HC
> > > > >
> > > > > Can you please illustrate some practical use cases?
> > > > >
> > > > > > Enrico
> > > > >
> > > > > Il giorno mer 8 giu 2022 alle ore 05:22 zhaocong <zhaoc...@apache.org>
> > > > > ha scritto:
> > > > > >
> > > > > > Hello Pulsar Community,
> > > > > >
> > > > > >
> > > > > > Here is a PIP to introduce the HEALTH_CHECK command in the binary 
> > > > > > protocol.
> > > > > > I look forward to your feedback.
> > > > > >
> > > > > >
> > > > > > PIP: https://github.com/apache/pulsar/issues/15859
> > > > > >
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Cong Zhao
> > > > >
> > >
>
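
The response cache discussed in the thread could look roughly like the 
following. This is a minimal sketch under assumptions, not the PIP's actual 
design: `CachedHealthCheck`, `run_check` and `ttl_seconds` are hypothetical 
names, and the thread does not say how long a result should be cached. The 
idea is that many clients asking at the same time share one result instead 
of each triggering a full health check on the broker.

```python
import threading
import time
from typing import Callable, Optional


class CachedHealthCheck:
    """Caches the result of an expensive health check for ttl_seconds."""

    def __init__(self, run_check: Callable[[], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._run_check = run_check  # the real (expensive) health check
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested
        self._lock = threading.Lock()
        self._result: Optional[bool] = None
        self._expires_at = 0.0

    def get(self) -> bool:
        # The lock is held while the check runs, so concurrent callers wait
        # for one shared result instead of each starting their own check.
        with self._lock:
            now = self._clock()
            if self._result is None or now >= self._expires_at:
                self._result = self._run_check()
                self._expires_at = now + self._ttl
            return self._result
```

With a TTL of a few seconds, hundreds of clients polling concurrently would 
cause at most one real check per TTL window, at the cost of a result that 
can be stale by up to that TTL.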
