Re: [DISCUSS] PIP-172: Introduce the HEALTH_CHECK command in the binary protocol

Michael Marshall Wed, 22 Jun 2022 12:07:08 -0700

I'd like to clarify the motivation for this PIP. My understanding is
that the primary motivation is to give clients a robust way to
classify a cluster as "healthy". The initial beneficiary of this
feature is the auto failover use case. I think the feature makes
sense, but before using the broker's concept of "healthy" as defined
in the broker health check, I think we should define what constitutes
a "healthy cluster" from the client's perspective.

I define a client's primary cluster as "healthy" when it is "healthy"
for all of its producers and consumers. I define a healthy producer as
one that can connect to a topic and publish messages within certain
latency and throughput thresholds (configured by the user), and I
define a healthy consumer as one that can connect to a topic and
consume messages when there are messages to be consumed (possibly
within a certain latency?).

By the above definitions, I don't think the broker's health check will
give us the right notion of "healthy" because that health check
monitors producing/consuming to/from the health check topic, not the
client's target topics. One primary difference is that a health check
topic could have a different persistence policy, which means the
client could incorrectly classify the broker as healthy when there
aren't enough available bookies for a producer's target topic.

The broker health check also includes checks that we probably don't
want to use to classify whole clusters as "unhealthy". For example, if
the broker is deadlocked, it will be considered unhealthy. In
Kubernetes, that broker will be restarted "soon", and the topics will
be scheduled to another broker. I probably wouldn't consider a
whole cluster as "unhealthy" because a single broker was deadlocked.
Instead, I'd consider a cluster unhealthy when latency/throughput are
not meeting expectations, which could happen because a broker is
deadlocked. Further, there is a chance that the deadlock in the broker
didn't affect the client's producers and consumers, which is yet
another reason not to failover to another cluster based on a failed
broker health check.

I look forward to hearing your definitions of client health.

Thanks,
Michael

On Wed, Jun 22, 2022 at 8:30 AM Cong Zhao <[email protected]> wrote:
>
> Yes, there may have multiple clients request the HC at the same time in the 
> AutoFailover case, so we should add some cache to reduce broker load.
>
> On 2022/06/22 12:55:49 Enrico Olivelli wrote:
> > Il giorno mer 22 giu 2022 alle ore 14:45 Cong Zhao
> > <[email protected]> ha scritto:
> > >
> > > Hi Enrico,
> > >
> > > > Also, I would like to understand in which usecase you can use the
> > > > binary endpoint and not the HTTP endpoint.
> > >
> > > We can't use the HTTP endpoint when the client did not have the admin 
> > > auth to do a health check. but we need it in some cases such as auto 
> > > failover on the client-side 
> > > (https://github.com/apache/pulsar/pull/13316#discussion_r773313991)
> > AutoFailover is a valid use case for me.
> > Thanks
> >
> > >
> > > > Health Check is good for scripts and for probes, I don't expect a
> > > > "client application" to run the HC
> > >
> > > Adding a health check API to the client-side just to make it easier to 
> > > use this feature, this check still works on broker.
> >
> > makes sense, but usually the HC, like in k8s or in other environments
> > is run every X seconds and usually not concurrently
> >
> > if you have multiple (tens? hundreds?) of Pulsar clients that require
> > the HC, this will be a big problem,
> > is this the reason why you want to add some cache to the response of the HC 
> > ?
> >
> > Enrico
> >
> >
> > >
> > >
> > > On 2022/06/22 10:19:52 Enrico Olivelli wrote:
> > > > I believe that this proposal is too broad.
> > > >
> > > > the PIP reads about:
> > > > - adding HEALTHCHECK to the binary protocol
> > > > - add a HEALTHCHECK cache on the broker
> > > >
> > > > Also, I would like to understand in which usecase you can use the
> > > > binary endpoint and not the HTTP endpoint.
> > > >
> > > > Health Check is good for scripts and for probes, I don't expect a
> > > > "client application" to run the HC
> > > >
> > > > Can you please illustrate some practical use cases?
> > > >
> > > > Enric
> > > >
> > > > Il giorno mer 8 giu 2022 alle ore 05:22 zhaocong <[email protected]>
> > > > ha scritto:
> > > > >
> > > > > Hello Pulsar Community,
> > > > >
> > > > >
> > > > > Here is a PIP to introduce the HEALTH_CHECK command in the binary 
> > > > > protocol.
> > > > > I look forward to your feedback.
> > > > >
> > > > >
> > > > > PIP: https://github.com/apache/pulsar/issues/15859
> > > > >
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Cong Zhao
> > > >
> >

Re: [DISCUSS] PIP-172: Introduce the HEALTH_CHECK command in the binary protocol

Reply via email to