Re: [DISCUSS] PIP-172: Introduce the HEALTH_CHECK command in the binary protocol

Michael Marshall Thu, 30 Jun 2022 22:48:29 -0700

Thanks for your response Cong Zhao.

> I think don't need to handle it, because the client API should be consistent 
> with the `HEALTH_CHECK` command result, and users can retry it if they need.


I think I may be coming at this from a different perspective. A client
application does not need to know about the health of an individual
broker, so I don't think we should expose it explicitly through the
client API. A client's primary concern is whether it can send/receive
its messages, and a client's notion of cluster health should only be
based upon whether it can send/receive its messages.

> Also, for the main problem that this proposal wants to solve (how to check a 
> new cluster is healthy). Do you have another better idea?

What are the use cases where a client needs to know the cluster health
other than for auto-failover? The question of cluster health is
complicated, given that Pulsar is a distributed system, and I do not
view the broker health check as a reliable proxy for cluster health. I
view it as an indicator of a single broker's health at a single point
in time.

Thanks,
Michael


On Fri, Jun 24, 2022 at 4:43 AM Cong Zhao <zhaoc...@apache.org> wrote:
>
> Hi Michael,
>
> > I think the current PIP might need some clarification on how errors
> > are handled. For example, if a single broker fails to respond because
> > it was being restarted, how would the client handle that kind of
> > failure with this feature?
>
> I think don't need to handle it, because the client API should be consistent 
> with the `HEALTH_CHECK` command result, and users can retry it if they need.
>
> > I wasn't suggesting that the client would need to ask the broker for
> > each of the producers/consumers, but rather that the client would
> > monitor producers/consumers locally and make decisions about cluster
> > health. For example, if a producer cannot connect to its target topic
> > after some amount of time or some number of retries, or if a producer
> > can connect but cannot publish a message successfully within some
> > amount of time, then the client could consider the cluster to be
> > unhealthy.
> >
> > > This proposal mainly provides a means to check whether there is available 
> > > topic in the cluster, and I think this is meaningful in most cases.
> >
> > The client will discover if one of its targeted topics is unavailable,
> > so instead of monitoring the broker's health check topic, I think the
> > client should monitor/failover when a targeted topic is "unavailable"
> > for some configured length of time.
> >
> > I support making the auto-failover logic more robust, but I don't
> > think the broker health check is the right signal to use for overall
> > cluster health. In my view, the broker's health check is meant to
> > signal to orchestrators (like Kubernetes) when a broker ought to be
> > restarted.
>
> For the currently connected cluster, we really can't think the current topic 
> is unavailable just because the `HEALTH_CHECK` command result is unhealthy, 
> the current means for auto-failover are relatively rude. I think we improve 
> it by adding extra measures such as mentioned above to you, but this doesn't 
> fall within the scope of the proposal.
>
> Also, for the main problem that this proposal wants to solve (how to check a 
> new cluster is healthy). Do you have another better idea?
>
> Thanks,
> Cong Zhao
>
> On 2022/06/24 05:25:46 Michael Marshall wrote:
> > Thanks for your replies Cong Zhao.
> >
> > I think the current PIP might need some clarification on how errors
> > are handled. For example, if a single broker fails to respond because
> > it was being restarted, how would the client handle that kind of
> > failure with this feature?
> >
> > > This is a good definition of cluster health, but we can't check all 
> > > topics that would add a lot of load on cleint and broker.
> >
> > I wasn't suggesting that the client would need to ask the broker for
> > each of the producers/consumers, but rather that the client would
> > monitor producers/consumers locally and make decisions about cluster
> > health. For example, if a producer cannot connect to its target topic
> > after some amount of time or some number of retries, or if a producer
> > can connect but cannot publish a message successfully within some
> > amount of time, then the client could consider the cluster to be
> > unhealthy.
> >
> > > This proposal mainly provides a means to check whether there is available 
> > > topic in the cluster, and I think this is meaningful in most cases.
> >
> > The client will discover if one of its targeted topics is unavailable,
> > so instead of monitoring the broker's health check topic, I think the
> > client should monitor/failover when a targeted topic is "unavailable"
> > for some configured length of time.
> >
> > I support making the auto-failover logic more robust, but I don't
> > think the broker health check is the right signal to use for overall
> > cluster health. In my view, the broker's health check is meant to
> > signal to orchestrators (like Kubernetes) when a broker ought to be
> > restarted.
> >
> > Thanks,
> > Michael
> >
> >
> > On Thu, Jun 23, 2022 at 12:35 AM Cong Zhao <zhaoc...@apache.org> wrote:
> > >
> > > Hi Michael,
> > >
> > > Thanks for your feedback.
> > >
> > > > I define a client's primary cluster as "healthy" when it is "healthy"
> > > for all of its producers and consumers. I define a healthy producer as
> > > one that can connect to a topic and publish messages within certain
> > > latency and throughput thresholds (configured by the user), and I
> > > define a healthy consumer as one that can connect to a topic and
> > > consume messages when there are messages to be consumed (possibly
> > > within a certain latency?).
> > >
> > > This is a good definition of cluster health, but we can't check all 
> > > topics that would add a lot of load on cleint and broker.
> > >
> > > > By the above definitions, I don't think the broker's health check will
> > > give us the right notion of "healthy" because that health check
> > > monitors producing/consuming to/from the health check topic, not the
> > > client's target topics. One primary difference is that a health check
> > > topic could have a different persistence policy, which means the
> > > client could incorrectly classify the broker as healthy when there
> > > aren't enough available bookies for a producer's target topic.
> > >
> > > This proposal mainly provides a means to check whether there is available 
> > > topic in the cluster, and I think this is meaningful in most cases.
> > >
> > > I think if the client implementation doesn't meet the user's needs, they 
> > > can also override the `healthCheck` method based on the `HEALTH_CHECK` 
> > > command.
> > >
> > > Thanks,
> > > Cong Zhao
> > >
> > > On 2022/06/22 19:06:25 Michael Marshall wrote:
> > > > I'd like to clarify the motivation for this PIP. My understanding is
> > > > that the primary motivation is to give clients a robust way to
> > > > classify a cluster as "healthy". The initial beneficiary of this
> > > > feature is the auto failover use case. I think the feature makes
> > > > sense, but before using the broker's concept of "healthy" as defined
> > > > in the broker health check, I think we should define what constitutes
> > > > a "healthy cluster" from the client's perspective.
> > > >
> > > > I define a client's primary cluster as "healthy" when it is "healthy"
> > > > for all of its producers and consumers. I define a healthy producer as
> > > > one that can connect to a topic and publish messages within certain
> > > > latency and throughput thresholds (configured by the user), and I
> > > > define a healthy consumer as one that can connect to a topic and
> > > > consume messages when there are messages to be consumed (possibly
> > > > within a certain latency?).
> > > >
> > > > By the above definitions, I don't think the broker's health check will
> > > > give us the right notion of "healthy" because that health check
> > > > monitors producing/consuming to/from the health check topic, not the
> > > > client's target topics. One primary difference is that a health check
> > > > topic could have a different persistence policy, which means the
> > > > client could incorrectly classify the broker as healthy when there
> > > > aren't enough available bookies for a producer's target topic.
> > > >
> > > > The broker health check also includes checks that we probably don't
> > > > want to use to classify whole clusters as "unhealthy". For example, if
> > > > the broker is deadlocked, it will be considered unhealthy. In
> > > > Kubernetes, that broker will be restarted "soon", and the topics will
> > > > be scheduled to another broker. I probably wouldn't consider a
> > > > whole cluster as "unhealthy" because a single broker was deadlocked.
> > > > Instead, I'd consider a cluster unhealthy when latency/throughput are
> > > > not meeting expectations, which could happen because a broker is
> > > > deadlocked. Further, there is a chance that the deadlock in the broker
> > > > didn't affect the client's producers and consumers, which is yet
> > > > another reason not to failover to another cluster based on a failed
> > > > broker health check.
> > > >
> > > > I look forward to hearing your definitions of client health.
> > > >
> > > > Thanks,
> > > > Michael
> > > >
> > > >
> > > >
> > > > On Wed, Jun 22, 2022 at 8:30 AM Cong Zhao <zhaoc...@apache.org> wrote:
> > > > >
> > > > > Yes, there may have multiple clients request the HC at the same time 
> > > > > in the AutoFailover case, so we should add some cache to reduce 
> > > > > broker load.
> > > > >
> > > > > On 2022/06/22 12:55:49 Enrico Olivelli wrote:
> > > > > > Il giorno mer 22 giu 2022 alle ore 14:45 Cong Zhao
> > > > > > <zhaoc...@apache.org> ha scritto:
> > > > > > >
> > > > > > > Hi Enrico,
> > > > > > >
> > > > > > > > Also, I would like to understand in which usecase you can use 
> > > > > > > > the
> > > > > > > > binary endpoint and not the HTTP endpoint.
> > > > > > >
> > > > > > > We can't use the HTTP endpoint when the client did not have the 
> > > > > > > admin auth to do a health check. but we need it in some cases 
> > > > > > > such as auto failover on the client-side 
> > > > > > > (https://github.com/apache/pulsar/pull/13316#discussion_r773313991)
> > > > > > AutoFailover is a valid use case for me.
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > > Health Check is good for scripts and for probes, I don't expect 
> > > > > > > > a
> > > > > > > > "client application" to run the HC
> > > > > > >
> > > > > > > Adding a health check API to the client-side just to make it 
> > > > > > > easier to use this feature, this check still works on broker.
> > > > > >
> > > > > > makes sense, but usually the HC, like in k8s or in other 
> > > > > > environments
> > > > > > is run every X seconds and usually not concurrently
> > > > > >
> > > > > > if you have multiple (tens? hundreds?) of Pulsar clients that 
> > > > > > require
> > > > > > the HC, this will be a big problem,
> > > > > > is this the reason why you want to add some cache to the response 
> > > > > > of the HC ?
> > > > > >
> > > > > > Enrico
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 2022/06/22 10:19:52 Enrico Olivelli wrote:
> > > > > > > > I believe that this proposal is too broad.
> > > > > > > >
> > > > > > > > the PIP reads about:
> > > > > > > > - adding HEALTHCHECK to the binary protocol
> > > > > > > > - add a HEALTHCHECK cache on the broker
> > > > > > > >
> > > > > > > > Also, I would like to understand in which usecase you can use 
> > > > > > > > the
> > > > > > > > binary endpoint and not the HTTP endpoint.
> > > > > > > >
> > > > > > > > Health Check is good for scripts and for probes, I don't expect 
> > > > > > > > a
> > > > > > > > "client application" to run the HC
> > > > > > > >
> > > > > > > > Can you please illustrate some practical use cases?
> > > > > > > >
> > > > > > > > Enric
> > > > > > > >
> > > > > > > > Il giorno mer 8 giu 2022 alle ore 05:22 zhaocong 
> > > > > > > > <zhaoc...@apache.org>
> > > > > > > > ha scritto:
> > > > > > > > >
> > > > > > > > > Hello Pulsar Community,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Here is a PIP to introduce the HEALTH_CHECK command in the 
> > > > > > > > > binary protocol.
> > > > > > > > > I look forward to your feedback.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > PIP: https://github.com/apache/pulsar/issues/15859
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > >
> > > > > > > > > Cong Zhao
> > > > > > > >
> > > > > >
> > > >
> >

Re: [DISCUSS] PIP-172: Introduce the HEALTH_CHECK command in the binary protocol

Reply via email to