Thanks for your replies Cong Zhao.

I think the current PIP might need some clarification on how errors
are handled. For example, if a single broker fails to respond because
it was being restarted, how would the client handle that kind of
failure with this feature?

> This is a good definition of cluster health, but we can't check all topics 
> that would add a lot of load on cleint and broker.

I wasn't suggesting that the client would need to ask the broker for
each of the producers/consumers, but rather that the client would
monitor producers/consumers locally and make decisions about cluster
health. For example, if a producer cannot connect to its target topic
after some amount of time or some number of retries, or if a producer
can connect but cannot publish a message successfully within some
amount of time, then the client could consider the cluster to be
unhealthy.

> This proposal mainly provides a means to check whether there is available 
> topic in the cluster, and I think this is meaningful in most cases.

The client will discover if one of its targeted topics is unavailable,
so instead of monitoring the broker's health check topic, I think the
client should monitor/failover when a targeted topic is "unavailable"
for some configured length of time.

I support making the auto-failover logic more robust, but I don't
think the broker health check is the right signal to use for overall
cluster health. In my view, the broker's health check is meant to
signal to orchestrators (like Kubernetes) when a broker ought to be
restarted.

Thanks,
Michael


On Thu, Jun 23, 2022 at 12:35 AM Cong Zhao <zhaoc...@apache.org> wrote:
>
> Hi Michael,
>
> Thanks for your feedback.
>
> > I define a client's primary cluster as "healthy" when it is "healthy"
> for all of its producers and consumers. I define a healthy producer as
> one that can connect to a topic and publish messages within certain
> latency and throughput thresholds (configured by the user), and I
> define a healthy consumer as one that can connect to a topic and
> consume messages when there are messages to be consumed (possibly
> within a certain latency?).
>
> This is a good definition of cluster health, but we can't check all topics 
> that would add a lot of load on cleint and broker.
>
> > By the above definitions, I don't think the broker's health check will
> give us the right notion of "healthy" because that health check
> monitors producing/consuming to/from the health check topic, not the
> client's target topics. One primary difference is that a health check
> topic could have a different persistence policy, which means the
> client could incorrectly classify the broker as healthy when there
> aren't enough available bookies for a producer's target topic.
>
> This proposal mainly provides a means to check whether there is available 
> topic in the cluster, and I think this is meaningful in most cases.
>
> I think if the client implementation doesn't meet the user's needs, they can 
> also override the `healthCheck` method based on the `HEALTH_CHECK` command.
>
> Thanks,
> Cong Zhao
>
> On 2022/06/22 19:06:25 Michael Marshall wrote:
> > I'd like to clarify the motivation for this PIP. My understanding is
> > that the primary motivation is to give clients a robust way to
> > classify a cluster as "healthy". The initial beneficiary of this
> > feature is the auto failover use case. I think the feature makes
> > sense, but before using the broker's concept of "healthy" as defined
> > in the broker health check, I think we should define what constitutes
> > a "healthy cluster" from the client's perspective.
> >
> > I define a client's primary cluster as "healthy" when it is "healthy"
> > for all of its producers and consumers. I define a healthy producer as
> > one that can connect to a topic and publish messages within certain
> > latency and throughput thresholds (configured by the user), and I
> > define a healthy consumer as one that can connect to a topic and
> > consume messages when there are messages to be consumed (possibly
> > within a certain latency?).
> >
> > By the above definitions, I don't think the broker's health check will
> > give us the right notion of "healthy" because that health check
> > monitors producing/consuming to/from the health check topic, not the
> > client's target topics. One primary difference is that a health check
> > topic could have a different persistence policy, which means the
> > client could incorrectly classify the broker as healthy when there
> > aren't enough available bookies for a producer's target topic.
> >
> > The broker health check also includes checks that we probably don't
> > want to use to classify whole clusters as "unhealthy". For example, if
> > the broker is deadlocked, it will be considered unhealthy. In
> > Kubernetes, that broker will be restarted "soon", and the topics will
> > be scheduled to another broker. I probably wouldn't consider a
> > whole cluster as "unhealthy" because a single broker was deadlocked.
> > Instead, I'd consider a cluster unhealthy when latency/throughput are
> > not meeting expectations, which could happen because a broker is
> > deadlocked. Further, there is a chance that the deadlock in the broker
> > didn't affect the client's producers and consumers, which is yet
> > another reason not to failover to another cluster based on a failed
> > broker health check.
> >
> > I look forward to hearing your definitions of client health.
> >
> > Thanks,
> > Michael
> >
> >
> >
> > On Wed, Jun 22, 2022 at 8:30 AM Cong Zhao <zhaoc...@apache.org> wrote:
> > >
> > > Yes, there may have multiple clients request the HC at the same time in 
> > > the AutoFailover case, so we should add some cache to reduce broker load.
> > >
> > > On 2022/06/22 12:55:49 Enrico Olivelli wrote:
> > > > Il giorno mer 22 giu 2022 alle ore 14:45 Cong Zhao
> > > > <zhaoc...@apache.org> ha scritto:
> > > > >
> > > > > Hi Enrico,
> > > > >
> > > > > > Also, I would like to understand in which usecase you can use the
> > > > > > binary endpoint and not the HTTP endpoint.
> > > > >
> > > > > We can't use the HTTP endpoint when the client did not have the admin 
> > > > > auth to do a health check. but we need it in some cases such as auto 
> > > > > failover on the client-side 
> > > > > (https://github.com/apache/pulsar/pull/13316#discussion_r773313991)
> > > > AutoFailover is a valid use case for me.
> > > > Thanks
> > > >
> > > > >
> > > > > > Health Check is good for scripts and for probes, I don't expect a
> > > > > > "client application" to run the HC
> > > > >
> > > > > Adding a health check API to the client-side just to make it easier 
> > > > > to use this feature, this check still works on broker.
> > > >
> > > > makes sense, but usually the HC, like in k8s or in other environments
> > > > is run every X seconds and usually not concurrently
> > > >
> > > > if you have multiple (tens? hundreds?) of Pulsar clients that require
> > > > the HC, this will be a big problem,
> > > > is this the reason why you want to add some cache to the response of 
> > > > the HC ?
> > > >
> > > > Enrico
> > > >
> > > >
> > > > >
> > > > >
> > > > > On 2022/06/22 10:19:52 Enrico Olivelli wrote:
> > > > > > I believe that this proposal is too broad.
> > > > > >
> > > > > > the PIP reads about:
> > > > > > - adding HEALTHCHECK to the binary protocol
> > > > > > - add a HEALTHCHECK cache on the broker
> > > > > >
> > > > > > Also, I would like to understand in which usecase you can use the
> > > > > > binary endpoint and not the HTTP endpoint.
> > > > > >
> > > > > > Health Check is good for scripts and for probes, I don't expect a
> > > > > > "client application" to run the HC
> > > > > >
> > > > > > Can you please illustrate some practical use cases?
> > > > > >
> > > > > > Enric
> > > > > >
> > > > > > Il giorno mer 8 giu 2022 alle ore 05:22 zhaocong 
> > > > > > <zhaoc...@apache.org>
> > > > > > ha scritto:
> > > > > > >
> > > > > > > Hello Pulsar Community,
> > > > > > >
> > > > > > >
> > > > > > > Here is a PIP to introduce the HEALTH_CHECK command in the binary 
> > > > > > > protocol.
> > > > > > > I look forward to your feedback.
> > > > > > >
> > > > > > >
> > > > > > > PIP: https://github.com/apache/pulsar/issues/15859
> > > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Cong Zhao
> > > > > >
> > > >
> >

Reply via email to