Re: [DISCUSS] KIP-693: Client-side Circuit Breaker for Partition Write Errors

Guozhang Wang Mon, 12 Apr 2021 10:23:56 -0700

Hello Guoqiang,

Here is another interesting ticket that may also be related to the issues
you observed and fixed in production, if you use the sticky partitioner in
your producer clients:


https://issues.apache.org/jira/browse/KAFKA-10888


Guozhang


On Wed, Apr 7, 2021 at 11:00 AM Jun Rao <j...@confluent.io.invalid> wrote:

> Hi, George,
>
> A few more comments on the KIP.
>
> 1. It would be useful to motivate the problem a bit more. For example, is
> the KIP trying to solve a transient broker problem (if so, for how long) or
> a permanent broker problem? It would also be useful to list some common
> causes that can slow the broker down.
>
> 2. It would be useful to discuss the high-level approach a bit more
> (e.g. in the Rejected Alternatives section). This KIP proposes to fix the
> issue on the client side by having a pluggable component redirect the
> traffic to other brokers. One potential issue with this is that all
> clients must opt in to the plugin (assuming it is not the default) to see
> the benefit. In some environments with a large number of clients,
> coordinating all those clients may not be easy. Another potential solution
> is to fix the issue on the server side. For example, if a broker is slow
> because it has noisy neighbors in a virtual environment, we could
> proactively bring down the broker and restart it somewhere else. This has
> the benefit that it requires less client side coordination.
>
> 3. Regarding how to detect broker slowness in the client: the proposal is
> based on the error in the produce response. Typically, if the broker is
> just slow, the only type of error the client gets is the timeout exception.
> Since the default timeout is 30 seconds, it may not always be triggered,
> and when it is, it may come too late to reflect a broker-side issue. I am
> wondering if there are better indicators. For example, another option
> is to use the number of pending batches per partition (or broker) in the
> Accumulator. Intuitively, if a broker is slow, all partitions with the
> leader on it will gradually accumulate more batches.
>
> 4. It would be useful to have a solution that works with keyed messages so
> that they can still be distributed to the partition based on the hash of
> the key.
>
> Thanks,
>
> Jun
>
>
> On Wed, Mar 24, 2021 at 4:05 AM Guoqiang Shu <shuguoqi...@gmail.com>
> wrote:
>
> >
> > In our current proposal it can be configured via
> > producer.circuit.breaker.mute.retry.interval (which defaults to 10
> > minutes), but perhaps 'interval' is a confusing name.
> >
> > On 2021/03/23 00:45:23, Guozhang Wang <wangg...@gmail.com> wrote:
> > > Thanks for the updated KIP! Some more comments inlined.
> > > >
> > > > I'm still not sure if, in your proposal, the muting length is a
> > > customizable value (and if yes, through which config) or it is always
> > hard
> > > coded as 10 minutes?
> > >
> > >
> > > > > Guozhang
> >
> >
>

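To make Jun's point 3 above a bit more concrete, here is a minimal sketch of
what a backlog-based indicator could look like. It is only an illustration,
not producer code: the function name, the `ratio` and `min_batches`
parameters, and the idea of a per-broker map of pending batch counts are all
assumptions made for this example, and the real Accumulator does not expose
such a map today. The sketch flags a broker when its backlog is well above the
median backlog across brokers.

```python
from statistics import median
from typing import Dict, List


def find_slow_brokers(pending_batches: Dict[int, int],
                      ratio: float = 3.0,
                      min_batches: int = 10) -> List[int]:
    """Return ids of brokers whose backlog is far above the typical backlog.

    pending_batches maps a broker id to the number of batches currently
    queued for partitions led by that broker. This input is hypothetical;
    it stands in for whatever the producer could derive from its
    accumulator. ratio and min_batches are illustrative thresholds.
    """
    if not pending_batches:
        return []

    # The median is used as the "typical" backlog so that a single slow
    # broker does not inflate the baseline it is compared against.
    typical = median(pending_batches.values())

    slow = []
    for broker_id, count in sorted(pending_batches.items()):
        # Ignore small backlogs so a lightly loaded cluster never trips
        # the check, then compare against the typical backlog.
        if count >= min_batches and count > ratio * max(typical, 1):
            slow.append(broker_id)
    return slow
```

With backlogs of 4, 5 and 40 batches on brokers 1, 2 and 3, only broker 3
would be flagged. A relative threshold like this reacts before the request
timeout fires, but it cannot tell a slow broker from one that simply leads
hotter partitions, so it would likely need to be combined with other signals.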

-- 
-- Guozhang
