Re: [DISCUSS] KIP-693: Client-side Circuit Breaker for Partition Write Errors

Jun Rao Wed, 07 Apr 2021 11:00:18 -0700

Hi, George,

A few more comments on the KIP.

1. It would be useful to motivate the problem a bit more. For example, is
the KIP trying to solve a transient broker problem (if so, for how long) or
a permanent broker problem? It would also be useful to list some common
causes that can slow the broker down.

2. It would be useful to discuss the high-level approach a bit more
(e.g. in the rejected alternatives section). This KIP proposes to fix the
issue on the client side by having a pluggable component redirect the
traffic to other brokers. One potential issue with this is that all clients
must opt in to the plugin (assuming it is not the default) to see
the benefit. In some environments with a large number of clients,
coordinating all those clients may not be easy. Another potential solution
is to fix the issue on the server side. For example, if a broker is slow
because it has noisy neighbors in a virtual environment, we could
proactively bring down the broker and restart it somewhere else. This has
the benefit that it requires less client side coordination.

3. Regarding how to detect broker slowness in the client: the proposal is
based on the error in the produce response. Typically, if the broker is
just slow, the only type of error the client gets is a timeout exception.
Since the default request timeout (request.timeout.ms) is 30 seconds, it
may not always be triggered, and when it is, it may be too late to reflect
a broker-side issue. I am wondering
if there are better indicators. For example, another potential option
is to use the number of pending batches per partition (or broker) in the
Accumulator. Intuitively, if a broker is slow, all partitions with the
leader on it will gradually accumulate more batches.
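
A minimal sketch of the pending-batch indicator described in point 3,
assuming the producer can observe when a batch is queued and when it
completes. PendingBatchMonitor, its methods, and the fixed threshold are
hypothetical names for illustration only; none of them are part of the
Kafka client API or of the KIP.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper (not a Kafka class): counts in-flight batches per
// leader broker and flags brokers whose backlog exceeds a threshold.
class PendingBatchMonitor {
    private final Map<Integer, Integer> pendingByBroker = new HashMap<>();
    private final int threshold; // assumed tuning knob, not a real config

    PendingBatchMonitor(int threshold) {
        this.threshold = threshold;
    }

    // Call when a batch is queued for a partition led by brokerId.
    void batchQueued(int brokerId) {
        pendingByBroker.merge(brokerId, 1, Integer::sum);
    }

    // Call when a batch for brokerId completes (acknowledged or failed).
    void batchCompleted(int brokerId) {
        // Returning null removes the entry once the count reaches zero.
        pendingByBroker.computeIfPresent(brokerId, (id, n) -> n > 1 ? n - 1 : null);
    }

    int pending(int brokerId) {
        return pendingByBroker.getOrDefault(brokerId, 0);
    }

    // Brokers with more pending batches than the threshold.
    Set<Integer> suspectedSlowBrokers() {
        Set<Integer> slow = new TreeSet<>();
        for (Map.Entry<Integer, Integer> e : pendingByBroker.entrySet()) {
            if (e.getValue() > threshold) {
                slow.add(e.getKey());
            }
        }
        return slow;
    }
}
```

A fixed threshold is the simplest choice; comparing each broker's backlog
against the average across brokers might separate one slow broker from
general producer overload, but that is a further assumption.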

4. It would be useful to have a solution that works with keyed messages so
that they can still be distributed to the partition based on the hash of
the key.
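
To illustrate the constraint in point 4, here is a rough sketch of
hash-based partition selection with a deterministic fallback when the home
partition is muted. This is only one possible approach and an assumption
on my part, not something the KIP specifies. KeyedPartitionSketch and the
muted set are hypothetical, and Arrays.hashCode stands in for the murmur2
hash used by Kafka's default partitioner.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical sketch (not a Kafka class): keyed partition selection that
// skips muted partitions in a deterministic order.
class KeyedPartitionSketch {

    // Stand-in hash; Kafka's default partitioner uses murmur2 instead.
    static int hashPartition(byte[] key, int numPartitions) {
        return (Arrays.hashCode(key) & 0x7fffffff) % numPartitions;
    }

    // Start at the key's home partition and probe forward until a
    // partition that is not muted is found. Producers that agree on the
    // muted set pick the same fallback for the same key.
    static int partitionFor(byte[] key, int numPartitions, Set<Integer> muted) {
        int home = hashPartition(key, numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            int candidate = (home + i) % numPartitions;
            if (!muted.contains(candidate)) {
                return candidate;
            }
        }
        return home; // every partition is muted: keep the home partition
    }
}
```

Note that any fallback still moves a key to a different partition while
its home partition is muted, so per-key ordering across the switch is not
preserved. That trade-off would need to be addressed in the KIP.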

Thanks,

Jun

On Wed, Mar 24, 2021 at 4:05 AM Guoqiang Shu <shuguoqi...@gmail.com> wrote:

>
> In our current proposal it can be configured via
> producer.circuit.breaker.mute.retry.interval (defaulted to 10 mins), but
> perhaps 'interval' is a confusing name.
>
> On 2021/03/23 00:45:23, Guozhang Wang <wangg...@gmail.com> wrote:
> > Thanks for the updated KIP! Some more comments inlined.
> >
> > I'm still not sure if, in your proposal, the muting length is a
> > customizable value (and if yes, through which config) or it is always
> > hard coded as 10 minutes?
> >
> > Guozhang
