Hi, George,

A few more comments on the KIP.

1. It would be useful to motivate the problem a bit more. For example, is the KIP trying to solve a transient broker problem (if so, for how long) or a permanent broker problem? It would also be useful to list some common causes that can slow the broker down.

2. It would be useful to discuss the high-level approach a bit more (e.g. in the rejected alternatives section). This KIP proposes to fix the issue on the client side by having a pluggable component redirect the traffic to other brokers. One potential issue with this is that, assuming the plugin is not the default, all clients have to opt in to see the benefit. In some environments with a large number of clients, coordinating all those clients may not be easy. Another potential solution is to fix the issue on the server side. For example, if a broker is slow because it has noisy neighbors in a virtual environment, we could proactively bring down the broker and restart it somewhere else. This has the benefit that it requires less client-side coordination.

3. Regarding how to detect broker slowness in the client: the proposal is based on the error in the produce response. Typically, if the broker is just slow, the only type of error the client gets is the timeout exception. Since the default timeout is 30 seconds, it may not be triggered all the time, and by the time it is triggered it may be too late to reflect a broker-side issue. I am wondering if there are other, better indicators. For example, another potential option is to use the number of pending batches per partition (or broker) in the Accumulator. Intuitively, if a broker is slow, all partitions with the leader on it will gradually accumulate more batches.

4. It would be useful to have a solution that works with keyed messages so that they can still be distributed to partitions based on the hash of the key.
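
To make point 3 a bit more concrete, here is a rough sketch of what a pending-batch indicator could look like. Everything in it is an assumption for illustration: SlowBrokerDetector, slowBrokers, factor and minBatches are hypothetical names, not part of the producer API or of the KIP, and the per-broker counts are assumed to have been collected from the Accumulator by some other means. Comparing against the median is just one possible heuristic; it avoids flagging anything when all brokers are equally backed up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, NOT part of the Kafka producer API or of the KIP.
final class SlowBrokerDetector {
    private SlowBrokerDetector() {}

    /**
     * Returns the ids of brokers whose pending batch count is at least
     * minBatches and more than factor times the median across brokers.
     *
     * pendingBatchesPerBroker maps broker id to the number of batches
     * currently queued for partitions led by that broker (assumed input).
     */
    static Set<Integer> slowBrokers(Map<Integer, Integer> pendingBatchesPerBroker,
                                    double factor,
                                    int minBatches) {
        Set<Integer> slow = new TreeSet<>();
        if (pendingBatchesPerBroker.isEmpty()) {
            return slow;
        }

        // Median of the per-broker counts (upper median for even sizes).
        List<Integer> counts = new ArrayList<>(pendingBatchesPerBroker.values());
        Collections.sort(counts);
        int median = counts.get(counts.size() / 2);

        for (Map.Entry<Integer, Integer> entry : pendingBatchesPerBroker.entrySet()) {
            int pending = entry.getValue();
            // The absolute floor avoids flagging brokers when queues are tiny;
            // the relative check avoids flagging everyone under uniform load.
            if (pending >= minBatches && pending > factor * median) {
                slow.add(entry.getKey());
            }
        }
        return slow;
    }
}
```

With counts of 2, 3 and 40 for brokers 1, 2 and 3, a factor of 3.0 and a floor of 10, only broker 3 would be flagged. If all three brokers had 20 pending batches, none would be, since that looks like overall backpressure rather than a single slow broker.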

Thanks,

Jun

On Wed, Mar 24, 2021 at 4:05 AM Guoqiang Shu <shuguoqi...@gmail.com> wrote:

> In our current proposal it can be configured via
> producer.circuit.breaker.mute.retry.interval (defaulted to 10 mins), but
> perhaps 'interval' is a confusing name.
>
> On 2021/03/23 00:45:23, Guozhang Wang <wangg...@gmail.com> wrote:
> > Thanks for the updated KIP! Some more comments inlined.
> >
> > I'm still not sure if, in your proposal, the muting length is a
> > customizable value (and if yes, through which config) or it is always
> > hard coded as 10 minutes?
> >
> > Guozhang