Hello Guoqiang,

This is another interesting ticket that may also be related to the issues you observed and fixed in your production environment, if you used the sticky partitioner in your producer clients:
https://issues.apache.org/jira/browse/KAFKA-10888

Guozhang

On Wed, Apr 7, 2021 at 11:00 AM Jun Rao <j...@confluent.io.invalid> wrote:

> Hi, George,
>
> A few more comments on the KIP.
>
> 1. It would be useful to motivate the problem a bit more. For example, is
> the KIP trying to solve a transient broker problem (if so, for how long) or
> a permanent broker problem? It would also be useful to list some common
> causes that can slow the broker down.
>
> 2. It would be useful to discuss a bit more on the high level approach
> (e.g. in the rejected section). This KIP proposes to fix the issue on the
> client side by having a pluggable component to redirect the traffic to
> other brokers. One potential issue with this is that it requires all
> clients to opt in (assuming this is not the default) for the plugin to see
> the benefit. In some environments with a large number of clients,
> coordinating all those clients may not be easy. Another potential solution
> is to fix the issue on the server side. For example, if a broker is slow
> because it has noisy neighbors in a virtual environment, we could
> proactively bring down the broker and restart it somewhere else. This has
> the benefit that it requires less client side coordination.
>
> 3. Regarding how to detect broker slowness in the client. The proposal is
> based on the error in the produce response. Typically, if the broker is
> just slow, the only type of error the client gets is the timeout exception.
> Since the default timeout is 30 seconds, it may not be triggered all the
> time and it may be too late to reflect a broker side issue. I am wondering
> if there are other better indicators. For example, another potential option
> is to use the number of pending batches per partition (or broker) in the
> Accumulator. Intuitively, if a broker is slow, all partitions with the
> leader on it will gradually accumulate more batches.
>
> 4. It would be useful to have a solution that works with keyed messages so
> that they can still be distributed to the partition based on the hash of
> the key.
>
> Thanks,
>
> Jun
>
>
> On Wed, Mar 24, 2021 at 4:05 AM Guoqiang Shu <shuguoqi...@gmail.com> wrote:
>
> > In our current proposal it can be configured via
> > producer.circuit.breaker.mute.retry.interval (defaulted to 10 mins), but
> > perhaps 'interval' is a confusing name.
> >
> > On 2021/03/23 00:45:23, Guozhang Wang <wangg...@gmail.com> wrote:
> > > Thanks for the updated KIP! Some more comments inlined.
> > >
> > > I'm still not sure if, in your proposal, the muting length is a
> > > customizable value (and if yes, through which config) or it is always
> > > hard coded as 10 minutes?
> > >
> > > Guozhang

--
-- Guozhang
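
For reference, the indicator suggested in Jun's point 3 (pending batches per broker, rather than timeout errors) could be sketched roughly as below. This is only an illustration under stated assumptions: `PendingBatchTracker`, `slowFactor`, and `minPending` are hypothetical names, the class is not part of the Kafka client or the KIP, and it does not touch the real producer accumulator. It just counts queued and completed batches per broker id and flags a broker whose backlog is well above the average of the others.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Kafka API): tracks pending batches per broker
// and flags a broker whose backlog is far above that of its peers.
// Not thread-safe; a real producer would need synchronization.
class PendingBatchTracker {
    private final Map<Integer, Integer> pendingByBroker = new HashMap<>();
    private final double slowFactor; // assumed multiplier over the peer average
    private final int minPending;    // assumed floor, to ignore tiny backlogs

    PendingBatchTracker(double slowFactor, int minPending) {
        this.slowFactor = slowFactor;
        this.minPending = minPending;
    }

    // Call when a batch is queued for a partition whose leader is brokerId.
    void batchQueued(int brokerId) {
        pendingByBroker.merge(brokerId, 1, Integer::sum);
    }

    // Call when a batch for brokerId is acknowledged or failed.
    void batchCompleted(int brokerId) {
        pendingByBroker.computeIfPresent(brokerId, (id, n) -> Math.max(0, n - 1));
    }

    int pending(int brokerId) {
        return pendingByBroker.getOrDefault(brokerId, 0);
    }

    // A broker is considered slow when its backlog is at least minPending
    // and exceeds slowFactor times the mean backlog of the other known brokers.
    boolean isSlow(int brokerId) {
        int mine = pending(brokerId);
        if (mine < minPending) {
            return false;
        }
        int othersTotal = 0;
        int othersCount = 0;
        for (Map.Entry<Integer, Integer> e : pendingByBroker.entrySet()) {
            if (e.getKey() != brokerId) {
                othersTotal += e.getValue();
                othersCount++;
            }
        }
        if (othersCount == 0) {
            return false; // no baseline to compare against
        }
        double mean = othersTotal / (double) othersCount;
        return mine > slowFactor * mean;
    }
}
```

A relative threshold is used here because an absolute one would also fire when the whole cluster is under load; whether that is the right trade-off for the KIP is an open question, and the approach does not by itself address the keyed-message concern in point 4.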