Thanks for the updated KIP! Some more comments inlined.

On Sun, Mar 7, 2021 at 6:43 PM Guoqiang Shu <shuguoqi...@gmail.com> wrote:
> Guozhang, many thanks for taking a look! Sorry for the late reply; we have
> iterated the prototype on our production setup with your questions in mind
> and updated the KIP correspondingly. Below are the answers and a summary of
> the updates.
>
> > 1) Why does the default implementation have a criterion of when it is
> > enabled -- i.e. after a certain number of messages have been successfully
> > sent -- instead of always enabled? Also how would this mechanism be
> > customizable in users' own implementations? The current interface APIs
> > seem not to allow such configurations.
>
> [GS] The enabling condition is there to prevent the undesirable effect of
> the 'ratio' when the volume is small, and the enabling condition can be
> customized in an implementation of the breaker. To that end, the
> ProducerCircuitBreaker interface now extends the Configurable interface, so
> it supports passing the initialization parameters of KafkaProducer during
> instantiation. The updated KIP contains more details.
>
> > 2) When does the "onError" function trigger? Is that triggered when the
> > response returned contains an error code (but maybe a retriable one, so
> > that the batch of records would be re-enqueued) or when the batch of
> > records is finally going to be dropped on the floor? If it is based on
> > the first case, then depending on the timeout configuration we may not
> > trigger the breaker in time, is that the case? Also I'm wondering if we
> > can allow this function to be triggered on other events, such as
> > inflight.requests exceeding the threshold as well.
>
> [GS] In our original prototype the 'onError' function triggers in the first
> case you described, but it only handles 'TimeoutException' and
> 'NetworkException'. Agreed, that may cause the muting to happen too late
> and in turn let the upstream buffer fill up. After more experiments, we
> updated the interface with two new callbacks:
> onSend(): passes the KafkaProducer's current network congestion state to
> the plug-in before a message is sent, allowing custom implementations to
> kick in based on network congestion (for example, inflight.requests).
> onComplete(): triggered either upon DeliveryTimeout/RequestTimeout or when
> the final result is processed by the sender after max retries is reached.
> As these two situations are complementary, we can prevent glitches (for
> example, when inflight.requests periodically falls below the threshold).
>
> > 3) How would users customize "how long would we mute a partition" and
> > "how many partitions of a topic in total could be muted at the same time"
> > from the interface?
>
> [GS] We believe the muting length is empirical; in our team we use the
> heuristic of the processing time of a single-node failure, hence the
> proposed 10-minute default.
> We believe it is reasonable to allow any number of partitions to be muted
> at the same time (i.e. there is no need for an upper limit). In the extreme
> case of a cluster-level failure, the user should be alerted to manually
> divert cluster-level traffic to other healthy clusters.

I'm still not sure whether, in your proposal, the muting length is a
customizable value (and if so, through which config) or whether it is always
hard coded as 10 minutes?

Guozhang
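The [GS] answers above describe the shape of the plug-in: a
ProducerCircuitBreaker that extends Configurable and exposes onSend() and
onComplete() callbacks. As a purely illustrative sketch (the method
signatures, parameters, and the circuit.breaker.mute.time.ms config key below
are assumptions, not the API defined in KIP-693), it could look roughly like
this:

    import org.apache.kafka.common.Configurable;
    import org.apache.kafka.common.TopicPartition;

    public interface ProducerCircuitBreaker extends Configurable {

        // Hypothetical config key; the thread leaves open whether the mute
        // length (10 minutes by default) is exposed as a configuration.
        String MUTE_TIME_MS_CONFIG = "circuit.breaker.mute.time.ms";

        // configure(Map<String, ?> configs) is inherited from Configurable
        // and receives the KafkaProducer configs at instantiation.

        // Called before a record is sent, with the current congestion signal
        // (e.g. the in-flight request count), so an implementation can trip
        // the breaker early.
        void onSend(TopicPartition partition, int inflightRequests);

        // Called on DeliveryTimeout/RequestTimeout, or once the final result
        // is known after retries are exhausted; error would be null on
        // success.
        void onComplete(TopicPartition partition, Exception error);

        // Whether the breaker currently mutes this partition.
        boolean isMuted(TopicPartition partition);
    }

Under a sketch like this, the 10-minute default would simply be the fallback
when the (hypothetical) mute-time config is not set, which is essentially the
question raised above.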
> > On Mon, Dec 14, 2020 at 5:42 PM Guoqiang Shu <shuguoqi...@gmail.com> wrote:
> >
> > > Hi Jun and Justine,
> > >
> > > Many thanks for taking a look at our proposal and for the pointer! We
> > > learned about the mechanism proposed to enhance the StickyPartitioner.
> > > Both methods aim to exclude brokers with transient errors and prevent
> > > cluster-wide failure. The difference lies in the criteria used to tell
> > > whether a broker is problematic: our KIP uses the error condition of
> > > the operation, whereas the heuristic in the StickyPartitioner relies on
> > > the internal state max.in.flight.requests.per.connection.
> > >
> > > IMHO, using the final result of the write operation makes the behavior
> > > simpler to reason about. It covers all error scenarios and potentially
> > > supports all implementations of Partitioner. In contrast, using
> > > intermediate state may trigger action prematurely, for example when
> > > max.in.flight.requests.per.connection reaches the threshold (due to a
> > > small linger.ms value) but the buffer on the producer side is still in
> > > a healthy state. In addition, in sequential mode
> > > max.in.flight.requests.per.connection is set to 1 and therefore cannot
> > > be leveraged.
> > >
> > > Finally, as Justine pointed out, having AvailablePartitions reflect
> > > broker status (as enabled in our KIP) will benefit optimizations of the
> > > Partitioner in general, so the two classes of enhancements can coexist.
> > >
> > > Cheers,
> > > //George//
> > >
> > > On 2020/12/08 18:19:43, Justine Olshan <jols...@confluent.io> wrote:
> > > > Hi George,
> > > > I've been looking at the discussion on improving the sticky
> > > > partitioner, and one of the potential issues we discussed is how we
> > > > could get information to the partitioner to tell it not to choose
> > > > certain partitions. Currently, the partitioner can only use
> > > > availablePartitionsForTopic. I took a quick look at your KIP and it
> > > > seemed that your KIP would change what partitions are returned with
> > > > this method. This seems like a step in the right direction for
> > > > solving that issue too.
> > > >
> > > > I agree with Jun that looking at both of these issues and the
> > > > proposed solutions would be very helpful.
> > > > Justine
> > > >
> > > > On Tue, Dec 8, 2020 at 10:07 AM Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, George,
> > > > >
> > > > > Thanks for submitting the KIP. There was an earlier discussion on
> > > > > improving the sticky partitioner in the producer (
> > > > > https://lists.apache.org/thread.html/rae8d2d5587dae57ad9093a85181e0cb4256f10d1e57138ecdb3ef287%40%3Cdev.kafka.apache.org%3E
> > > > > ).
> > > > > It seems to be solving a very similar issue. It would be useful to
> > > > > analyze both approaches and see which one solves the problem
> > > > > better.
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Dec 8, 2020 at 8:05 AM georgeshu(舒国强) <george...@tencent.com>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We wrote up a KIP based on a straightforward mechanism,
> > > > > > implemented and tested, in order to solve a practical issue in
> > > > > > production.
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-693%3A+Client-side+Circuit+Breaker+for+Partition+Write+Errors
> > > > > >
> > > > > > Look forward to hearing feedback and suggestions.
> > > > > >
> > > > > > Thanks!
> >
> > --
> > -- Guozhang

--
-- Guozhang
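As an illustration of the point Justine raises above: a partitioner that only
draws from availablePartitionsForTopic() would automatically steer around
whatever partitions KIP-693 removes from that list. The class below is a
hypothetical sketch built on the public Partitioner and Cluster APIs, not
code from the KIP:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.PartitionInfo;

    public class AvailableOnlyPartitioner implements Partitioner {

        @Override
        public void configure(Map<String, ?> configs) {
            // No configuration needed for this sketch.
        }

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            // Prefer partitions the cluster metadata reports as available.
            List<PartitionInfo> candidates = cluster.availablePartitionsForTopic(topic);
            if (candidates.isEmpty()) {
                // Fall back to the full list if nothing is reported available.
                candidates = cluster.partitionsForTopic(topic);
            }
            if (candidates.isEmpty()) {
                throw new IllegalStateException("No partition metadata for topic " + topic);
            }
            int index = ThreadLocalRandom.current().nextInt(candidates.size());
            return candidates.get(index).partition();
        }

        @Override
        public void close() {
            // Nothing to clean up.
        }
    }

The fallback to the full partition list only guards against the corner case
where every partition of a topic is reported unavailable; whether the breaker
should ever mute all partitions of a topic is exactly the trade-off discussed
in the answer to question 3 above.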