Guozhang, many thanks for taking a look! Sorry for the late reply, we have
iterated the prototype on our production setup with your question in mind and
updated the KIP correspondingly. Below are the answers and the summary of the
updates
>
> 1) Why does the default implementation have a criterion of when it is
> enabled -- i.e. after a certain number of messages have been successfully
> sent -- instead of always enabled? Also how would this mechanism be
> customizable in user's own implementations? The current interface APIs seem
> not allow such configurations.
>
[GS] The enabling condition is to prevent undesirable effect of ‘ratio' when
the volume is small. And we should allow the enabling condition to be customize
in an implementation of break. To that end, the ProducerCircuitBreaker
interface now extends Configurable interface, so it supports passing the
initialization parameters of KafkaProducer during instantiation. The updated
KIP contains more details.
> 2) When does the "onError" function trigger? Is that triggered when the
> response returned contains an error code (but maybe a retriable one, so
> that the batch of records would be re-enqueued) or when the batch of
> records are finally going to be dropped on the floor? If it is based on the
> first case, then depending on the timeout configuration we may not trigger
> the breaker in time, is that the case? Also I'm wondering if we can allow
> this function to be triggered on other events, such as inflight.requests
> exceeding the threshold as well.
>
[GS] In our original prototype the ‘onError’ function triggers in the first
case you described, but it only handles ’TimeoutException’ and
’NetworkException’. Agree it may cause the muting to be not in time and further
the upstream buffer to be filled up. With more experiments, we updated the
interface with two new callbacks:
onSend(): pass the current KafkaProducer's current network congestion to the
plug-in interface before sending the message, allowing custom implementations
to kick in based on network congestion (for example, inflight.requests).
onComplete() is triggered either upon DeliveryTimeout/RequestTimeout or when
the final result is processed by the sender after max retries is reached. As
these two situations are complementary we can prevent glitches (for example,
when inflight.requests periodically falls below the threshold).
> 3) How would users customize "how long would we mute a partition" and "how
> many partitions of a topic in total could be muted at the same time" from
> the interface?
>
>
[GS] We believe the muting length is empirical and in our team we use the
heuristic of process time of single node failure - hence the proposed 10
minutes default.
We believe it is reasonable to allow any numbers of partitions to be muted at
the same time (i.e. no need for the upper limit. In the extreme case of a
cluster-level failure, the user should be alerted to manually divert
cluster-level traffic to other normal clusters.
> Guozhang
>
>
> On Mon, Dec 14, 2020 at 5:42 PM Guoqiang Shu <shuguoqi...@gmail.com> wrote:
>
> >
> > Hi Jun and Justin,
> >
> > Many thanks for taking a look at our proposal and for the pointer! We
> > learned about the mechanism proposed to enhance StickyPartitioner. Both
> > methods aim to exclude brokers with transient errors and prevent cluster
> > wide failure. The difference lies in the criteria used to tell if a broker
> > is problematic: our KIP uses the error condition for the operation; and the
> > heuristic in StickyPartitioner relies on the internal state
> > max.in.flight.requests.per.connection.
> >
> > IMHO, using final result of write operation makes the behavior simpler to
> > reason about. It covers all error scenarios and potentially supports all
> > implementations of Partitioner. In contrast, using intermediate state may
> > trigger action prematurely, for example, when
> > max.in.flight.requests.per.connection reaches the threshold (due to small
> > linger.ms value) but the buffer in producer side is still in healthy
> > state. In addition, in sequential mode
> > max.in.flight.requests.per.connection is set to 1 therefore cannot be
> > leveraged.
> >
> > Finally, as Justine pointed out, having AvailablePartitions reflects
> > broker status (as enabled in our KIP) will benefit optimizations of the
> > Partitioner in general, so the two classes of enhancements can coexist.
> >
> > Cheers,
> > //George//
> >
> > On 2020/12/08 18:19:43, Justine Olshan <jols...@confluent.io> wrote:
> > > Hi George,
> > > I've been looking at the discussion on improving the sticky partitioner,
> > > and one of the potential issues we discussed is how we could get
> > > information to the partitioner to tell it not to choose certain
> > partitions.
> > > Currently, the partitioner can only use availablePartitionsForTopic. I
> > took
> > > a quick look at your KIP and it seemed that your KIP would change what
> > > partitions are returned with this method. This seems like a step in the
> > > right direction for solving that issue too.
> > >
> > > I agree with Jun that looking at both of these issues and the proposed
> > > solutions would be very helpful.
> > > Justine
> > >
> > > On Tue, Dec 8, 2020 at 10:07 AM Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hi, George,
> > > >
> > > > Thanks for submitting the KIP. There was an earlier discussing on
> > improving
> > > > the sticky partitioner in the producer (
> > > >
> > > >
> > https://lists.apache.org/thread.html/rae8d2d5587dae57ad9093a85181e0cb4256f10d1e57138ecdb3ef287%40%3Cdev.kafka.apache.org%3E
> > > > ).
> > > > It seems to be solving a very similar issue. It would be useful to
> > analyze
> > > > both approaches and see which one solves the problem better.
> > > >
> > > > Jun
> > > >
> > > > On Tue, Dec 8, 2020 at 8:05 AM georgeshu(舒国强) <george...@tencent.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > We write up a KIP based on a straightforward mechanism implemented
> > and
> > > > > tested in order to solve a practical issue in production.
> > > > >
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-693%3A+Client-side+Circuit+Breaker+for+Partition+Write+Errors
> > > > > Look forward to hearing feedback and suggestions.
> > > > >
> > > > > Thanks!
> > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> -- Guozhang
>