Thanks for the updated KIP! Some more comments inlined.

On Sun, Mar 7, 2021 at 6:43 PM Guoqiang Shu <shuguoqi...@gmail.com> wrote:
> Guozhang, many thanks for taking a look! Sorry for the late reply; we have
> iterated the prototype on our production setup with your questions in mind
> and updated the KIP correspondingly. Below are the answers and a summary of
> the updates.
>
> > 1) Why does the default implementation have a criterion of when it is
> > enabled -- i.e. after a certain number of messages have been successfully
> > sent -- instead of always enabled? Also how would this mechanism be
> > customizable in users' own implementations? The current interface APIs
> > seem not to allow such configurations.
>
> [GS] The enabling condition is there to prevent the undesirable effect of
> the 'ratio' when the volume is small, and the enabling condition can be
> customized in an implementation of the breaker. To that end, the
> ProducerCircuitBreaker interface now extends the Configurable interface, so
> it supports passing the initialization parameters of KafkaProducer during
> instantiation. The updated KIP contains more details.
>
> > 2) When does the "onError" function trigger? Is that triggered when the
> > response returned contains an error code (but maybe a retriable one, so
> > that the batch of records would be re-enqueued) or when the batch of
> > records is finally going to be dropped on the floor? If it is based on
> > the first case, then depending on the timeout configuration we may not
> > trigger the breaker in time, is that the case? Also I'm wondering if we
> > can allow this function to be triggered on other events, such as
> > inflight.requests exceeding the threshold as well.
>
> [GS] In our original prototype the 'onError' function triggers in the first
> case you described, but it only handles 'TimeoutException' and
> 'NetworkException'. Agreed, that may cause the muting to happen too late
> and in turn let the upstream buffer fill up. After more experiments, we
> updated the interface with two new callbacks:
> onSend(): passes the KafkaProducer's current network congestion state to
> the plug-in before a message is sent, allowing custom implementations to
> kick in based on network congestion (for example, inflight.requests).
> onComplete(): triggered either upon DeliveryTimeout/RequestTimeout or when
> the final result is processed by the sender after max retries is reached.
> As these two situations are complementary, we can prevent glitches (for
> example, when inflight.requests periodically falls below the threshold).
>
> > 3) How would users customize "how long would we mute a partition" and
> > "how many partitions of a topic in total could be muted at the same time"
> > from the interface?
>
> [GS] We believe the muting length is empirical; in our team we use the
> heuristic of the processing time of a single-node failure, hence the
> proposed 10-minute default.
> We believe it is reasonable to allow any number of partitions to be muted
> at the same time (i.e. there is no need for an upper limit). In the extreme
> case of a cluster-level failure, the user should be alerted to manually
> divert cluster-level traffic to other healthy clusters.

I'm still not sure whether, in your proposal, the muting length is a
customizable value (and if so, through which config) or whether it is always
hard coded as 10 minutes?

Guozhang
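The [GS] answers above describe the shape of the plug-in: a
ProducerCircuitBreaker that extends Configurable and exposes onSend() and
onComplete() callbacks. As a purely illustrative sketch (the method
signatures, parameters, and the circuit.breaker.mute.time.ms config key below
are assumptions, not the API defined in KIP-693), it could look roughly like
this:

    import org.apache.kafka.common.Configurable;
    import org.apache.kafka.common.TopicPartition;

    public interface ProducerCircuitBreaker extends Configurable {

        // Hypothetical config key; the thread leaves open whether the mute
        // length (10 minutes by default) is exposed as a configuration.
        String MUTE_TIME_MS_CONFIG = "circuit.breaker.mute.time.ms";

        // configure(Map<String, ?> configs) is inherited from Configurable
        // and receives the KafkaProducer configs at instantiation.

        // Called before a record is sent, with the current congestion signal
        // (e.g. the in-flight request count), so an implementation can trip
        // the breaker early.
        void onSend(TopicPartition partition, int inflightRequests);

        // Called on DeliveryTimeout/RequestTimeout, or once the final result
        // is known after retries are exhausted; error would be null on
        // success.
        void onComplete(TopicPartition partition, Exception error);

        // Whether the breaker currently mutes this partition.
        boolean isMuted(TopicPartition partition);
    }

Under a sketch like this, the 10-minute default would simply be the fallback
when the (hypothetical) mute-time config is not set, which is essentially the
question raised above.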
> > On Mon, Dec 14, 2020 at 5:42 PM Guoqiang Shu <shuguoqi...@gmail.com> wrote:
> >
> > > Hi Jun and Justine,
> > >
> > > Many thanks for taking a look at our proposal and for the pointer! We
> > > learned about the mechanism proposed to enhance the StickyPartitioner.
> > > Both methods aim to exclude brokers with transient errors and prevent
> > > cluster-wide failure. The difference lies in the criteria used to tell
> > > whether a broker is problematic: our KIP uses the error condition of
> > > the operation, whereas the heuristic in the StickyPartitioner relies on
> > > the internal state max.in.flight.requests.per.connection.
> > >
> > > IMHO, using the final result of the write operation makes the behavior
> > > simpler to reason about. It covers all error scenarios and potentially
> > > supports all implementations of Partitioner. In contrast, using
> > > intermediate state may trigger action prematurely, for example when
> > > max.in.flight.requests.per.connection reaches the threshold (due to a
> > > small linger.ms value) but the buffer on the producer side is still in
> > > a healthy state. In addition, in sequential mode
> > > max.in.flight.requests.per.connection is set to 1 and therefore cannot
> > > be leveraged.
> > >
> > > Finally, as Justine pointed out, having AvailablePartitions reflect
> > > broker status (as enabled in our KIP) will benefit optimizations of the
> > > Partitioner in general, so the two classes of enhancements can coexist.
> > >
> > > Cheers,
> > > //George//
> > >
> > > On 2020/12/08 18:19:43, Justine Olshan <jols...@confluent.io> wrote:
> > > > Hi George,
> > > > I've been looking at the discussion on improving the sticky
> > > > partitioner, and one of the potential issues we discussed is how we
> > > > could get information to the partitioner to tell it not to choose
> > > > certain partitions. Currently, the partitioner can only use
> > > > availablePartitionsForTopic. I took a quick look at your KIP and it
> > > > seemed that your KIP would change what partitions are returned with
> > > > this method. This seems like a step in the right direction for
> > > > solving that issue too.
> > > >
> > > > I agree with Jun that looking at both of these issues and the
> > > > proposed solutions would be very helpful.
> > > > Justine
> > > >
> > > > On Tue, Dec 8, 2020 at 10:07 AM Jun Rao <j...@confluent.io> wrote:
> > > >
> > > > > Hi, George,
> > > > >
> > > > > Thanks for submitting the KIP. There was an earlier discussion on
> > > > > improving the sticky partitioner in the producer (
> > > > > https://lists.apache.org/thread.html/rae8d2d5587dae57ad9093a85181e0cb4256f10d1e57138ecdb3ef287%40%3Cdev.kafka.apache.org%3E
> > > > > ).
> > > > > It seems to be solving a very similar issue. It would be useful to
> > > > > analyze both approaches and see which one solves the problem
> > > > > better.
> > > > >
> > > > > Jun
> > > > >
> > > > > On Tue, Dec 8, 2020 at 8:05 AM georgeshu(舒国强) <george...@tencent.com>
> > > > > wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We wrote up a KIP based on a straightforward mechanism,
> > > > > > implemented and tested, in order to solve a practical issue in
> > > > > > production.
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-693%3A+Client-side+Circuit+Breaker+for+Partition+Write+Errors
> > > > > >
> > > > > > Look forward to hearing feedback and suggestions.
> > > > > >
> > > > > > Thanks!
> >
> > --
> > -- Guozhang

--
-- Guozhang
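As an illustration of the point Justine raises above: a partitioner that only
draws from availablePartitionsForTopic() would automatically steer around
whatever partitions KIP-693 removes from that list. The class below is a
hypothetical sketch built on the public Partitioner and Cluster APIs, not
code from the KIP:

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;
    import org.apache.kafka.common.PartitionInfo;

    public class AvailableOnlyPartitioner implements Partitioner {

        @Override
        public void configure(Map<String, ?> configs) {
            // No configuration needed for this sketch.
        }

        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            // Prefer partitions the cluster metadata reports as available.
            List<PartitionInfo> candidates = cluster.availablePartitionsForTopic(topic);
            if (candidates.isEmpty()) {
                // Fall back to the full list if nothing is reported available.
                candidates = cluster.partitionsForTopic(topic);
            }
            if (candidates.isEmpty()) {
                throw new IllegalStateException("No partition metadata for topic " + topic);
            }
            int index = ThreadLocalRandom.current().nextInt(candidates.size());
            return candidates.get(index).partition();
        }

        @Override
        public void close() {
            // Nothing to clean up.
        }
    }

The fallback to the full partition list only guards against the corner case
where every partition of a topic is reported unavailable; whether the breaker
should ever mute all partitions of a topic is exactly the trade-off discussed
in the answer to question 3 above.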