I created https://issues.apache.org/jira/browse/KAFKA-17562 for this.
Best,
Martin

On Wed, Sep 11, 2024 at 5:29 PM Martin Dickson <martin.dick...@datadoghq.com> wrote:

Hi Justine and Calvin,

Thank you both for the feedback, I agree that KIP-966 should solve #2. I'll file a JIRA ticket for #1 (currently in the process of registering my account).

Best,
Martin

On Tue, Sep 10, 2024 at 9:36 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Calvin and Martin,

Makes sense. So KIP-966 can help with 2, but 1 (some mechanism to identify the issue) is still required. If you haven't already filed a JIRA ticket for this, do you mind doing so? I think it makes sense to close this gap.

Justine

On Tue, Sep 10, 2024 at 1:15 PM Calvin Liu <ca...@confluent.io.invalid> wrote:

Hi Martin,
Yes, KIP-966 does not resolve your concern about the degraded leader. To fill the gap:
1. As you have mentioned, we need a leader degradation detection mechanism, so that the controller can promote another replica to the leader.
2. The controller needs to know which replica is a valid candidate to be the leader. ELR could be a good candidate (KIP-966 currently targets 4.0).
It is a very interesting problem; the community can probably pick this up after 4.0.

Calvin

On Tue, Sep 10, 2024 at 9:37 AM Martin Dickson <martin.dick...@datadoghq.com.invalid> wrote:

Hi Justine,

Indeed we are very much interested in the improvements that will come from KIP-966!

However, I think there is still a gap regarding failure detection of the leader. Please correct me if this is wrong, but my understanding is that with KIP-966 we'll stop advancing the HWM when minISR isn't satisfied, and will be able to fail over leadership to any member of the ELR following complete failure of the leader. This gives good options for recovery following a complete failure. However, if the leader remains degraded but doesn't completely fail, then the partition will stay online but will still be unavailable for writes. (So probably a better subject for my email would have been "Single degraded brokers causing unavailable partitions", which is an issue that I think remains post-KIP-966?)

Best,
Martin

On Tue, Sep 10, 2024 at 5:14 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

Hey Martin,

I recommend you take a look at KIP-966. I think it can help the use case you are describing. The KIP talks about failure scenarios, but I believe it will also help when the leader has issues and kicks its followers out of ISR. The goal is to better handle the "last replica standing" issue:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas

Let me know if it helps,

Justine

On Tue, Sep 10, 2024 at 9:00 AM Martin Dickson <martin.dick...@datadoghq.com.invalid> wrote:

Hi all,

We have a recurring issue with single broker failures causing offline partitions. The issue is that when a leader is degraded, follower fetches can fail to happen in a timely manner, and all followers can fall out of sync.
If that leader then later fails, the partition will go offline, but even if it remains only partially failed, applications might still be impacted (for example, if the producer is using acks=all and min.insync.replicas=2). This can all happen because of a problem solely with the leader, and hence a single broker failure can cause unavailability, even if RF=3 or higher.

We've seen the issue with various kinds of failures, mostly related to failing disks, e.g. due to pressure on request handler threads as a result of produce requests waiting on a slow disk. But the easiest way for us to reproduce it is at the outgoing network level: setting up a cluster with moderate levels of incoming throughput and then injecting 50% outgoing packet drop on a single broker is enough to make follower fetch requests slow and replication lag, but not enough for that broker to lose its connection to ZK. This leaves the degraded broker as the only member of ISR.

We have replica.lag.time.max.ms=10000 and zookeeper.session.timeout.ms=6000 (the pre-KIP-537 values, 1/3 of the current defaults, to control produce latency when a follower is failing). We are also able to reproduce the issue in the same way on a KRaft cluster with the KRaft defaults. (Note that we are not very experienced with operating KRaft as we aren't running it in production yet.)

The last KIP I saw regarding this was KIP-501
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-501+Avoid+out-of-sync+or+offline+partitions+when+follower+fetch+requests+are+not+processed+in+time>,
which describes this exact problem. The proposed solution there was, first, to introduce a notion of pending requests, and second, to relinquish leadership if pending requests are taking too long. The discussion thread
<https://lists.apache.org/thread/1kbs68dq60p31frpfsr3x1vcqlzjf60x>
for that doesn't come to a conclusion. However, it is pointed out that not all failure modes would be solved by the pending requests approach, and that whilst relinquishing leadership seems ideal, there are concerns about this thrashing in certain failure modes.

We are experimenting with a variation on KIP-501 where we add a heuristic for brokers failing this way: if the leader for a partition has removed many followers from ISR in a short period of time (including the case when it sends a single AlterPartition request removing all followers from ISR and thereby shrinking ISR only to itself), have the controller ignore this request and instead choose one of the followers to become leader. To avoid thrashing, rate-limit how often the controller can do this per topic-partition.
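To make that concrete, here is a rough, self-contained Java sketch of the heuristic (written just for this email; the class, method, and parameter names are illustrative and do not correspond to the actual code in our draft PR):

import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and structure are made up for this email
// and do not match the real controller code.
final class IsrShrinkOverride {

    private final Duration perPartitionCooldown;
    private final Map<String, Instant> lastOverride = new HashMap<>();

    IsrShrinkOverride(Duration perPartitionCooldown) {
        this.perPartitionCooldown = perPartitionCooldown;
    }

    /**
     * Called when the leader of topicPartition proposes shrinking the ISR.
     * Returns the broker id of a follower to promote to leader instead,
     * or -1 to accept the shrink as usual.
     */
    int maybeOverrideShrink(String topicPartition,
                            int leaderId,
                            List<Integer> proposedIsr,
                            List<Integer> removedFollowers) {
        // Only treat the leader as suspect when it tries to shrink the ISR
        // down to just itself in one go.
        boolean shrinksToSelf =
            proposedIsr.size() == 1 && proposedIsr.get(0) == leaderId;
        if (!shrinksToSelf || removedFollowers.isEmpty()) {
            return -1;
        }

        // Rate limit: if we already overrode a shrink for this partition
        // recently, let this one through to avoid leadership thrashing.
        Instant now = Instant.now();
        Instant last = lastOverride.get(topicPartition);
        if (last != null
                && Duration.between(last, now).compareTo(perPartitionCooldown) < 0) {
            return -1;
        }

        lastOverride.put(topicPartition, now);
        // Pick one of the followers the leader tried to remove and promote it.
        return removedFollowers.get(0);
    }
}

In the actual change the check sits on the controller side, in its handling of AlterPartition requests, as described above; the sketch glosses over how the replacement leader is chosen.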
We have tested that this fixes our repro, but have not productionised it (see rough draft PR <https://github.com/DataDog/kafka/pull/15/files>). We have only implemented it in ZK mode so far. We implemented this on the controller side out of convenience (no API changes), but potentially the demotion decision should be taken at the broker level instead, which should also be possible.

Whilst the code change is small, the proposed solution we're investigating isn't very clean and we're not totally satisfied with it. We wanted to get some ideas from the community on:
1. How are other folks handling this class of issues?
2. Is there any interest in adding more comprehensive failing broker detection to Kafka (particularly how this could look in KRaft)?
3. Is there any interest in having a heuristic failure detection like the one described above?

Thanks,
Martin