Hi Kamal,

> Is the leader election automated to find the replica with the highest
> offset and latest epoch?
> And finding the eligible replica manually will increase the outage
> mitigation time

That's right, but this procedure is not yet automated in our operation,
because we found that choosing the replica with the highest offset and
latest epoch is still NOT safe: it may lose committed messages in rare edge
cases when there are simultaneous preferred-leader switches (which might be
the case in our deployment, because we enable Cruise Control).

We used formal methods to prove this.
- https://speakerdeck.com/line_developers/the-application-of-formal-methods-in-kafka-reliability-engineering
- https://github.com/ocadaruma/kafka-spec

So I believe we need KIP-966.

On Wed, Sep 11, 2024 at 15:55 Kamal Chandraprakash <kamal.chandraprak...@gmail.com> wrote:

> Hi Haruki,
>
> We are also interested in this issue.
>
> > The problem is how to identify such "eligible" replicas.
>
> Is the leader election automated to find the replica with the highest
> offset and latest epoch?
> If yes, could you please open a PR for it?
>
> When a broker goes down, it might be serving leadership for 1000s of
> partitions.
> And finding the eligible replica manually will increase the outage
> mitigation time, as the producers/consumers are blocked when there are
> offline partitions.
>
> --
> Kamal
>
>
> On Wed, Sep 11, 2024 at 3:57 AM Haruki Okada <ocadar...@gmail.com> wrote:
>
> > Hi Martin,
> >
> > Thank you for bringing up this issue.
> >
> > We have suffered from this "single-broker failure causing unavailable
> > partitions" issue due to disk failures for years too! We use HDDs, and HDDs
> > easily exhibit high disk latency (tens of seconds) on a disk glitch, which
> > often blocks request-handler threads, making the broker unable to handle
> > fetch requests and then kicking followers out of the ISR.
> >
> > I believe solving the issue fundamentally is impossible unless we stop
> > relying on an external quorum (either KRaft or ZK) for failure
> > detection/leader election and move to quorum-based data replication, which
> > is not currently planned in Kafka.
> >
> > Let me share some of our experiences on how to address this problem.
> >
> > ## Proactive disk replacement / broker removal
> > This is kind of a dumb solution, but we monitor disk health (e.g. physical
> > disk error counts behind RAID controllers) and replace disks or remove
> > brokers proactively before things get worse.
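> >
> > For example (the exact command depends on the RAID controller; this is just
> > an illustration), something like the following can be run periodically and
> > alerted on:
> >
> >     # SMART attributes / error counters of a disk behind a MegaRAID controller
> >     smartctl -a -d megaraid,0 /dev/sda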
> >
> > ## Mitigate disk failure impact to the broker functionality
> > In the first place, Kafka is basically page-cache intensive, so disk
> > latency impacting the broker this much is unexpected.
> > We found some call paths that amplify the disk-latency impact, and we fixed
> > them.
> >
> > - https://github.com/apache/kafka/pull/14289
> >     * This is a heuristic to mitigate KAFKA-7504, the network-thread
> >       blocking issue on catch-up reads, which may impact many clients
> >       (including followers)
> >     * Not merged upstream yet, but we have run this patch in production
> >       for years.
> > - https://github.com/apache/kafka/pull/14242
> >     * This is a patch to stop calling fsync under Log#lock. Calling fsync
> >       while holding the lock can easily exhaust all request-handler threads
> >       due to lock contention when one thread is executing fsync (disk
> >       latency directly impacts fsync latency). See the sketch below.
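> >
> > To illustrate the idea with a minimal sketch (a hypothetical SnapshotWriter
> > class, not the actual code in the PR): do the in-memory/page-cache work
> > while holding the lock, and call fsync only after releasing it.
> >
> >     import java.io.IOException;
> >     import java.nio.ByteBuffer;
> >     import java.nio.channels.FileChannel;
> >     import java.nio.file.Path;
> >     import java.nio.file.StandardOpenOption;
> >
> >     class SnapshotWriter {
> >         private final Object lock = new Object();
> >
> >         void writeAndSync(Path file, byte[] snapshot) throws IOException {
> >             try (FileChannel ch = FileChannel.open(file,
> >                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
> >                 synchronized (lock) {
> >                     // critical section: write to the page cache only (fast)
> >                     ch.write(ByteBuffer.wrap(snapshot));
> >                 }
> >                 // fsync outside the lock, so a slow disk does not block
> >                 // every other thread contending on the same lock
> >                 ch.force(true);
> >             }
> >         }
> >     }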
> >
> > ## Prepare offline-partition-handling manual as the last resort
> > Even with the above efforts, unavailable partitions may still occur, so we
> > prepared a (manual) runbook for such situations.
> > Essentially, this is a procedure to do KIP-966 manually.
> >
> > We use acks=all and min.insync.replicas=2 on all partitions, which means
> > there should be one "eligible" replica (i.e. one that has all committed
> > messages) even after a partition goes offline.
> > The problem is how to identify such "eligible" replicas.
> >
> > If we can still log in to the last leader, we can just check whether the
> > log suffix matches (using the DumpLogSegments tool).
> > What if the last leader fails completely and we are unable to log in?
> > In this case, we check the remaining two replicas' log segments and treat
> > the one with the longer log as the "eligible" replica, as long as they have
> > the same leader epoch.
> > (NOTE: Checking the leader epoch is necessary because, if the leader was
> > changing around the time of the incident, the "replica with the higher
> > offset" and the "replica with all committed messages" may not match.)
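> >
> > For example (hypothetical paths; output details vary by Kafka version), we
> > dump the tail of the last log segment on each remaining replica and compare
> > the last batch's offset and partitionLeaderEpoch:
> >
> >     bin/kafka-dump-log.sh \
> >       --files /path/to/kafka-logs/my-topic-0/00000000000012345678.log | tail -n 5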
> >
> >
> > Hope these help you.
> >
> > On Wed, Sep 11, 2024 at 5:36 Justine Olshan <jols...@confluent.io.invalid> wrote:
> >
> > > Hey Calvin and Martin,
> > >
> > > Makes sense. So KIP-966 can help with 2, but 1 (some mechanism to identify
> > > the issue) is still required.
> > > If you haven't already filed a JIRA ticket for this, do you mind doing so?
> > > I think it makes sense to close this gap.
> > >
> > > Justine
> > >
> > > On Tue, Sep 10, 2024 at 1:15 PM Calvin Liu <ca...@confluent.io.invalid>
> > > wrote:
> > >
> > > > Hi Martin.
> > > > Yes, KIP-966 does not resolve your concern about the degraded leader. To
> > > > fill the gap:
> > > > 1. As you have mentioned, we need a leader-degradation detection mechanism,
> > > > so that the controller can promote another replica to the leader.
> > > > 2. The controller needs to know which replica is a valid candidate to be
> > > > the leader. ELR could be a good candidate (KIP-966 currently targets 4.0).
> > > > It is a very interesting problem; probably the community can pick this up
> > > > after 4.0.
> > > >
> > > > Calvin
> > > >
> > > > On Tue, Sep 10, 2024 at 9:37 AM Martin Dickson
> > > > <martin.dick...@datadoghq.com.invalid> wrote:
> > > >
> > > > > Hi Justine,
> > > > >
> > > > > Indeed we are very much interested in the improvements that will come
> > > > > from KIP-966!
> > > > >
> > > > > However, I think there is still a gap regarding failure detection of the
> > > > > leader. Please correct me if this is wrong, but my understanding is that
> > > > > with KIP-966 we'll stop advancing the HWM when minISR isn't satisfied, and
> > > > > we will be able to fail over leadership to any member of the ELR following
> > > > > complete failure of the leader. This gives good options for recovery
> > > > > following a complete failure; however, if the leader remains degraded but
> > > > > doesn't completely fail, then the partition will stay online but will still
> > > > > be unavailable for writes. (So probably a better subject for my email would
> > > > > have been "Single degraded brokers causing unavailable partitions", which is
> > > > > an issue that I think remains post KIP-966?)
> > > > >
> > > > > Best,
> > > > > Martin
> > > > >
> > > > >
> > > > > On Tue, Sep 10, 2024 at 5:14 PM Justine Olshan
> > > > > <jols...@confluent.io.invalid>
> > > > > wrote:
> > > > >
> > > > > > Hey Martin.
> > > > > >
> > > > > > I recommend you take a look at KIP-966. I think it can help the use case
> > > > > > you are describing.
> > > > > > The KIP talks about failure scenarios, but I believe it will also help when
> > > > > > the leader has issues and kicks its followers out of the ISR.
> > > > > > The goal is to better handle the "last replica standing" issue:
> > > > > >
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
> > > > > >
> > > > > > Let me know if it helps,
> > > > > >
> > > > > > Justine
> > > > > >
> > > > > > On Tue, Sep 10, 2024 at 9:00 AM Martin Dickson
> > > > > > <martin.dick...@datadoghq.com.invalid> wrote:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > We have a recurring issue with single broker failures causing offline
> > > > > > > partitions. The issue is that when a leader is degraded, follower fetches
> > > > > > > can fail to happen in a timely manner, and all followers can fall out of
> > > > > > > sync. If that leader later fails, the partition will go offline, but even
> > > > > > > if it remains only partially failed, applications might still be impacted
> > > > > > > (for example, if the producer is using acks=all and min.insync.replicas=2).
> > > > > > > This can all happen because of a problem solely with the leader, and hence
> > > > > > > a single broker failure can cause unavailability, even with RF=3 or higher.
> > > > > > >
> > > > > > > We’ve seen the issue with various kinds of failures, mostly related to
> > > > > > > failing disks, e.g. due to pressure on request handler threads as a result
> > > > > > > of produce requests waiting on a slow disk. But the easiest way for us to
> > > > > > > reproduce it is at the outgoing network level: setting up a cluster with
> > > > > > > moderate levels of incoming throughput and then injecting 50% outgoing
> > > > > > > packet drop on a single broker is enough to cause follower requests to be
> > > > > > > slow and replication to lag, but not enough for that broker to lose its
> > > > > > > connection to ZK. This triggers the degraded broker to become the only
> > > > > > > member of the ISR.
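> > > > > > >
> > > > > > > For reference, a minimal way to inject that kind of impairment on the test
> > > > > > > broker (the interface name is just an example):
> > > > > > >
> > > > > > >     # drop 50% of outgoing packets on eth0
> > > > > > >     tc qdisc add dev eth0 root netem loss 50%
> > > > > > >     # remove the impairment afterwards
> > > > > > >     tc qdisc del dev eth0 root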
> > > > > > >
> > > > > > > We have replica.lag.time.max.ms=10000 and zookeeper.session.timeout.ms=6000
> > > > > > > (the pre-KIP-537 values, 1/3 of the current defaults, to control produce
> > > > > > > latency when a follower is failing). We are also able to reproduce the
> > > > > > > issue in the same way on a KRaft cluster with the KRaft defaults. (Note
> > > > > > > that we are not very experienced with operating KRaft as we aren’t running
> > > > > > > it in production yet.)
> > > > > > >
> > > > > > > The last KIP I saw regarding this was KIP-501
> > > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-501+Avoid+out-of-sync+or+offline+partitions+when+follower+fetch+requests+are+not+processed+in+time>,
> > > > > > > which describes this exact problem. The proposed solution there was, in the
> > > > > > > first part, to introduce a notion of pending requests, and, in the second
> > > > > > > part, to relinquish leadership if pending requests are taking too long. The
> > > > > > > discussion thread
> > > > > > > <https://lists.apache.org/thread/1kbs68dq60p31frpfsr3x1vcqlzjf60x>
> > > > > > > for that doesn’t come to a conclusion. However, it is pointed out that not
> > > > > > > all failure modes would be solved by the pending-requests approach, and
> > > > > > > that whilst relinquishing leadership seems ideal, there are concerns about
> > > > > > > it thrashing in certain failure modes.
> > > > > > >
> > > > > > > We are experimenting with a variation on KIP-501 where we add a heuristic
> > > > > > > for brokers failing this way: if the leader for a partition has removed
> > > > > > > many followers from the ISR in a short period of time (including the case
> > > > > > > when it sends a single AlterPartition request removing all followers from
> > > > > > > the ISR and thereby shrinking the ISR to only itself), have the controller
> > > > > > > ignore this request and instead choose one of the followers to become
> > > > > > > leader. To avoid thrashing, rate-limit how often the controller can do this
> > > > > > > per topic-partition. We have tested that this fixes our repro, but have not
> > > > > > > productionised it (see rough draft PR
> > > > > > > <https://github.com/DataDog/kafka/pull/15/files>). We have only implemented
> > > > > > > ZK mode so far. We implemented this on the controller side out of
> > > > > > > convenience (no API changes), but potentially the demotion decision should
> > > > > > > be taken at the broker level instead, which should also be possible.
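> > > > > > >
> > > > > > > To make the heuristic concrete, here is a rough sketch (a hypothetical
> > > > > > > class, one instance per topic-partition, not the code in the draft PR): the
> > > > > > > controller overrides an ISR shrink to leader-only by electing a follower
> > > > > > > instead, at most once per cooldown period.
> > > > > > >
> > > > > > >     import java.util.List;
> > > > > > >
> > > > > > >     class IsrShrinkGuard {
> > > > > > >         private final long cooldownMs;
> > > > > > >         private Long lastOverrideMs = null;
> > > > > > >
> > > > > > >         IsrShrinkGuard(long cooldownMs) {
> > > > > > >             this.cooldownMs = cooldownMs;
> > > > > > >         }
> > > > > > >
> > > > > > >         // Returns true if the controller should ignore a request that
> > > > > > >         // shrinks the ISR to the leader alone and elect a follower instead.
> > > > > > >         boolean shouldOverride(int leaderId, List<Integer> proposedIsr,
> > > > > > >                                long nowMs) {
> > > > > > >             boolean shrinkToLeaderOnly = proposedIsr.size() == 1
> > > > > > >                     && proposedIsr.get(0) == leaderId;
> > > > > > >             if (!shrinkToLeaderOnly) {
> > > > > > >                 return false;
> > > > > > >             }
> > > > > > >             if (lastOverrideMs != null && nowMs - lastOverrideMs < cooldownMs) {
> > > > > > >                 return false;  // rate-limit to avoid leadership thrashing
> > > > > > >             }
> > > > > > >             lastOverrideMs = nowMs;
> > > > > > >             return true;
> > > > > > >         }
> > > > > > >     }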
> > > > > > >
> > > > > > > Whilst the code change is small, the proposed solution we’re investigating
> > > > > > > isn’t very clean and we’re not totally satisfied with it. We wanted to get
> > > > > > > some ideas from the community on:
> > > > > > > 1. How are other folks handling this class of issues?
> > > > > > > 2. Is there any interest in adding more comprehensive failing-broker
> > > > > > > detection to Kafka (particularly how this could look in KRaft)?
> > > > > > > 3. Is there any interest in having a heuristic failure detection like the
> > > > > > > one described above?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Martin
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > ========================
> > Okada Haruki
> > ocadar...@gmail.com
> > ========================
> >
>


-- 
========================
Okada Haruki
ocadar...@gmail.com
========================
