Hi Justine,

Indeed we are very much interested in the improvements that will come from KIP-966!

However, I think there is still a gap regarding failure detection of the leader. Please correct me if this is wrong, but my understanding is that with KIP-966 we'll stop advancing the HWM when minISR isn't satisfied, and we'll be able to fail over leadership to any member of the ELR following a complete failure of the leader. This gives good options for recovery after a complete failure. However, if the leader remains degraded but doesn't completely fail, the partition will stay online but will still be unavailable for writes. (So probably a better subject for my email would have been "Single degraded brokers causing unavailable partitions", which is an issue I think remains post-KIP-966?)
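To make "unavailable for writes" concrete, here is a minimal sketch of what our applications see once a degraded leader has shrunk ISR below min.insync.replicas. The topic name and bootstrap address are placeholders, and retries/idempotence are disabled only so the error surfaces immediately instead of after delivery.timeout.ms:

    import java.util.Properties;
    import java.util.concurrent.ExecutionException;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.NotEnoughReplicasException;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class MinIsrUnavailabilityDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            // Disabled only so the failure surfaces on the first attempt for this demo;
            // with default retries the send instead fails after delivery.timeout.ms.
            props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "false");
            props.put(ProducerConfig.RETRIES_CONFIG, 0);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // On a topic with min.insync.replicas=2 whose leader has shrunk ISR to itself,
                // the partition is still "online" but this send is rejected by the leader.
                producer.send(new ProducerRecord<>("test-topic", "k", "v")).get();
            } catch (ExecutionException e) {
                if (e.getCause() instanceof NotEnoughReplicasException) {
                    System.out.println("Partition online but unavailable for writes: " + e.getCause().getMessage());
                } else {
                    throw e;
                }
            }
        }
    }

Because leadership never moves away from the degraded broker, producers keep hitting this until that broker either recovers or fails completely.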
Best,
Martin

On Tue, Sep 10, 2024 at 5:14 PM Justine Olshan <jols...@confluent.io.invalid> wrote:

> Hey Martin.
>
> I recommend you take a look at KIP-966. I think it can help the use case you are describing. The KIP talks about failure scenarios, but I believe it will also help when the leader has issues and kicks its followers out of ISR. The goal is to better handle the "last replica standing" issue:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-966%3A+Eligible+Leader+Replicas
>
> Let me know if it helps,
>
> Justine
>
> On Tue, Sep 10, 2024 at 9:00 AM Martin Dickson <martin.dick...@datadoghq.com.invalid> wrote:
>
> > Hi all,
> >
> > We have a recurring issue with single broker failures causing offline partitions. The issue is that when a leader is degraded, follower fetches can fail to happen in a timely manner, and all followers can fall out of sync. If that leader later fails completely, the partition will go offline; but even if it remains only partially failed, applications might still be impacted (for example, if the producer is using acks=all and min.insync.replicas=2). This can all happen because of a problem solely with the leader, so a single broker failure can cause unavailability even with RF=3 or higher.
> >
> > We've seen the issue with various kinds of failures, mostly related to failing disks, e.g. due to pressure on request handler threads as a result of produce requests waiting on a slow disk. But the easiest way for us to reproduce it is at the outgoing network level: setting up a cluster with moderate incoming throughput and then injecting 50% outgoing packet drop on a single broker is enough to make follower fetch requests slow and replication lag, but not enough for that broker to lose its connection to ZK. This causes the degraded broker to become the only member of ISR.
> >
> > We have replica.lag.time.max.ms=10000 and zookeeper.session.timeout.ms=6000 (the pre-KIP-537 values, 1/3 of the current defaults, to control produce latency when a follower is failing). We are also able to reproduce the issue in the same way on a KRaft cluster with the KRaft defaults. (Note that we are not very experienced with operating KRaft as we aren't running it in production yet.)
> >
> > The last KIP I saw regarding this was KIP-501 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-501+Avoid+out-of-sync+or+offline+partitions+when+follower+fetch+requests+are+not+processed+in+time>, which describes this exact problem. The proposed solution there was, first, to introduce a notion of pending requests, and second, to relinquish leadership if pending requests are taking too long. The discussion thread <https://lists.apache.org/thread/1kbs68dq60p31frpfsr3x1vcqlzjf60x> for that doesn't come to a conclusion. However, it is pointed out that not all failure modes would be solved by the pending-requests approach, and that whilst relinquishing leadership seems ideal, there are concerns about thrashing in certain failure modes.
> >
> > We are experimenting with a variation on KIP-501 where we add a heuristic for brokers failing this way: if the leader for a partition has removed many followers from ISR in a short period of time (including the case where it sends a single AlterPartition request removing all followers from ISR and thereby shrinking ISR to just itself), have the controller ignore this request and instead choose one of the followers to become leader. To avoid thrashing, rate-limit how often the controller can do this per topic-partition. We have tested that this fixes our repro, but have not productionised it (see rough draft PR <https://github.com/DataDog/kafka/pull/15/files>); we have only implemented ZK mode so far. We implemented this on the controller side out of convenience (no API changes), but potentially the demotion decision should be taken at the broker level instead, which should also be possible.
> >
> > Whilst the code change is small, the proposed solution we're investigating isn't very clean and we're not totally satisfied with it. We wanted to get some ideas from the community on:
> > 1. How are other folks handling this class of issues?
> > 2. Is there any interest in adding more comprehensive failing broker detection to Kafka (particularly how this could look in KRaft)?
> > 3. Is there any interest in having a heuristic failure detection like the one described above?
> >
> > Thanks,
> > Martin
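P.S. To make the heuristic from my original mail above a bit more concrete, here is a very rough, illustrative sketch of the shape of the check. The class and method names are made up for this email and it only handles the single shrink-to-self AlterPartition case; the actual draft PR does this on the controller side as described above:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only: decides whether the controller should reject an ISR shrink
    // requested by the leader and demote it instead, with a per-partition rate limit
    // to avoid leadership thrashing.
    class IsrShrinkGuard {
        private final long minIntervalMs;
        private final Map<String, Long> lastDemotionMs = new HashMap<>();

        IsrShrinkGuard(long minIntervalMs) {
            this.minIntervalMs = minIntervalMs;
        }

        // Returns the follower to elect as the new leader, or null to accept the
        // shrink as usual.
        Integer maybeDemoteLeader(String topicPartition, int leaderId,
                                  List<Integer> currentIsr, List<Integer> proposedIsr,
                                  long nowMs) {
            // Only intervene when the leader tries to shrink ISR down to just itself.
            boolean shrinksToLeaderOnly =
                proposedIsr.size() == 1 && proposedIsr.contains(leaderId) && currentIsr.size() > 1;
            if (!shrinksToLeaderOnly) {
                return null;
            }
            // Rate-limit per topic-partition so a misfiring heuristic cannot bounce
            // leadership repeatedly.
            long last = lastDemotionMs.getOrDefault(topicPartition, 0L);
            if (nowMs - last < minIntervalMs) {
                return null;
            }
            lastDemotionMs.put(topicPartition, nowMs);
            // Pick one of the followers the leader was trying to kick out of ISR.
            return currentIsr.stream().filter(r -> r != leaderId).findFirst().orElse(null);
        }
    }

Again, this is just the general shape rather than the patch itself; the parts we're least sure about are whether this belongs on the controller or on the broker, and how to choose a sensible rate limit.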