Re: KRaft controller disaster recovery

2025-04-10 Thread Anton Agestam
Hi all, Thanks Juha for bringing this discussion here. To everyone else: I am Juha's colleague at Aiven and am currently working on introducing the kind of tooling discussed in this thread, to be used in worst-case scenarios. I have a proof of concept working. The various inputs and concerns raised …

Re: KRaft controller disaster recovery

2025-04-08 Thread José Armando García Sancio
Hi Luke and Colin,

On Mon, Apr 7, 2025 at 10:29 PM Luke Chen wrote:
> That's why we were discussing if there's any way to "force" recover the
> scenario, even if it's possible to have data loss.

Yes. There is a way. They need to configure a controller cluster that matches the voter set in the cl…
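As background for José's suggestion, the voter set that a replacement controller cluster would have to match can be inspected on a live quorum with the stock `kafka-metadata-quorum.sh` tool; a minimal sketch, assuming a reachable broker at a placeholder host and port:

```shell
# Placeholder bootstrap address; substitute a reachable broker.
# `describe --status` prints the quorum state recorded in the cluster
# metadata, including LeaderId, CurrentVoters, and CurrentObservers.
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

# `describe --replication` shows per-replica detail: the log end offset
# and lag of each voter and observer in the metadata partition.
bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --replication
```

In the disaster scenario under discussion the quorum is down, so this only helps if the voter set was captured beforehand or can be recovered from a surviving node's metadata log.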

Re: KRaft controller disaster recovery

2025-04-08 Thread Colin McCabe
On Mon, Apr 7, 2025, at 19:29, Luke Chen wrote:
> Hi Jose and Colin,
>
> Thanks for your explanation!
>
> Yes, we all agree that 3 node quorum can only tolerate 1 node down.
> We just want to discuss, "what if" 2 out of 3 nodes are down at the same
> time, what can we do?
> Currently, the result is …

Re: KRaft controller disaster recovery

2025-04-07 Thread Luke Chen
Hi Jose and Colin,

Thanks for your explanation!

Yes, we all agree that a 3-node quorum can only tolerate 1 node down. We just want to discuss: "what if" 2 out of 3 nodes are down at the same time, what can we do? Currently, the result is that the quorum will never form and the whole Kafka cluster is …
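The failure-tolerance numbers quoted in this exchange follow directly from Raft's majority rule; a quick sketch of the arithmetic:

```shell
# Raft needs a strict majority of the voter set to elect a leader and
# commit records: majority(N) = floor(N/2) + 1, so a cluster of N voters
# tolerates N - majority(N) simultaneous failures.
for n in 1 3 5 7; do
  majority=$(( n / 2 + 1 ))
  tolerated=$(( n - majority ))
  echo "voters=$n majority=$majority tolerated_failures=$tolerated"
done
# With 3 voters the majority is 2; losing 2 of the 3 nodes leaves the
# survivor short of a majority, so no leader can be elected regardless
# of which node remains.
```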

Re: KRaft controller disaster recovery

2025-04-07 Thread Colin McCabe
Hi José,

I think you make a valid point that our guarantees here are not actually different from ZooKeeper. In both systems, if you lose quorum, you will probably lose some data. Of course, how much data you lose depends on luck. If the last node standing was the active controller / ZooKeeper, …

Re: KRaft controller disaster recovery

2025-04-07 Thread José Armando García Sancio
Thanks Luke.

On Thu, Apr 3, 2025 at 7:14 AM Luke Chen wrote:
> In addition to the approaches you provided, maybe we can have a way to
> "force" KRaft to honor "controller.quorum.voters" config, instead of
> "controller.quorum.bootstrap.servers", even if it's in kraft.version 1.

Small clarification. …

Re: KRaft controller disaster recovery

2025-04-07 Thread José Armando García Sancio
Hi Juha,

Thanks for the discussion.

On Thu, Apr 3, 2025 at 4:08 AM Juha Mynttinen wrote:
> Consider the following Kafka controller setup. There are three controllers
> c1, c2 and c3, each on its own hardware. All controllers are voters and
> let's assume c1 is the leader. Assume new servers can b…

KRaft controller disaster recovery

2025-04-05 Thread Juha Mynttinen
Consider the following Kafka controller setup. There are three controllers c1, c2 and c3, each on its own hardware. All controllers are voters and let's assume c1 is the leader. Assume new servers can be added as needed to replace broken ones, but broken/lost servers cannot be brought back. If a new …
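For context, the non-disaster version of this scenario is already covered: when a majority of voters is still alive, KIP-853 (dynamic quorums, kraft.version 1) supports replacing a broken voter with stock tooling. A hedged sketch, with the cluster id, hostnames, controller id, and directory id as placeholders; exact flags may differ between Kafka versions:

```shell
# 1. Format storage on the replacement node without declaring it a voter,
#    so it starts as an observer of the metadata log.
bin/kafka-storage.sh format --cluster-id "$CLUSTER_ID" \
  --config controller.properties --no-initial-controllers

# 2. Start the new controller, then promote it into the voter set once it
#    has caught up on the metadata log.
bin/kafka-metadata-quorum.sh --command-config controller.properties \
  --bootstrap-controller c2:9093 add-controller

# 3. Remove the dead voter (its directory id can be read from
#    `describe --status` output on a surviving node).
bin/kafka-metadata-quorum.sh --bootstrap-controller c2:9093 \
  remove-controller --controller-id 1 --controller-directory-id "$DIR_ID"
```

The catch, and the point of this thread, is that every step above requires a functioning quorum: once two of the three voters are gone, `add-controller` and `remove-controller` cannot commit the membership change.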

Re: KRaft controller disaster recovery

2025-04-05 Thread Luke Chen
Hi Juha,

Thanks for bringing this. I agree having a way to recover from this "majority of controller nodes down" issue is valuable, even though this is rare. In addition to the approaches you provided, maybe we can have a way to "force" KRaft to honor the "controller.quorum.voters" config, instead of "controller.quorum.bootstrap.servers", even if it's in kraft.version 1. …
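For reference, the static-voters configuration Luke refers to looks roughly like this (node ids, hostnames, and ports are hypothetical):

```shell
# Sketch of a pre-KIP-853 style controller config with a static voter set.
# Each entry in controller.quorum.voters is id@host:port.
cat > controller.properties <<'EOF'
process.roles=controller
node.id=1
controller.listener.names=CONTROLLER
listeners=CONTROLLER://c1:9093
controller.quorum.voters=1@c1:9093,2@c2:9093,3@c3:9093
EOF
```

With kraft.version 1 the voter set lives in the metadata log itself and controllers are discovered via `controller.quorum.bootstrap.servers`, which is why forcing Kafka back onto a static list would need the kind of override or tooling this thread is proposing.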