Hi Juha,

Thanks for bringing this up. I agree that having a way to recover from this "majority of controller nodes down" situation is valuable, even though it is rare. In addition to the approaches you listed, maybe we could add a way to "force" KRaft to honor the "controller.quorum.voters" config instead of "controller.quorum.bootstrap.servers", even when the cluster is on kraft.version 1. That way, we could update the configs, roll the surviving controller node (c1), and start the new controller nodes (c4, c5) to recover the quorum.
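To make that concrete, a minimal sketch of what the configs could look like. The "controller.quorum.voters.force" flag is hypothetical (no such config exists today), and the hosts and ports are made up:

    # controller.properties on c1, c4 and c5 (node.id 1, 4, 5 respectively)
    process.roles=controller
    listeners=CONTROLLER://0.0.0.0:9093
    controller.listener.names=CONTROLLER
    # Hypothetical flag: honor the static voter set below even though
    # the cluster is already on kraft.version 1.
    controller.quorum.voters.force=true
    controller.quorum.voters=1@c1:9093,4@c4:9093,5@c5:9093

Rolling c1 with something like this and then starting c4 and c5 would bring back a quorum of three voters.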
Thanks,
Luke

On Thu, Apr 3, 2025 at 4:31 PM Juha Mynttinen <juha.myntti...@aiven.io.invalid> wrote:

> Consider the following Kafka controller setup. There are three controllers
> c1, c2 and c3, each on its own hardware. All controllers are voters, and
> let's assume c1 is the leader. Assume new servers can be added as needed
> to replace broken ones, but broken/lost servers cannot be brought back. If
> a new server is needed, it'd be c4, and the one after that would be c5.
>
> Now, let's assume the controllers c2 and c3 are irreversibly lost for some
> reason, e.g. because of an unrecoverable hardware failure. To repair the
> controller cluster, i.e. to make it again have enough voters (at least
> two), we'd like to remove the voters c2 and c3 (make them observers) and
> add controllers c4 and c5 to the cluster as voters. To do this, we create
> those servers and they join the controller cluster as observers. However,
> adding c4 or c5 as a voter is currently not possible in this situation,
> nor is making c2 and c3 observers. The controller cluster cannot handle an
> AddRaftVoterRequest to add c4 or c5, because a majority of the voters is
> needed and only one out of three is available. Similarly, a
> RemoveRaftVoterRequest cannot be handled.
>
> It can be seen as a limitation in the KRaft tooling that the controller
> cluster cannot be salvaged in this situation. If we compare with
> ZooKeeper-based Kafka, it's possible to recover from a situation where a
> majority of the ZK participants are lost; the ZooKeeper cluster can be
> fixed in a situation like that. Similarly, for Kafka brokers, if there's a
> topic with replication factor three and the non-leaders are lost, there's
> no data loss and it's possible to recover without data loss.
>
> To do damage control in a disastrous situation, and to get a cluster
> online again, it'd be important to have some mechanism to recover from
> this state. Such a tool would obviously come with a risk: it would
> essentially need to bypass the requirement for a majority of voters,
> somehow. However, since the controller cluster is already broken at this
> point, it can be argued that a risky tool is worth it in this situation.
> If the tool fixes the controller cluster, it achieves its purpose. If it
> breaks the cluster more, the situation doesn't get any worse.
>
> There are a few approaches:
>
> 1) A tool that is executed manually to fix the situation. It would work so
> that first the remaining controller c1 is stopped (all controllers need to
> be stopped), and then the tool is used to forcefully append remove-voter
> records to the metadata log, removing c2 and c3. Then c1 is started. Since
> it's the only voter, the controller cluster is operational and c4 and c5
> can be added as voters.
>
> 2) Make the leader accept a remove or add voter even if a majority of the
> voters is not available. This could be done either by adding a new boolean
> field "force" to RemoveRaftVoterRequest (the remove message), with the
> controller somehow bypassing the high watermark check, or by a
> configuration parameter in the controller config that (if enabled) makes
> the controller bypass the HWM check.
>
> 3) Fake the lost controller. Basically, a tool could create the necessary
> config for a "fake" c2, format it and attach it to the controller cluster.
> The cluster would then have enough voters. The tool would then remove the
> voters c3 and c2, and the fake controller would be stopped. As a result,
> only c1 would be a voter, and c4 and c5 could then be added as voters.
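To make approach 3) concrete, here is a rough, untested sketch using the KIP-853 tooling. The DIR_C* values stand for the directory-id UUIDs the quorum has recorded for each voter, and the key assumption (which I have not verified) is that the fake c2 must be formatted with the same node id and directory id as the lost c2, and that "format --initial-controllers" honors the directory id given for the local node:

    # Inspect the current voter set and the directory ids it records
    # (DIR_C1..DIR_C3 below stand for the UUIDs printed here):
    bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status

    # Format the fake c2 so it presents the recorded node id / directory id:
    bin/kafka-storage.sh format -t <cluster-id> -c fake-c2.properties \
      --initial-controllers "1@c1:9093:DIR_C1,2@fake-c2:9093:DIR_C2,3@c3:9093:DIR_C3"

    # Start the fake c2; with two of three voters alive the quorum can
    # commit again, so the dead voters can be removed:
    bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
      remove-controller --controller-id 3 --controller-directory-id DIR_C3
    bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
      remove-controller --controller-id 2 --controller-directory-id DIR_C2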
> Approach 2) sounds the most involved, since it'd mean changes in Kafka
> itself. Approach 3) feels a little hacky, but requires no changes in
> Kafka. Approach 1) could be the most straightforward one. It's similar to
> the mechanism of formatting a controller with initial controllers, which
> likewise appends to the metadata log.
>
> Note that even though a three-node controller setup is discussed here, the
> problem is the same with any number of voters. There's always a limit:
> with 2f+1 voters, a majority survives the loss of at most f of them. If
> more voters than that limit are lost, the cluster goes into the state
> described above.
>
> --
> Regards,
> Juha
>
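As a side note on approach 1), if I read the KIP-853 tooling right, recovery could end up looking something like the sketch below. The first command is purely hypothetical (no such tool exists today); the add-controller step is the existing tooling:

    # Hypothetical offline step, run while all controllers (including c1)
    # are stopped: forcefully append remove-voter records to c1's log.
    bin/kafka-metadata-quorum-recover.sh --config config/controller.properties \
      --force-remove-voter 2 --force-remove-voter 3

    # Restart c1 (now the sole voter), then on c4 and on c5 use the
    # existing tooling to join the quorum as voters:
    bin/kafka-metadata-quorum.sh --command-config config/controller.properties \
      --bootstrap-controller c1:9093 add-controller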