Consider the following Kafka controller setup. There are three controllers, c1, c2 and c3, each on its own hardware. All controllers are voters, and let's assume c1 is the leader. Assume new servers can be added as needed to replace broken ones, but broken/lost servers cannot be brought back. If a new server is needed, it would be c4, and the one after that would be c5.
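(For reference: with the KIP-853 dynamic quorum tooling, the current voter set can be checked with something like the command below; the host name and port are made up.)

  # prints e.g. LeaderId, HighWatermark, CurrentVoters (with their directory ids) and CurrentObservers
  $ bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 describe --status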
Now, let's assume the controllers c2 and c3 are irreversibly lost for some reason, e.g. because of an unrecoverable hardware failure. In order to repair the controller cluster, i.e. to make it again have enough voters (at least two), we'd like to remove the voters c2 and c3 (make them observers) and add controllers c4 and c5 to the cluster as voters. To do this, we create those servers and they join the controller cluster as observers. However, adding c4 or c5 as a voter is currently not possible in this situation, nor is making c2 and c3 observers. The controller cluster is not able to handle an AddRaftVoterRequest to add c4 or c5, because a majority of the voters is needed and only one out of three is available. Similarly, a RemoveRaftVoterRequest cannot be handled. (The corresponding kafka-metadata-quorum.sh calls are sketched after the list of approaches below.)

This can be seen as a limitation in the KRaft tooling: the controller cluster cannot be salvaged in this situation. If we compare with ZooKeeper-based Kafka, there it is possible to recover from a situation where a majority of the ZK participants is lost; the ZooKeeper ensemble can be fixed in a situation like that. Similarly, on the broker side, if a topic has replication factor three and the non-leader replicas are lost, it is possible to recover without data loss. In order to do damage control in a disastrous situation, and to get the cluster online again, it would be important to have some mechanism to recover. Such a tool would obviously come with a risk: it would essentially need to bypass the requirement of a majority of voters, somehow. However, since the controller cluster is already broken at this point, it can be argued that a risky tool is worth it in this situation. If the tool fixes the controller cluster, it achieves its purpose; if it breaks the cluster more, the situation doesn't get any worse.

There are a few possible approaches:

1) A tool that is executed manually to fix the situation. The remaining controller c1 is stopped (all controllers need to be stopped), and the tool is used to forcefully append remove-voter records to the metadata log, removing c2 and c3. Then c1 is started. Since it is the only voter, the controller cluster is operational and c4 and c5 can be added as voters. (A hypothetical invocation is sketched after this list.)

2) Make the leader accept a remove or add voter even when there is no majority of voters. This could be done either by adding a new boolean "force" field to RemoveRaftVoterRequest (the remove message) so that the controller bypasses the high-watermark check, or by adding a configuration parameter to the controller config that, if enabled, makes the controller bypass the HWM check. (The latter is sketched after this list.)

3) Fake the lost controller. A tool could create the necessary config for a "fake" c2, format it and attach it to the controller cluster, so that the cluster again has enough voters. The tool would then remove the voters c3 and c2, and the fake controller would be stopped. As a result, only c1 would be a voter, and c4 and c5 could then be added as voters. (A rough sequence is sketched after this list.)

Approach 2 sounds the most involved, since it would mean changes in Kafka itself. Approach 3 feels a little hacky, but requires no changes in Kafka. Approach 1 could be the most straightforward one; it is similar to the mechanism of formatting a controller with initial controllers, which similarly appends to the metadata log.
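To make the problem concrete, these are roughly the kafka-metadata-quorum.sh calls that cannot be carried out while only c1 is alive, and that would complete the repair once the quorum is functional again. Host names, ports and file names are made up, the directory ids would be taken from the describe --status output, and the add-controller calls are run with the new controller's own properties file, if I have understood the KIP-853 tooling correctly.

  # demote the lost voters c2 and c3
  $ bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
      remove-controller --controller-id 2 --controller-directory-id <c2-directory-id>
  $ bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
      remove-controller --controller-id 3 --controller-directory-id <c3-directory-id>

  # promote the replacements c4 and c5, already running as observers, to voters
  $ bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
      --command-config c4-controller.properties add-controller
  $ bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
      --command-config c5-controller.properties add-controller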
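For approach 1, the invocation could look roughly like this. Neither the tool name nor its options exist today; this is only meant to illustrate the intended operation of appending voter-removal records to c1's metadata log while c1 is stopped.

  # hypothetical tool and flags, not part of Kafka today; run only while c1 is stopped
  $ bin/kafka-metadata-recover.sh --config c1/controller.properties \
      --force-remove-voter 2 --force-remove-voter 3
  # c1 is then started; as the sole voter it forms a working quorum on its own,
  # and c4 and c5 can be added as voters normally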
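For approach 2, the configuration variant could be as small as a single new controller property; the property name below is made up and does not exist in Kafka.

  # hypothetical property, not part of Kafka today; if enabled, the leader would
  # skip the majority/high-watermark check when handling AddRaftVoterRequest
  # and RemoveRaftVoterRequest
  controller.quorum.unsafe.membership.change.enable=true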
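For approach 3, the rough sequence could be something like the following. The config values, host names and paths are made up, and the step of making the stand-in actually count as the existing voter c2 (with KIP-853 that presumably means presenting c2's directory id, not just its node id) is glossed over; that is exactly the hacky part.

  # fake-c2.properties: minimal controller config reusing c2's node id
  process.roles=controller
  node.id=2
  listeners=CONTROLLER://fake-c2:9093
  controller.listener.names=CONTROLLER
  controller.quorum.bootstrap.servers=c1:9093
  log.dirs=/var/lib/kafka/fake-c2

  # format the stand-in's log dir and start it
  $ bin/kafka-storage.sh format --cluster-id <cluster-id> --config fake-c2.properties --no-initial-controllers
  $ bin/kafka-server-start.sh fake-c2.properties

  # with two live voters the quorum works again: c3 and then the stand-in c2 are
  # demoted with remove-controller calls like the ones above, the stand-in is
  # stopped, and c4 and c5 are added as voters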
Note that even though a three-node controller setup is discussed here, the problem is the same with any number of voters: there is always a limit, and if more voters than that limit are lost, the cluster goes into the state described above.

--
Regards,
Juha