Hi Juha,

Thanks for bringing this up.
I agree that having a way to recover from this "majority of controller nodes
down" situation is valuable, even though it is rare.
In addition to the approaches you listed, maybe we could have a way to
"force" KRaft to honor the "controller.quorum.voters" config instead of
"controller.quorum.bootstrap.servers", even when it is on kraft.version 1.
That way we could update the configs, roll the surviving controller node (c1)
and start the new controller nodes (c4, c5) to recover the quorum.
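
As a rough illustration of what that could look like on c1 (hostnames, ports
and especially the override property are placeholders/assumptions, not
existing Kafka configs):

  # controller.properties on c1 (sketch)
  process.roles=controller
  node.id=1
  controller.listener.names=CONTROLLER
  # Static voter list pointing at the surviving and the new controllers:
  controller.quorum.voters=1@c1:9093,4@c4:9093,5@c5:9093
  # Hypothetical switch (does not exist today) telling KRaft to honor the
  # static list above instead of controller.quorum.bootstrap.servers:
  controller.quorum.force.static.voters=true

c1 would then be rolled with this config and c4/c5 started with the same
voter list.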

Thanks.
Luke

On Thu, Apr 3, 2025 at 4:31 PM Juha Mynttinen
<juha.myntti...@aiven.io.invalid> wrote:

> Consider the following Kafka controller setup. There are three controllers,
> c1, c2 and c3, each on its own hardware. All controllers are voters and
> let’s assume c1 is the leader. Assume new servers can be added as needed to
> replace broken ones, but broken/lost servers cannot be brought back. If a
> new server is needed, it would be c4, and the one after that would be c5.
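>
> For concreteness, a minimal sketch of the dynamic-quorum controller config
> each of c1, c2 and c3 might be running with (hostnames and ports are
> placeholders):
>
>   process.roles=controller
>   # node.id is 1 on c1, 2 on c2, 3 on c3
>   node.id=1
>   controller.listener.names=CONTROLLER
>   listeners=CONTROLLER://c1:9093
>   controller.quorum.bootstrap.servers=c1:9093,c2:9093,c3:9093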
>
> Now, let’s assume the controllers c2 and c3 are irreversibly lost for some
> reason, e.g. because of an unrecoverable hardware failure. In order to
> repair the controller cluster, i.e. to make it again have enough voters (at
> least two), we’d like to remove the voters c2 and c3 (make them observers)
> and add controllers c4 and c5 to the cluster as voters. In order to do
> this, we create those servers and they join the controller cluster as
> observers. However, adding c4 or c5 as a voter is currently not possible in
> this situation, nor is making c2 and c3 observers. The controller cluster
> cannot handle an AddRaftVoterRequest to add c4 or c5, because a majority of
> the voters is needed to commit it and only one voter out of three is
> available. For the same reason, a RemoveRaftVoterRequest cannot be handled
> either.
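>
> To make the failure concrete, an operator would try something like the
> following against c1 (the flags follow the KIP-853 CLI; exact spelling may
> vary by version, and the directory id is a placeholder), and the remove
> request cannot be committed because it would have to be replicated to a
> majority of the three registered voters:
>
>   # Inspect the quorum state via the surviving controller
>   bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
>     describe --status
>
>   # Attempt to drop a lost voter; this cannot complete without a majority
>   bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
>     remove-controller --controller-id 3 \
>     --controller-directory-id <c3-directory-id>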
>
> It can be seen as a limitation of the KRaft tooling that the controller
> cluster cannot be salvaged in this situation. Compared with ZooKeeper-based
> Kafka, it is possible to recover from a situation where a majority of the
> ZooKeeper participants are lost; the ZooKeeper ensemble can be repaired in
> such a case. Similarly, on the broker side, if there’s a topic with
> replication factor three and the non-leader replicas are lost, it’s
> possible to recover without data loss.
>
> In order to do damage control in a disastrous situation, and to get a
> cluster online again, it’d be important to have some mechanism to recover
> from this state. Such a tool would obviously come with a risk: it would
> essentially need to bypass the requirement that a majority of the voters
> acknowledge the change. However, since the controller cluster is already
> broken at this point, it can be argued that a risky tool is worth it in
> this situation. If the tool fixes the controller cluster, it achieves its
> purpose; if it breaks the cluster further, the situation doesn’t get any
> worse.
>
> There are a few possible approaches:
> 1) A tool that is executed manually to fix the situation. It would work so
> that the remaining controller c1 is stopped (all controllers need to be
> stopped), and the tool is then used to forcefully append remove-voter
> records to the metadata log, removing c2 and c3. Then c1 is started. Since
> it’s the only voter, the controller cluster is operational again and c4 and
> c5 can be added as voters.
> 2) Make the leader accept a remove- or add-voter request even when a
> majority of the voters is not available. This could be done either by
> adding a new boolean field “force” to RemoveRaftVoterRequest (the remove
> message) so that the controller bypasses the high-watermark check, or by
> adding a configuration parameter to the controller config that, if enabled,
> makes the controller bypass the HWM check.
> 3) Fake the lost controller. Basically, a tool could create the necessary
> config for a “fake” c2, format it, and attach it to the controller cluster.
> The cluster would then have enough voters, and the tool would remove the
> voters c3 and c2. The fake controller would then be stopped, leaving only
> c1 as a voter, after which c4 and c5 could be added as voters (a rough
> sketch of this flow follows after this list).
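>
> A rough sketch of approach 3 using only existing tooling (command names and
> flags follow the KIP-853 CLI and may not be exact; the cluster id, endpoints
> and directory ids are placeholders):
>
>   # 1) On a scratch machine, build a config that claims to be c2 and format
>   #    it. For the quorum to count it as the voter c2, it presumably has to
>   #    come up with c2's identity (node.id and, under kraft.version 1, c2's
>   #    directory id); reproducing that identity is the hacky part.
>   bin/kafka-storage.sh format --config fake-c2.properties \
>     --cluster-id <cluster-id>
>
>   # 2) Start the fake c2 so that 2 of 3 voters are alive, then remove the
>   #    lost voter c3 and finally the fake c2 itself:
>   bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
>     remove-controller --controller-id 3 \
>     --controller-directory-id <c3-directory-id>
>   bin/kafka-metadata-quorum.sh --bootstrap-controller c1:9093 \
>     remove-controller --controller-id 2 \
>     --controller-directory-id <c2-directory-id>
>
>   # 3) Stop the fake c2. c1 is now the only voter, and c4 and c5 can be
>   #    added as voters with add-controller run from those nodes.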
>
> Approach 2) sounds the most involved, since it would mean changes to Kafka
> itself. Approach 3) feels a little hacky, but requires no changes to Kafka.
> Approach 1) could be the most straightforward one; it’s similar to the
> mechanism of formatting a controller with initial controllers, which
> likewise appends to the metadata log.
>
> Note that even though a three-node controller setup is discussed here, the
> problem is the same with any number of voters: a quorum of N voters only
> tolerates losing a minority of them (at most floor((N - 1) / 2)), and if
> more voters than that are lost, the cluster ends up in the state described
> above.
>
> --
> Regards,
> Juha
>
