Consider the following Kafka controller setup. There are three controllers
c1, c2 and c3, each on its own hardware. All controllers are voters and
let’s assume c1 is the leader. Assume that new servers can be added as
needed to replace broken ones, but broken or lost servers cannot be brought
back. If a new server is needed, it will be c4, and the one after that c5.
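
For concreteness, here is a minimal sketch of how the quorum state (who is
a voter, who is an observer, who leads) could be inspected with the Java
Admin client. The endpoint c1:9093 and the "bootstrap.controllers" setting
are assumptions for this example, not part of the setup above.

import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.QuorumInfo;

public class DescribeQuorum {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Talk to the controllers directly; adjust host:port to the deployment.
        props.put("bootstrap.controllers", "c1:9093");
        try (Admin admin = Admin.create(props)) {
            QuorumInfo quorum = admin.describeMetadataQuorum().quorumInfo().get();
            System.out.println("leader: " + quorum.leaderId());
            quorum.voters().forEach(v -> System.out.println("voter:    " + v.replicaId()));
            quorum.observers().forEach(o -> System.out.println("observer: " + o.replicaId()));
        }
    }
}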

Now, let’s assume the controllers c2 and c3 are irreversibly lost for some
reason, e.g. because of an unrecoverable hardware failure. In order to
repair the controller cluster, i.e. to make it again have enough live
voters to form a majority, we’d like to remove the voters c2 and c3 (make
them observers) and add the controllers c4 and c5 to the cluster as voters.
In order to do this, we create those servers and they join the controller
cluster as observers. However, adding c4 or c5 as a voter is currently not
possible in this situation, nor is demoting c2 and c3 to observers. The
controller cluster cannot handle an AddRaftVoterRequest for c4 or c5,
because a majority of the current voters is required and only one of the
three is available. For the same reason, a RemoveRaftVoterRequest cannot be
handled either.
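
For illustration, this is roughly what such an attempt looks like through
the KIP-853 Admin API in Java. The endpoint, node ids and directory ids
below are placeholders, not values from the setup above; in the failure
state described here, both calls are expected to fail or hang, because the
leader cannot commit the changed voter set to a majority of the current
voters.

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.RaftVoterEndpoint;
import org.apache.kafka.common.Uuid;

public class AttemptVoterChange {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.controllers", "c1:9093"); // assumed endpoint
        try (Admin admin = Admin.create(props)) {
            // Try to remove the lost voter c2 (placeholder directory id).
            admin.removeRaftVoter(2, Uuid.randomUuid()).all().get();

            // Try to add the new observer c4 as a voter.
            admin.addRaftVoter(4, Uuid.randomUuid(),
                Set.of(new RaftVoterEndpoint("CONTROLLER", "c4", 9093))).all().get();
        }
        // With only c1 alive out of the three voters, neither request can be
        // committed, which is the limitation discussed below.
    }
}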

It can be seen as a limitation of the KRaft tooling that the controller
cluster cannot be salvaged in this situation. Compare this with
ZooKeeper-based Kafka: there it is possible to recover from a situation
where a majority of the ZooKeeper participants are lost, by rebuilding the
ZooKeeper ensemble. Similarly, on the broker side, if a topic has
replication factor three and the non-leader replicas are lost, the data is
still intact and it is possible to recover without data loss.

In order to do damage control in a disastrous situation, and to get the
cluster online again, it’d be important to have some mechanism to recover
from this state. Such a tool would obviously come with a risk: it would
essentially need to bypass the requirement that a majority of the voters
acknowledge the change. However, since the controller cluster is already
broken at this point, it can be argued that a risky tool is worth it here.
If the tool fixes the controller cluster, it achieves its purpose. If it
breaks the cluster further, the situation doesn’t get any worse.

There are a few possible approaches:
1) A tool that is executed manually to fix the situation. It would work so
that the sole remaining controller c1 is stopped (all controllers need to
be stopped), and the tool is then used to forcefully append remove-voter
records to the metadata log, removing c2 and c3. Then c1 is started again.
Since it is now the only voter, the controller cluster is operational and
c4 and c5 can be added as voters.
2) Make the leader accept a remove-voter or add-voter request even when a
majority of the voters is not available. This could be done either by
adding a new boolean field “force” to RemoveRaftVoterRequest (the remove
message) that makes the controller bypass the high-watermark check, or by
adding a configuration parameter to the controller config that, when
enabled, makes the controller bypass the high-watermark check.
3) Fake the lost controller. Basically, a tool could create the necessary
config for a “fake” c2, format it and attach it to the controller cluster,
so that the cluster again has enough live voters. The tool would then
remove the voters c3 and c2, after which the fake controller would be
stopped. As a result only c1 would be a voter, and c4 and c5 could then be
added as voters (see the sketch after this list).
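
To make approach 3 (and the final step of approach 1) more concrete, below
is a rough sketch, using the KIP-853 Admin API, of what the voter-set
changes could look like once the fake c2 is accepted as a voter and the
quorum therefore has two of its three voters alive again. The controller
endpoint, node ids and directory ids are placeholders; in practice the
directory ids come from each controller's meta.properties or from the
quorum description.

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.RaftVoterEndpoint;
import org.apache.kafka.common.Uuid;

public class RebuildVoterSet {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.controllers", "c1:9093"); // assumed endpoint
        try (Admin admin = Admin.create(props)) {
            // Placeholder directory ids; the real ones must be looked up.
            Uuid c3Dir = Uuid.randomUuid();
            Uuid fakeC2Dir = Uuid.randomUuid();
            Uuid c4Dir = Uuid.randomUuid();
            Uuid c5Dir = Uuid.randomUuid();

            // While the fake c2 keeps two of the three voters alive,
            // shrink the voter set one change at a time.
            admin.removeRaftVoter(3, c3Dir).all().get();     // drop the lost c3
            admin.removeRaftVoter(2, fakeC2Dir).all().get(); // drop the fake c2

            // c1 is now the sole voter; grow the set back to three voters.
            admin.addRaftVoter(4, c4Dir,
                Set.of(new RaftVoterEndpoint("CONTROLLER", "c4", 9093))).all().get();
            admin.addRaftVoter(5, c5Dir,
                Set.of(new RaftVoterEndpoint("CONTROLLER", "c5", 9093))).all().get();
        }
    }
}

After the second removeRaftVoter call the fake controller process can be
shut down and its data discarded.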

Approach 2 sounds the most involved, since it would mean changes in Kafka
itself. Approach 3 feels a little hacky, but requires no changes in Kafka.
Approach 1 could be the most straightforward one. It is similar to the
mechanism of formatting a controller with initial controllers, which
likewise appends to the metadata log.

Note that even though a three-node controller setup is discussed here, the
problem is the same with any number of voters. There is always a limit to
how many voters can be lost; if more voters than that are lost, the cluster
ends up in the state described above.
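
To make that limit concrete: with n voters a change must be acknowledged by
a majority, i.e. floor(n/2) + 1 voters, so at most n minus that many voters
may be lost. A purely illustrative helper:

public class QuorumMath {
    // With n voters, a change must be acknowledged by a majority of them.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // Number of voters that can be lost while a majority is still reachable.
    static int tolerableLosses(int n) {
        return n - majority(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 5, 7}) {
            System.out.printf("voters=%d majority=%d tolerable losses=%d%n",
                n, majority(n), tolerableLosses(n));
        }
    }
}

For three voters this gives a majority of two and a tolerance of one lost
voter, which is why losing both c2 and c3 leaves the cluster unrecoverable
with the current tooling.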

--
Regards,
Juha
