Hi Juha,

Thanks for the discussion.

On Thu, Apr 3, 2025 at 4:08 AM Juha Mynttinen
<juha.myntti...@aiven.io.invalid> wrote:
> Consider the following Kafka controller setup. There are three controllers
> c1, c2 and c3, each on its own hardware. All controllers are voters and
> let’s assume c1 is the leader. Assume new servers can be added as needed to
> replace a broken one, but broken/lost servers cannot be brought back. If a
> new server is needed, it’d be c4 and then the one after that would be c5.

Why did you stop at 2 lost disks? What if the user lost all of the
controllers? Why do you assume that c1 has the committed data? What
happens to the cluster if c1 doesn't have the committed data?

If a user wants to tolerate 2 disk failures and stay consistent, they
need 5 controllers.
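
For concreteness, here is a rough sketch of what a 5-voter static
quorum could look like in each controller's server.properties
(hostnames c1..c5 and port 9093 are just placeholders for this
example):

    # on c1; node.id and listeners differ per controller
    process.roles=controller
    node.id=1
    listeners=CONTROLLER://c1:9093
    controller.listener.names=CONTROLLER
    controller.quorum.voters=1@c1:9093,2@c2:9093,3@c3:9093,4@c4:9093,5@c5:9093

With 5 voters the quorum keeps making progress, and stays consistent,
after losing any 2 of them.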

> It can be seen as a limitation in the kraft tooling that the controller
> cluster cannot be salvaged in this situation. If we compare with the
> ZooKeeper based Kafka, it's possible to recover from a situation where a
> majority of the ZK participants are lost. The ZooKeeper cluster can be
> fixed in a situation like that. Similarly, in Kafka brokers, if there’s a
> topic with replication factor three and the non-leaders are lost, there’s
> no data loss and it’s possible to recover without data loss.

It is not true that there won't be data loss. For ZK, this is easily
reproducible. I sent an email to the ZK mailing list reproducing this
data loss: https://lists.apache.org/thread/nrmbjj1wm82pnr1jr8lkk294nodh9611

For Kafka ISR partitions, we designed KIP-966 to address some of these issues.

> In order to do damage control in a disastrous situation, and to get a
> cluster online again, it’d be important to have some mechanism to recover
> from this situation. Such a tool would obviously come with a risk. The tool
> would essentially need to bypass the mechanism of requiring the majority of
> voters, somehow. However, since the controller cluster is already broken at
> this point, it can be argued that a risky tool in this situation is worth
> it. If the tool fixes the controller cluster, it achieves its purpose. If
> it breaks the cluster more, the situation doesn’t get any worse.

Your solutions assume that c1 has the committed data. This is not
true. The majority of the voters are needed to determine the committed
data. Any data loss in the cluster metadata partition can result in
catastrophic data loss and unavailability of the entire Kafka cluster.
The cluster metadata partition is the most important partition in a
Kafka cluster. If the user wants to tolerate 2 disk failures, they
should provision 5 controllers. We should not encourage users to
under-provision the controller cluster and hope that their cluster
will remain available and consistent.
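
To make that last point concrete, here is a minimal, hypothetical
sketch (not Kafka's actual implementation) of the Raft commit rule the
controllers rely on: an offset is committed only once a majority of
voters has replicated it, so a single surviving replica cannot tell on
its own which of its records were committed.

    import java.util.Arrays;

    public class CommitRuleSketch {
        // Committed offset = largest offset replicated on a majority of
        // voters, i.e. the (N/2 + 1)-th largest log end offset.
        static long committedOffset(long[] voterLogEndOffsets) {
            long[] sorted = voterLogEndOffsets.clone();
            Arrays.sort(sorted);                       // ascending
            int majority = sorted.length / 2 + 1;      // 2 of 3, 3 of 5, ...
            return sorted[sorted.length - majority];
        }

        public static void main(String[] args) {
            // c1=100, c2=90, c3=80: offsets up to 90 are committed,
            // 91..100 are not.
            System.out.println(committedOffset(new long[] {100, 90, 80})); // 90
            // Lose c2 and c3 and c1 alone cannot recompute this: it cannot
            // tell whether 91..100 were ever committed, and a lagging
            // survivor could even be missing records that the lost
            // majority had committed.
        }
    }

Forcing a lone survivor to act as the quorum throws that information
away, which is why a recovery tool in that state can silently lose or
resurrect metadata records.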

Thanks,
-- 
-José
