Hi José,

Thanks for the revisions. I'm really excited to see this going forward for 
Kafka 3.8.

One important piece of feedback that a lot of people have given me is that they 
really want auto-formatting in KRaft mode. In other words, they want to start 
up a process and just have it do the right thing, without having to run a 
special command like "kafka-storage.sh format" to set up the storage 
directories.

One reason why users want auto-formatting is that ZK mode had it. Of course, ZK 
mode's auto-formatting is not safe. It can lead to data loss since it breaks 
the replication invariant that brokers never lose data after ACKing it. But 
hardly any users are aware of this. All they know is that they want things to 
work like they did in ZK mode.

Another reason why users want auto-formatting is that it makes it easier to 
integrate Kafka into systems like Kubernetes, Ansible, Puppet, and so forth. 
These systems generally let the administrator set a "desired state." They then 
take a look at the "actual state" and manipulate it until it matches the 
desired state.

These process management systems tend to be oriented around spinning up a new 
process or dropping in a new config file. They don't like to make RPCs or 
invoke command-line tools. Of course it's POSSIBLE to make them do this, but it 
feels awkward, and it's extra work. Even worse, it's work that the integrators 
tend to get wrong. Most of them don't understand why naively re-formatting a 
controller storage directory every time it looks empty is a bad idea.

In other words, if we don't implement auto-formatting in Kafka, the integrators 
will re-invent it outside Kafka. And they'll almost certainly do it incorrectly 
in a way that may cause metadata loss. So I really do think we should get this 
right in Kafka itself.

Let's run through a few scenarios here:

1. Restarting a controller with an empty storage directory

The controller can contact the quorum to get the cluster ID and current MV. If 
the MV doesn't support quorum reconfiguration, it can bail out. Otherwise, it 
can re-format itself with a random directory ID. It can then remove (ID, 
OLD_DIR_ID) from the quorum, and add (ID, NEW_DIR_ID) to the quorum.

I think this can all be done automatically without user intervention. If the 
remove / add steps fail (because the quorum is down, for example), then of 
course we can just log an exception and bail out.
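
To make the sequence concrete, here is a rough sketch in Java. None of these 
types exist in Kafka: Quorum, StorageDir, and autoFormat are hypothetical 
stand-ins, java.util.UUID stands in for Kafka's directory ID type, and the 
quorum is just an in-memory map. The sketch also assumes the quorum is 
reachable; a real implementation would log an exception and bail out if the 
remove / add steps failed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch; none of these types exist in Kafka. */
public class ControllerAutoFormatSketch {

    /** In-memory stand-in for the controller quorum's voter set. */
    public static class Quorum {
        private final String clusterId;
        private final boolean supportsReconfig; // would be derived from the current MV
        private final Map<Integer, String> voters = new HashMap<>(); // node ID -> directory ID

        public Quorum(String clusterId, boolean supportsReconfig) {
            this.clusterId = clusterId;
            this.supportsReconfig = supportsReconfig;
        }

        public String clusterId() { return clusterId; }
        public boolean supportsReconfig() { return supportsReconfig; }
        public String directoryIdOf(int nodeId) { return voters.get(nodeId); }
        public void addVoter(int nodeId, String dirId) { voters.put(nodeId, dirId); }
        public void removeVoter(int nodeId, String dirId) { voters.remove(nodeId, dirId); }
    }

    /** Stand-in for a storage directory; a null cluster ID means "empty". */
    public static class StorageDir {
        private String clusterId;
        private String directoryId;

        public boolean isEmpty() { return clusterId == null; }
        public String clusterId() { return clusterId; }
        public String directoryId() { return directoryId; }

        public void format(String clusterId, String directoryId) {
            this.clusterId = clusterId;
            this.directoryId = directoryId;
        }
    }

    /** Returns true if the controller may start, false if it must bail out. */
    public static boolean autoFormat(int nodeId, StorageDir dir, Quorum quorum) {
        if (!dir.isEmpty()) {
            return true; // already formatted; nothing to do
        }
        if (!quorum.supportsReconfig()) {
            return false; // MV too old for quorum reconfiguration: bail out
        }
        String oldDirId = quorum.directoryIdOf(nodeId);
        String newDirId = UUID.randomUUID().toString(); // random directory ID
        dir.format(quorum.clusterId(), newDirId);
        if (oldDirId != null) {
            quorum.removeVoter(nodeId, oldDirId); // remove (ID, OLD_DIR_ID)
        }
        quorum.addVoter(nodeId, newDirId); // add (ID, NEW_DIR_ID)
        return true;
    }
}
```

The important property is that an already-formatted directory is never 
touched, and that an old MV leaves the empty directory empty rather than 
formatting it halfway.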

2. Restarting a broker with an empty storage directory

The broker can contact the quorum to get the cluster ID and current MV. If the 
MV doesn't support directory IDs, we can bail out. Otherwise, it can reformat 
itself with a random directory ID and start up. Its old replicas will be 
correctly treated as gone due to the JBOD logic.
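
As a simplified illustration of why the old replicas end up treated as gone: 
each replica is recorded against a directory ID, and once the broker comes 
back with a new random directory ID, the old ID no longer matches anything 
the broker has. The sketch below is a toy model of that comparison, not the 
actual controller code; the class name, method name, and string-based IDs are 
all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical toy model of the directory ID comparison; not Kafka code. */
public class LostReplicaSketch {

    /**
     * Returns the partitions whose recorded directory ID is not among the
     * directory IDs the broker currently has, in sorted partition order.
     */
    public static List<String> lostReplicas(Map<String, String> assignedDirByPartition,
                                            Set<String> currentDirIds) {
        List<String> lost = new ArrayList<>();
        // TreeMap gives a deterministic (sorted) iteration order.
        for (Map.Entry<String, String> e : new TreeMap<>(assignedDirByPartition).entrySet()) {
            if (!currentDirIds.contains(e.getValue())) {
                lost.add(e.getKey());
            }
        }
        return lost;
    }
}
```

With every replica recorded against the old directory ID and the broker now 
holding only the new one, every partition comes back as lost, which is the 
behavior we want after a re-format.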

3. Restarting a controller with a damaged metadata directory

I think we can just bail out if the storage directory doesn't look right. Empty 
is OK. Damaged is not.
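
One possible way to draw the line between "empty" and "damaged" is sketched 
below. The exact criteria are my assumption, not something the KIP specifies: 
a directory with no entries is EMPTY, a directory with a meta.properties file 
containing cluster.id is FORMATTED, and anything else is DAMAGED. The class 
and enum names are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical sketch of an empty / formatted / damaged check. */
public class StorageDirCheck {

    public enum State { EMPTY, FORMATTED, DAMAGED }

    public static State classify(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return State.DAMAGED; // assumption: a missing directory is not auto-created
        }
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            if (!entries.iterator().hasNext()) {
                return State.EMPTY; // safe candidate for auto-formatting
            }
        }
        Path meta = dir.resolve("meta.properties");
        if (!Files.isRegularFile(meta)) {
            return State.DAMAGED; // has contents but no metadata file
        }
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(meta)) {
            props.load(in);
        }
        // Assumption: a usable directory must at least record its cluster ID.
        return props.getProperty("cluster.id") != null ? State.FORMATTED : State.DAMAGED;
    }
}
```

Only the EMPTY result would feed into auto-formatting. DAMAGED means bail out 
and let a human look at it.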

4. Bringing up a totally new cluster

I think we need at least one controller node to be formatted, so that we can 
decide what metadata version to use. Perhaps we should even require a quorum of 
controller nodes to be explicitly formatted (that is, in practice, people just 
format them all).

5. Removing a controller

I think in this case, we can have an explicit command. This is similar to the 
broker case, where we have the "kafka-cluster.sh unregister" command.

best,
Colin


On Mon, Jan 8, 2024, at 10:13, José Armando García Sancio wrote:
> Hi all,
>
> KIP-853: KRaft Controller Membership Changes is ready for another
> round of discussion.
>
> There was a previous discussion thread at
> https://lists.apache.org/thread/zb5l1fsqw9vj25zkmtnrk6xm7q3dkm1v
>
> I have changed the KIP quite a bit since that discussion. The core
> idea is still the same. I changed some of the details to be consistent
> with some of the protocol changes to Kafka since the original KIP. I
> also added a section that better describes the feature's UX.
>
> KIP: https://cwiki.apache.org/confluence/x/nyH1D
>
> Thanks. Your feedback is greatly appreciated!
> -- 
> -José
