Hi everyone,

Earlier today Colin, Ron, Proven and I had a chat about this work. We discussed several aspects, which I'd like to share here.

## A new reserved UUID

We'll reserve a third UUID to indicate an unspecified dir, but one that is known to have been selected. The default UNKNOWN_DIR (ZERO_UUID) is used for new replicas, which may or may not have been placed in some directory. In contrast, this new UUID can disambiguate transition scenarios where we previously did not have any directory assignment information, e.g. single-logdir KRaft into JBOD KRaft, or ZK JBOD to KRaft JBOD mode.

Unless anyone has a better suggestion for naming or any objection, I'll update the KIP to designate the following reserved UUIDs:

* UNKNOWN_DIR - new Uuid(0L, 0L)
* OFFLINE_DIR - new Uuid(0L, 1L)
* SELECTED_DIR - new Uuid(0L, 2L) <-- new

When transitioning to the directory assignment feature without any previous directory assignment state, the controller can assign all existing partitions in the broker to SELECTED_DIR, to distinguish them from new partitions.

When a log directory is offline, it is important that the broker does not replace the offline partitions in the remaining online directories. So, if some directory is offline and some partition is missing its directory assignment, it is important to distinguish new partitions from old ones. Old partitions may already exist in the offline dirs, but new partitions can safely be placed in the available (online) dirs.

In ZK mode, the `isNew` flag in the LeaderAndIsr request serves this purpose. For KRaft, this KIP proposes keeping the broker in fenced mode until all initial assignments are known. This new UUID serves as an additional signal and covers a gap in the ZK->KRaft migration stage, where ZK brokers do not support fencing state.

SELECTED_DIR is always a temporary transition state, which is due to be resolved by the broker. When catching up with metadata, for any partitions associated with SELECTED_DIR:

* If the partition is found in some directory, AssignReplicasToDirs is used to correct the assignment to the actual dir.
* If the partition is not found, and no directory is offline, a directory is selected, and AssignReplicasToDirs is used to correct the assignment to the chosen directory.
* If the partition is not found and some directory is offline, the broker assumes that the partition must be in one of the offline dirs, and AssignReplicasToDirs is used to converge the state to OFFLINE_DIR.

This contrasts with UNKNOWN_DIR, for which brokers always select a directory, regardless of the online/offline state of any log dirs.

## Reserving a pool of non-designated UUIDs for future use

It's probably a good idea to reserve a bunch of UUIDs for future use in directory assignments. The decision is essentially costless right now, and it may prove to be useful in the future. The first 100 UUIDs (including the 3 already designated above) will be reserved for future use.

## Dir failures during ZK->KRaft migration

The ZK compatibility functionality in KRaft controllers does not currently implement dir failure handling. So we need to select a strategy to deal with dir failures during the migration.

We discussed different options:

a) If a migrating ZK mode broker encounters a directory failure, it will shut down. While this degrades failure handling during the temporary migration window, it is a useful simplification. This is an attractive option, and it isn't ruled out, but it is also not clear that it is necessary at this point.
b) Extend the ZK Controller compatibility functionality in KRaft controllers to watch the /log_dir_event_notification znode, and rely on LeaderAndIsr requests for dir failure handling, the same as ZK Controllers do. As there is a desire to limit the scope of the compatibility functionality, this option looks less attractive.
c) Extend the ZK mode broker functionality during the migration to both send AssignReplicasToDirs, populating the assignment state earlier, and propagate dir failures in the heartbeat, in the same way the KIP proposes regular KRaft brokers do. There are several phases in the ZK->KRaft migration, and part of the process already requires ZK mode brokers to send BrokerRegistration and BrokerHeartbeat requests, so this doesn't look like a big change, and it seems to be the preferred option.

If you have any thoughts or questions on any of these matters, please let me know.

Best,

--
Igor