Hi Jun,

Thank you for your comments and questions.

30. Thank you for pointing this out. The isNew flag is not available in KRaft mode. The broker can instead rely on the metadata records: a replica can be considered new if, and only if, its assigned logdir is Uuid.ZERO. Being able to determine whether a replica "isNew" is important: it prevents the remaining online logdirs from filling up when some logdirs go offline, which would happen if the broker re-created replicas that already exist in the offline logdirs. So the broker will refuse to create logs that are not new if there are any offline logdirs. If a logdir is removed from configuration, the controller will detect this change upon broker registration and reset all partitions assigned to the removed logdir to Uuid.ZERO. In this case, it is OK for the broker to assume that the partitions are new because they do not exist in any _configured_ online or offline logdir, and the intended behavior is to re-create them in one of the online logdirs anyway. I have updated the KIP to make it clear that broker decisions are based on the metadata, and not on this flag.

31. I don't think I understand the question. Why do we need to assign the same UUID? A logdir may be replaced with a new disk by replacing its configured path with the new disk's mount path under the `log.dirs` property. While the broker was offline, the operator might or might not have copied the contents of the old logdir to the new disk.

If the contents were copied over, then so was the logdir's meta.properties, along with the UUID, in which case no change is necessary. The broker will load all configured logdir paths and all existing meta.properties files, and verify that the full set of UUIDs is still consistent across all meta.properties files. Neither the broker nor the controller will know that something has changed, and neither of them needs to. All partition assignments are still correct. The mapping of UUID to logdir is determined by the meta.properties under that same logdir.

If the contents were not copied, then this is assumed to be a new and empty logdir. It should get a different UUID. When the broker loads all meta.properties files it will find that one is missing for the new disk and create it, generating a new UUID. It will also update the full set of UUIDs listed in any other meta.properties files. On the broker registration request the controller will notice a new UUID being registered, but also notice a UUID missing. Any topic partitions assigned to the now missing logdir UUID will be updated to relate to Uuid.ZERO, so that the broker can place them in the most suitable logdir, which is likely to be the new and empty one.

32. You are correct, the heartbeat request should convey the failure and the broker shouldn't need to send an AssignReplicasToDirs request. The bit preceding that quote is important: "If the partition is assigned to an online log directory". In this case the broker finds that the metadata indicates that a non-new replica is assigned to an online logdir, but this replica cannot actually be found in any online logdir. So we want to tell the controller that the metadata is wrong, and that the replica is actually offline. This is a defensive design option. In a scenario where for some reason the broker can see that the metadata is incorrect about the logdir assignment of a replica that existed in the failed logdir, it is better to correct and recover than to allow the problem to persist. Ignoring the error could mean that the partition stays offline. If the controller is only told about the UUID of the failed logdir, it won't be able to determine that a leadership and ISR update is required for any replica with an incorrect logdir assignment. An alternative, when facing this unlikely failure scenario, would be for the broker to error and exit, which would be more disruptive.

33. Correct. I should've made that clear. Updated.

34. No. It shouldn't be a large request, and it should only happen rarely. This relates to point 32.
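
As a rough illustration of the defensive check in point 32, here is a minimal Python sketch of how a broker might find replicas whose metadata assignment disagrees with what is on disk. The helper name, the argument shapes, and the `UUID_ZERO` stand-in for Uuid.ZERO are hypothetical and only for illustration; they are not taken from the KIP or from the Kafka codebase.

```python
import uuid

# Stand-in for Uuid.ZERO: "no logdir assigned yet", i.e. a new replica.
UUID_ZERO = uuid.UUID(int=0)


def replicas_to_report_offline(assignments, online_dirs, replicas_on_disk):
    """Return partitions the broker should report as offline (hypothetical helper).

    assignments:      partition -> logdir UUID, as recorded in cluster metadata
    online_dirs:      set of logdir UUIDs currently online on this broker
    replicas_on_disk: online logdir UUID -> set of partitions found in it
    """
    # Every replica that can actually be found in some online logdir.
    found = set()
    for dir_id in online_dirs:
        found |= replicas_on_disk.get(dir_id, set())

    mismatched = []
    for partition, dir_id in sorted(assignments.items()):
        if dir_id == UUID_ZERO:
            continue  # new replica: nothing is expected on disk yet
        if dir_id in online_dirs and partition not in found:
            # Metadata says the replica is in an online logdir, but no online
            # logdir holds it, so it must have lived in a failed logdir.
            mismatched.append(partition)
    return mismatched
```

Replicas assigned in metadata to an offline logdir are deliberately left out of the result, since the controller can already infer those from the failed logdir UUID in the heartbeat request.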
When a logdir fails, that failure is communicated to the controller by indicating the logdir UUID in the heartbeat request. The controller can determine that _the partitions assigned to that logdir UUID_ are now offline. But if there are any partitions that were in that logdir and do not have that same logdir UUID assigned to them in the cluster metadata, then the broker needs to signal that these are also offline, as the controller will not be able to determine that without the correct assignment. We expect each broker to proactively instruct the controller to keep the metadata correct about the logdir assignment for each replica, so situations where the metadata is wrong should be rare, and when they happen only a small number of replicas should be affected. Hence this should be both a small and rare request.

35. Hmm, I could not find the string "AlterReplicaDirRequest" in the source:
https://github.com/apache/kafka/search?q=AlterReplicaDirRequest
I'm referring to this API key:
clients/src/main/java/org/apache/kafka/common/protocol/ApiKeys.java#L78

36. The risk is that if the broker is unfenced while the controller still has an incorrect view of the logdir assignment, it may assign leadership to the broker for some partition which is incorrectly assigned in the metadata. If that happens, when a logdir fails, the heartbeat request indicating the failed logdir UUID will not cause the controller to take action and reassign leadership, and we may end up with an unavailable partition. The controller will assume that partition leadership is being performed correctly and will not take any action, as long as it thinks that the broker is alive and that the partition is assigned to an online logdir. It could be interesting to find a more general solution to this issue, as that would eliminate a wider range of failures in Kafka. But I don't currently have any suggestions there. The requests sent while still fenced aim to correct the logdir assignment for all of the partitions in the broker.
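
To make the risk in points 34 and 36 concrete, here is a minimal Python sketch of what the controller can and cannot infer from a failed logdir UUID alone. The function names and the dictionary shapes are hypothetical, chosen only for illustration; `actual_location` in particular represents knowledge that only the broker has.

```python
import uuid


def partitions_marked_offline(assignments, failed_dir):
    """Partitions the controller can infer are offline from the failed logdir
    UUID alone (hypothetical helper).

    assignments: partition -> logdir UUID, as recorded in cluster metadata
    """
    return {p for p, d in assignments.items() if d == failed_dir}


def partitions_missed(assignments, actual_location, failed_dir):
    """Partitions that really lived in the failed logdir but are assigned to a
    different logdir in metadata, so the controller would not act on them.

    actual_location: partition -> logdir UUID where the replica really is
    """
    really_offline = {p for p, d in actual_location.items() if d == failed_dir}
    return really_offline - partitions_marked_offline(assignments, failed_dir)
```

If the metadata is kept correct, `partitions_missed` is empty and the heartbeat alone is enough. Any partition it returns is one for which leadership would not be reassigned unless the broker reports it separately.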
One of the reasons that the assignment may be incorrect is that an operator might have relocated some partitions to a different logdir while the broker was offline. This is a currently supported feature, albeit probably not a widely known one. Why is it important that there should be no other requests while the broker is still fenced?

37. I had originally proposed that, if there is a single logdir configured, the controller could assume that all the existing replicas are assigned to the only logdir indicated in the broker registration request, provided there isn't a previous registration that indicates any logdir UUIDs. This would avoid the broker sending AssignReplicasToDirs to populate the initial assignments. If the broker is registering with a single logdir, but the previous broker registration indicates some logdir UUID, then the controller cannot make this simplification, as the logdir could be a new one, or a previous second logdir might have been removed from configuration and the current assignment is unclear. We could maybe say that whenever there is a single logdir, the broker will not bother about the assignment in general. The downside of this is that there might be more work to do later (more partition assignments to correct) when a second logdir is configured. I think that may be more disruptive. It is preferable to spread out the effort to maintain a correct assignment in the metadata. Tom Bentley raised this in point 4, and since it's not a strictly necessary optimisation I updated the KIP to remove it back then. Do you think we should keep the optimisation?

38. Correct. I've updated the KIP.

39. I think I forgot to update this after I changed the proposal to say the meta.properties files are automatically updated. I have updated this section to clarify that the broker will automatically update the file if possible. A new logdir can be added while there are other, offline logdirs, as long as the set of UUIDs in `directory.ids` is expanded to include the new one. So the number of UUIDs in `directory.ids` and the number of paths in `log.dirs` should always match. It is important that the broker be able to distinguish between UUIDs for logdirs that are offline and UUIDs for logdirs that were removed from configuration. If the broker starts up configured with two logdirs, where each logdir contains a meta.properties file indicating three different UUIDs under `directory.ids`, but only one of the configured logdirs is accessible (online), then it is not possible for the broker to automatically update the file, as it won't be able to distinguish between the UUID for the offline logdir and the UUID for the removed logdir. In this case the broker should fail to start. The operator can bring the offline logdir back up, restore the `log.dirs` configuration, or manually update meta.properties.

40. Indeed. What I meant to say here is that the controller should not accept broker registration requests that do not indicate any online logdir UUIDs. We don't expect the broker would send these anyway. During the upgrade from non-JBOD we could allow brokers to register without specifying any logdir UUID (online or offline). But thinking about this again now, I don't think it will be necessary; this idea was from before the metadata.version feature flag change was introduced. BrokerRegistrationRequest should only include logdir UUIDs after all servers are upgraded, and by then all logdirs will have a UUID assigned. I've updated the KIP to clarify that BrokerRegistrationRequest must always include some online logdir UUID.

Best,

--
Igor