Hi Igor. Thanks for all your work here. Before I can vote, I have the following questions/comments about the KIP:
> When multiple log.dirs are configured, a new property will be included > in meta.properties — directory.id — which will identify each log directory > with a UUID. The UUID is randomly generated for each log directory. Since every log directory gets a random UUID assigned, even if just one log dir is configured in the Broker, I think the above should not be qualified with the phrase “When multiple log.dirs are configured". Similarly: > When multiple log.dirs are configured, a new property — directory.id — will be > expected in the meta.properties file in each log directory configured under > log.dirs. I think the qualification "When multiple log.dirs are configured" should be removed. Colin had mentioned that in PartitionRecord he would prefer a new (tagged) array for replica UUIDs, rather than creating the ReplicaAssignment array. While adding to an RPC is arguably less intrusive than replacing, I am inclined to disagree with Colin's suggestion for the following reason. We have agreed to not specify the log dir uuid in the case where a replica only has one registered log dir. If we were to add a new tagged array, it would become necessary to specify all replica log dir uuids for all replicas in the PartitionRecord if any one such replica had more than one log dir configured. By creating the ReplicaAssignment array we can just specify the uuid -- or not -- for each replica based on whether that replica itself has multiple registered log dirs or not. The documentation of the new config “log.dir.failure.timeout.ms" should indicate that shutdown will only occur if the replica is the leader for at least one replica in the failed log directory. The text in the KIP says this later on, so I think the config documentation should have it as well. > a logical update to all partitions in that broker takes place, > assigning the replica's directory to the single directory previously > registered – > i.e. it is assumed that all replicas are still in the same directory, and > this transition > to JBOD avoids creating partition change records. I would like to confirm that upon writing a snapshot each PartitionRecord will now contain an explicit directory for such replicas since there is no longer just 1 possibility. It might be good to state it in the KIP for clarity assuming this statement is correct. The “Metadata caching” section states that replicas will also be considered offline if the replica references a log directory UUID that is not present in the hosting Broker's latest registration EXCEPT FOR the case when there is just one registered directory and the presence of any offline log dirs is not indicated (i.e. OfflineLogDirs is false). I wonder about the corner case where a broker that previously had multiple log dirs is restarted with a new config that specifies just a single log directory. What would happen here? If the broker were not the leader then perhaps it would replicate the data into the single log directory. What would happen if it were the leader of a partition that had been marked as offline? Would we have data loss even if other replicas still had data? Ron On Tue, Sep 5, 2023 at 7:46 AM ziming deng <dengziming1...@gmail.com> wrote: > > Hi, Igor > I’m +1(binding) for this, looking forward the PR. > > -- > Best, > Ziming > > > On Jul 26, 2023, at 01:13, Igor Soarez <soa...@apple.com.INVALID> wrote: > > > > Hi everyone, > > > > Following a face-to-face discussion with Ron and Colin, > > I have just made further improvements to this KIP: > > > > > > 1. Every log directory gets a random UUID assigned, even if just one > > log dir is configured in the Broker. > > > > 2. All online log directories are registered, even if just one if > > configured. > > > > 3. Partition-to-directory assignments are only performed if more than > > one log directory is configured/registered. > > > > 4. A successful reply from the Controller to a AssignReplicasToDirsRequest > > is taken as a guarantee that the metadata changes are > > successfully persisted. > > > > 5. Replica assignments that refer log directories pending a failure > > notification are prioritized to guarantee the Controller and Broker > > agree on the assignments before acting on the failure notification. > > > > 6. The transition from one log directory to multiple log directories > > relies on a logical update to efficiently update directory assignments > > to the previously registered single log directory when that's possible. > > > > I have also introduced a configuration for the maximum time the broker > > will keep trying to send a log directory notification before shutting down. > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft > > > > Best, > > > > -- > > Igor > > >