Thanks for posting these notes, Igor. I think we should definitely distinguish between these two cases:
1. This replica hasn't been assigned to a storage directory.
2. We don't know which storage directory this replica was assigned to in the past.

I would call #1 UNASSIGNED. I see that you called it UNKNOWN in your message. This is normal: it just means that the controller created this replica, but the broker hasn't placed it anywhere yet.

I would call #2 LOST. It was assigned in the past, but we don't know where. I see that you called this OFFLINE. This is not really normal... it should happen only when we're migrating from ZK mode to KRaft mode, or going from an older KRaft release with multiple directories to a post-JBOD release.

The reason for distinguishing between these two cases is that we don't want to recreate LOST replicas. They are probably already on the broker in some directory; we just don't know which one. This is different from UNASSIGNED, which should trigger the broker to place the replica somewhere and basically treat it like a new replica.

As for the third state -- I'm not sure why SELECTED_DIR needs to exist. I think we need a general mechanism for checking that replicas are in the directories we expect, and for sending an RPC to the controller if they are not. A mechanism like this will automatically get rid of the LOST replicas as part of normal operation -- nothing special required. There's a rough sketch of what I mean below.

I think the part you were trying to address with the "can't have failures during migration" stipulation is that if we do things the obvious way, we will have LOST replicas for a while immediately after migration, and we won't know whether those replicas are actually there or not. There are a bunch of ways to handle this... we'll have to think about which is easiest. Writing the data to ZK while still in hybrid mode might even be on the table as an option.
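To make that concrete, here is roughly the shape I have in mind. Everything in this sketch is a made-up name except the AssignReplicasToDirs RPC -- it's meant to show the idea, not to propose actual interfaces:

import java.util.HashMap;
import java.util.Map;

// Periodic broker-side check: compare the directory the metadata says
// each replica is in against the directory the broker actually found it
// in, and queue a correction for every mismatch. Note this cleans up
// LOST replicas (stale or missing assignments) with no special casing.
class DirectoryReconciler {
    // partition -> directory id per the controller's metadata
    // (null when the controller has no assignment recorded)
    final Map<String, String> assignedDirs = new HashMap<>();
    // partition -> directory id where the broker actually found the replica
    final Map<String, String> actualDirs = new HashMap<>();

    // Returns the corrections that would be batched into an
    // AssignReplicasToDirs request to the controller.
    Map<String, String> findCorrections() {
        Map<String, String> corrections = new HashMap<>();
        for (Map.Entry<String, String> entry : actualDirs.entrySet()) {
            String expected = assignedDirs.get(entry.getKey());
            if (!entry.getValue().equals(expected)) {
                corrections.put(entry.getKey(), entry.getValue());
            }
        }
        return corrections;
    }
}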
best,
Colin

On Thu, Oct 5, 2023, at 07:03, David Arthur wrote:
> Hey, just chiming in regarding the ZK migration piece.
>
> Generally speaking, one of the design goals of the migration was to have
> minimal changes on the ZK brokers and especially the ZK controller. Since
> ZK mode is our safe/well-known fallback mode, we wanted to reduce the
> chances of introducing bugs there. Following that logic, I'd prefer option
> (a) since it does not involve changing any migration code or (much) ZK
> broker code. Disk failures should be pretty rare, so this seems like a
> reasonable option.
>
>> a) If a migrating ZK mode broker encounters a directory failure,
>> it will shut down. While this degrades failure handling during
>> the temporary migration window, it is a useful simplification.
>> This is an attractive option, and it isn't ruled out, but it
>> is also not clear that it is necessary at this point.
>
> If a ZK broker experiences a disk failure before the metadata is migrated,
> it will prevent the migration from happening. If the metadata is already
> migrated, then you simply have an offline broker.
>
> If an operator wants to minimize the time window of the migration, they
> can simply do the requisite rolling restarts one after the other:
>
> 1) Provision KRaft controllers
> 2) Configure ZK brokers for migration and do a rolling restart (the
> migration happens automatically here)
> 3) Configure the brokers as KRaft and do a rolling restart
>
> This reduces the time window to essentially the time it takes to do two
> rolling restarts of the cluster. Once the brokers are in KRaft mode, they
> won't have the "shutdown if log dir fails" behavior.
>
> One question with this approach is how the KRaft controller learns about
> the multiple log directories after the broker is restarted in KRaft mode.
> If I understand the design correctly, this would be similar to a single
> directory KRaft broker being reconfigured as a multiple directory broker.
> That is, the broker sees that the PartitionRecords are missing the
> directory assignments and then sends AssignReplicasToDirs to the
> controller.
>
> Thanks!
> David
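>
> P.S. In code terms, I'd expect that broker-side check to look roughly
> like the sketch below. All of these names are hypothetical except
> AssignReplicasToDirs; the real interfaces may well differ.
>
> import java.util.HashMap;
> import java.util.Map;
>
> // Hypothetical sketch: after the restarted broker catches up on
> // metadata, any local replica whose PartitionRecord carries no
> // directory assignment is reported back to the controller.
> class PostMigrationDirReporter {
>     // partition -> directory where the broker actually found the replica
>     final Map<String, String> localReplicaDirs = new HashMap<>();
>     // partition -> directory id from the PartitionRecord (null if absent)
>     final Map<String, String> recordedDirs = new HashMap<>();
>
>     Map<String, String> replicasToReport() {
>         Map<String, String> toReport = new HashMap<>();
>         for (Map.Entry<String, String> e : localReplicaDirs.entrySet()) {
>             if (recordedDirs.get(e.getKey()) == null) {  // no assignment
>                 toReport.put(e.getKey(), e.getValue());
>             }
>         }
>         return toReport;  // batched into an AssignReplicasToDirs request
>     }
> }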