Hey, just chiming in regarding the ZK migration piece.

Generally speaking, one of the design goals of the migration was to have
minimal changes on the ZK brokers and especially the ZK controller. Since
ZK mode is our safe/well-known fallback mode, we wanted to reduce the
chances of introducing bugs there. Following that logic, I'd prefer option
(a) since it does not involve changing any migration code or (much) ZK
broker code. Disk failures should be pretty rare, so this seems like a
reasonable option.

> a) If a migrating ZK mode broker encounters a directory failure,
>   it will shutdown. While this degrades failure handling during,
>   the temporary migration window, it is a useful simplification.
>   This is an attractive option, and it isn't ruled out, but it
>   is also not clear that it is necessary at this point.


If a ZK broker experiences a disk failure before the metadata is migrated,
it will prevent the migration from happening. If the metadata is already
migrated, then you simply have an offline broker.

If an operator wants to minimize the time window of the migration, they can
simply do the requisite rolling restarts one after the other.

1) Provision KRaft controllers
2) Configure ZK brokers for migration and do rolling restart (migration
happens automatically here)
3) Configure ZK brokers as KRaft and do rolling restart

This reduces the time window to essentially the time it takes to do two
rolling restarts of the cluster. Once the brokers are in KRaft mode, they
won't have the "shutdown if log dir fails" behavior.
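
As a rough sketch, the broker config changes behind steps 2 and 3 might look
like the following. The property names are the standard migration and KRaft
settings; the host names, ports, and IDs are placeholders I made up, not
values from any real cluster.

```properties
# Step 2: ZK broker configured for migration (placeholder values)
broker.id=0
zookeeper.connect=zk1.example:2181
zookeeper.metadata.migration.enable=true
controller.quorum.voters=3000@controller1.example:9093
controller.listener.names=CONTROLLER

# Step 3: same broker restarted in KRaft mode (placeholder values)
# broker.id is replaced by node.id; the ZK and migration settings are removed
process.roles=broker
node.id=0
controller.quorum.voters=3000@controller1.example:9093
controller.listener.names=CONTROLLER
```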



One question with this approach is how the KRaft controller learns about
the multiple log directories after the broker is restarted in KRaft mode.
If I understand the design correctly, this would be similar to a
single-directory KRaft broker being reconfigured as a multiple-directory broker.
That is, the broker sees that the PartitionRecords are missing the
directory assignments and then sends AssignReplicasToDirs to the controller.
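
To make sure I'm describing the same thing, here is a minimal sketch of that
check, assuming my reading of the design is right. The class, method, and map
shapes are hypothetical and are not the actual broker code; it only covers the
"assignment missing from metadata" case described above.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class DirAssignmentSketch {
    /**
     * Hypothetical helper: returns the replicas whose directory is not yet
     * recorded in the metadata, mapped to the directory ID that hosts them
     * on disk. These are what the broker would report to the controller
     * (via AssignReplicasToDirs in the design).
     *
     * @param dirInMetadata "topic-partition" -> directory ID from the
     *                      PartitionRecord, empty if none is assigned
     * @param dirOnDisk     "topic-partition" -> directory ID where the
     *                      replica actually lives on this broker
     */
    static Map<String, String> assignmentsToSend(
            Map<String, Optional<String>> dirInMetadata,
            Map<String, String> dirOnDisk) {
        Map<String, String> toSend = new TreeMap<>();
        for (Map.Entry<String, String> e : dirOnDisk.entrySet()) {
            Optional<String> known =
                dirInMetadata.getOrDefault(e.getKey(), Optional.empty());
            if (known.isEmpty()) {
                // Metadata has no directory for this replica yet
                toSend.put(e.getKey(), e.getValue());
            }
        }
        return toSend;
    }

    public static void main(String[] args) {
        // Made-up example: foo-0 has no assignment in metadata, foo-1 does
        Map<String, Optional<String>> metadata = Map.of(
            "foo-0", Optional.<String>empty(),
            "foo-1", Optional.of("dir-A"));
        Map<String, String> onDisk = Map.of("foo-0", "dir-B", "foo-1", "dir-A");
        System.out.println(assignmentsToSend(metadata, onDisk)); // {foo-0=dir-B}
    }
}
```

Right after a ZK broker restarts in KRaft mode, every replica it hosts would
fall into the "missing" bucket, so the first request would cover all of them.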

Thanks!
David
