Hi everyone,

Earlier today Colin, Ron, Proven and I had a chat about this work.
We discussed several aspects which I’d like to share here.

## A new reserved UUID

We'll reserve a third UUID to indicate a directory that is
unspecified, but known to have been selected. The default
UNKNOWN_DIR (ZERO_UUID) is used for new replicas, which may
or may not have been placed in some directory. In contrast,
this new UUID can disambiguate transition scenarios where
we previously did not have any directory assignment
information, e.g. single-logdir KRaft into JBOD KRaft,
or ZK JBOD to KRaft JBOD mode.

Unless anyone has a better suggestion for naming or any objection,
I'll update the KIP to designate the following reserved UUIDs:

  * UNKNOWN_DIR  - new Uuid(0L, 0L)
  * OFFLINE_DIR  - new Uuid(0L, 1L)
  * SELECTED_DIR - new Uuid(0L, 2L) <-- new

When transitioning to the directory assignment feature,
without any previous directory assignment state, the
controller can assign all existing partitions in the broker
to SELECTED_DIR, to distinguish them from new partitions.
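
As a rough sketch of the above, assuming java.util.UUID as a stand-in for
Kafka's Uuid class, and with a hypothetical helper name (initialAssignment)
that is not part of the KIP:

```java
import java.util.UUID;

// Sketch only: java.util.UUID stands in for Kafka's Uuid class.
final class ReservedDirIds {
    static final UUID UNKNOWN_DIR  = new UUID(0L, 0L); // default, used for new replicas
    static final UUID OFFLINE_DIR  = new UUID(0L, 1L); // replica is in an offline dir
    static final UUID SELECTED_DIR = new UUID(0L, 2L); // dir was selected, but is not yet known

    private ReservedDirIds() { }

    // Hypothetical helper: the assignment the controller could record for a
    // partition when the feature is enabled without any prior assignment state.
    static UUID initialAssignment(boolean existedBeforeTransition) {
        return existedBeforeTransition ? SELECTED_DIR : UNKNOWN_DIR;
    }
}
```

Existing partitions then carry SELECTED_DIR until the broker reports the
actual directory, while partitions created afterwards start as UNKNOWN_DIR.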

When a log directory is offline, it is important that the
broker does not re-create the offline partitions in the remaining
online directories. So, if some directory is offline and
some partition is missing its directory assignment, the broker
must be able to distinguish new partitions from old ones.
Old partitions may already exist in the offline dirs, but
new partitions can safely be placed in the available (online) dirs.
In ZK mode, the `isNew` flag in the LeaderAndIsr request
serves this purpose. For KRaft, this KIP proposes keeping
the broker fenced until all initial assignments are
known. The new UUID serves as an additional
signal and covers a gap in the ZK->KRaft migration stage,
where ZK brokers do not support the fenced state.

SELECTED_DIR is always a temporary transition state, which
is due to be resolved by the broker. When catching up with
metadata, for any partitions associated with SELECTED_DIR:

  * If the partition is found in some directory, AssignReplicasToDirs
  is used to correct the assignment to the actual dir.

  * If the partition is not found, and no directory is offline,
  a directory is selected, and AssignReplicasToDirs is used to
  correct the assignment to the chosen directory.

  * If the partition is not found and some directory is offline,
  the broker assumes that the partition must be in one of the
  offline dirs and AssignReplicasToDirs is used to converge the state
  to OFFLINE_DIR.

This contrasts with UNKNOWN_DIR, for which brokers always select
a directory, regardless of the online/offline state of any log dirs.
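
To make the rules above concrete, here is a minimal sketch of the broker-side
decision. The class and method names are hypothetical, java.util.UUID stands
in for Kafka's Uuid, and the caller is assumed to have already picked an
online directory for the case where one is needed. Using the on-disk location
for an UNKNOWN_DIR partition that is found locally is my assumption, not
something we settled.

```java
import java.util.Optional;
import java.util.UUID;

// Sketch only: names are illustrative and not taken from the KIP.
final class DirAssignmentResolver {
    static final UUID UNKNOWN_DIR  = new UUID(0L, 0L);
    static final UUID OFFLINE_DIR  = new UUID(0L, 1L);
    static final UUID SELECTED_DIR = new UUID(0L, 2L);

    private DirAssignmentResolver() { }

    /**
     * Returns the directory ID the broker would send in AssignReplicasToDirs.
     *
     * @param assigned        directory ID currently recorded in the metadata
     * @param foundIn         online directory where the partition exists, if any
     * @param anyDirOffline   whether at least one log directory is offline
     * @param chosenOnlineDir online directory picked by the placement policy
     */
    static UUID resolve(UUID assigned, Optional<UUID> foundIn,
                        boolean anyDirOffline, UUID chosenOnlineDir) {
        if (SELECTED_DIR.equals(assigned)) {
            if (foundIn.isPresent()) {
                return foundIn.get();   // correct to the actual dir
            }
            // Not found: with an offline dir, assume the partition lives there.
            return anyDirOffline ? OFFLINE_DIR : chosenOnlineDir;
        }
        if (UNKNOWN_DIR.equals(assigned)) {
            // Always ends up in a directory, regardless of offline dirs.
            // Assumption: prefer the on-disk location if the replica exists.
            return foundIn.orElse(chosenOnlineDir);
        }
        return assigned;                // already a concrete assignment
    }
}
```

The only difference between the two reserved values is the "not found, some
dir offline" case: SELECTED_DIR converges to OFFLINE_DIR, UNKNOWN_DIR does not.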

## Reserving a pool of non-designated UUIDs for future use

It’s probably a good idea to reserve a bunch of UUIDs
for future use in directory assignments. The decision
is essentially costless right now, and it may prove
useful later. The first 100 UUIDs (including
the 3 already designated above) will be reserved.
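
One possible reading of "the first 100 UUIDs" is every UUID whose high 64 bits
are zero and whose low 64 bits are in the range 0..99. Under that assumption,
and again with java.util.UUID standing in for Kafka's Uuid and hypothetical
names, a check could look like this:

```java
import java.util.UUID;

// Sketch only: assumes the reserved pool is Uuid(0L, 0L) .. Uuid(0L, 99L).
final class ReservedDirIdPool {
    static final long RESERVED_COUNT = 100L;

    private ReservedDirIdPool() { }

    // True if the ID falls inside the reserved pool.
    static boolean isReserved(UUID id) {
        long low = id.getLeastSignificantBits();
        return id.getMostSignificantBits() == 0L && low >= 0L && low < RESERVED_COUNT;
    }

    // Generates a directory ID that is guaranteed not to collide with the pool.
    // Random (version 4) UUIDs never have all-zero high bits, so the retry is
    // purely defensive.
    static UUID newDirId() {
        UUID id;
        do {
            id = UUID.randomUUID();
        } while (isReserved(id));
        return id;
    }
}
```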

## Dir failures during ZK->KRaft migration

The ZK compatibility functionality in the KRaft controller
does not currently implement dir failure handling. So we need
to select a strategy to deal with dir failures during the migration.

We discussed different options:

  a) If a migrating ZK mode broker encounters a directory failure,
  it will shut down. While this degrades failure handling during
  the temporary migration window, it is a useful simplification.
  This is an attractive option, and it isn't ruled out, but it
  is also not clear that it is necessary at this point.

  b) Extend the ZK controller compatibility functionality in
  KRaft controllers to watch the /log_dir_event_notification
  znode, and rely on LeaderAndIsr requests for dir failure handling,
  the same as ZK controllers do. As there is a desire to limit the scope
  of the compatibility functionality, this option looks less attractive.

  c) Extend the ZK mode broker functionality during the migration
  to both send AssignReplicasToDirs, populating the assignment state
  earlier, and propagate dir failures in the heartbeat, in the same
  way the KIP proposes for regular KRaft brokers.
  There are several phases in the ZK->KRaft migration, and part of
  the process requires ZK mode brokers to send BrokerRegistration
  and BrokerHeartbeat requests already, so this doesn't look like
  a big change, and seems to be the preferred option.

If you have any thoughts or questions on any of these matters,
please let me know.

Best,

--
Igor
