Hi everyone,

There have been a number of further changes.

I have updated the KIP to reflect them, but for reference,
I'd also like to update this thread with a summary.

1. The reserved Uuids for directories, and their names, have changed.
The first 100 Uuids are now reserved for future use.
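
For illustration, the reservation could be checked along these lines.
This is a minimal sketch using the existing org.apache.kafka.common.Uuid
class; the helper names and exact bounds are assumptions rather than the
final API:

  import org.apache.kafka.common.Uuid;

  public final class DirectoryIds {

      // Hypothetical check: the first 100 Uuids (most significant bits
      // zero, least significant bits 0..99) are treated as reserved.
      public static boolean isReserved(Uuid id) {
          return id.getMostSignificantBits() == 0
                  && id.getLeastSignificantBits() >= 0
                  && id.getLeastSignificantBits() < 100;
      }

      // Hypothetical generator for new directory IDs that skips the
      // reserved range.
      public static Uuid random() {
          Uuid id = Uuid.randomUuid();
          while (isReserved(id)) {
              id = Uuid.randomUuid();
          }
          return id;
      }
  }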

2. During the ZooKeeper to KRaft migration, if a broker still
configured in ZK mode has any log directory offline, it will
shut down and refuse to start up. The expectation is that this
escalation from a single log directory's unavailability to the
entire broker's unavailability will be temporary, limited to the
migration period, and that the simplification will help develop,
test, and support this feature.
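
Roughly, the intended behaviour is the following check; the names here
are illustrative only, not the actual broker code:

  import java.util.Set;

  final class MigrationLogDirCheck {
      // Illustrative only: a ZK-mode broker taking part in the migration
      // refuses to run if any of its log directories is offline.
      static void verify(boolean migrationEnabled, Set<String> offlineLogDirs) {
          if (migrationEnabled && !offlineLogDirs.isEmpty()) {
              throw new IllegalStateException("Log directories " + offlineLogDirs
                      + " are offline; this is not supported in ZK mode during the migration.");
          }
      }
  }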

3. The representation of replica directories in metadata records is
no longer tightly coupled with the respective broker IDs. Instead of
replacing the int[] replicas field in PartitionRecord and
PartitionChangeRecord, we are introducing a new Uuid[] field named
directories, which is kept the same size and in the same order as
the existing replicas field. See
https://github.com/apache/kafka/pull/14290 for further details.
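
To make the coupling concrete, here is a minimal sketch of how a reader
of these records could resolve the directory for a given replica,
relying only on the two arrays having the same size and order. The
class and method names are illustrative, not part of the proposal:

  import org.apache.kafka.common.Uuid;

  public final class ReplicaDirectories {

      // Illustrative helper: directories[i] names the log directory that
      // hosts the replica on broker replicas[i].
      public static Uuid directoryForReplica(int[] replicas, Uuid[] directories,
                                             int brokerId) {
          for (int i = 0; i < replicas.length; i++) {
              if (replicas[i] == brokerId) {
                  return directories[i];
              }
          }
          throw new IllegalArgumentException("Broker " + brokerId + " is not a replica");
      }
  }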

4. Assignments relating to failed log directories are no longer
prioritized. Previously the KIP proposed prioritizing assignments
that related to failed log directories, aiming to synchronize the
necessary replica-to-directory mapping on the controller before
handling the directory failure. We have since decided to remove any
prioritization of these assignments, as delaying the reporting of
directory failures for any reason is considered detrimental.

5. Uuids for log directories that failed after startup are always
included in every broker heartbeat request. Previously the KIP
proposed sending Uuids for failed directories in the broker
heartbeat only until a successful reply was received. However, due
to the overload mode handling of broker heartbeats, a heartbeat
request may receive a successful response without being fully
processed, so it is preferable to always send the cumulative list
of directory IDs that have failed since startup. In the future, this
list can be trimmed to remove directory IDs that are seen to be
removed from the broker registration, as the broker catches up with
metadata.
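
A rough sketch of the broker-side bookkeeping this implies follows.
The class and method names are assumptions for illustration only, not
the actual implementation:

  import java.util.Set;
  import java.util.concurrent.ConcurrentHashMap;
  import org.apache.kafka.common.Uuid;

  // Illustrative only: accumulate every directory that has failed since
  // startup and attach the full set to each heartbeat request.
  class FailedDirectoryTracker {
      private final Set<Uuid> failedSinceStartup = ConcurrentHashMap.newKeySet();

      void onDirectoryFailure(Uuid directoryId) {
          failedSinceStartup.add(directoryId);
      }

      // Called when building each BrokerHeartbeatRequest: always the
      // cumulative set, regardless of what previous responses said.
      Set<Uuid> offlineLogDirsForHeartbeat() {
          return Set.copyOf(failedSinceStartup);
      }
  }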

6. The proposal to shut down the broker after log.dir.failure.timeout.ms
if it is unable to communicate that some log directory is offline
is now more of an open question. It's unclear if this will
actually be necessary.
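
For reference, the mechanism under discussion would look roughly like
the sketch below. Only the config name log.dir.failure.timeout.ms comes
from the KIP; everything else is illustrative:

  import java.util.concurrent.Executors;
  import java.util.concurrent.ScheduledExecutorService;
  import java.util.concurrent.TimeUnit;
  import java.util.function.BooleanSupplier;

  // Illustrative only: when a directory failure is queued for propagation,
  // schedule a check; if the failure still has not reached the controller
  // after log.dir.failure.timeout.ms, shut the broker down.
  class DirectoryFailureTimeout {
      private final ScheduledExecutorService scheduler =
              Executors.newSingleThreadScheduledExecutor();

      void onFailureQueued(long timeoutMs, BooleanSupplier acknowledgedByController,
                           Runnable shutdownBroker) {
          scheduler.schedule(() -> {
              if (!acknowledgedByController.getAsBoolean()) {
                  shutdownBroker.run();
              }
          }, timeoutMs, TimeUnit.MILLISECONDS);
      }
  }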

Please share if you have any thoughts.

Best,

--
Igor


On Tue, Oct 10, 2023, at 5:28 AM, Igor Soarez wrote:
> Hi Colin,
> 
> Thanks for the renaming suggestions. UNASSIGNED is better than
> UNKNOWN, MIGRATING is also better than SELECTED, and I don't
> expect it to be used outside of the migration phase.
> LOST can also work instead of OFFLINE, but I think there will
> be other uses for this value outside of the migration, like
> in the intra-broker replica movement edge cases described in the KIP.
> I've updated the KIP and also filed a tiny PR with your suggestion,
> except that I'm keeping the description of LOST broader than just
> the migration scope.
> 
>   https://github.com/apache/kafka/pull/14517
> 
> 
> The KIP already proposes that the broker does not request unfencing
> until it has confirmed that all the assignments have been communicated
> to the controller. And you're right about the interaction
> with ReplicaManager: we definitely don't want RPCs coming
> out of there. My intention is to introduce a new manager, as you
> suggest, called DirectoryEventManager, with its own event loop
> that batches and prioritizes assignment and directory failure events.
> There's already an open PR, perhaps you could have a look?
> 
>   KAFKA-15357: Aggregate and propagate assignments and logdir failures
>   https://github.com/apache/kafka/pull/14369
> 
> 
> > With regard to the failure detection "gap" during hybrid mode: the
> > kraft controller sends a full LeaderAndIsrRequest to the brokers
> > that are in hybrid mode, right? And there is a per-partition
> > response as well. Right now, we don't pay attention to the error
> > codes sent back in the response. But we could. Any replica with an
> > error could be transitioned from MIGRATING -> LOST, right? That
> > would close the failure detection gap.
> 
> Almost. The missing bit is that the controller would also need to
> watch the /log_dir_event_notification znode, and on any change
> read the broker ID in the value and send a full LeaderAndIsrRequest
> to the respective broker. Without this watch, it might be a very long
> time between a dir failing and the controller sending
> LeaderAndIsrRequests covering every partition hosted in the failed dir.
> 
> Watching the znode is a read-only action, and it doesn't seem like
> a big thing to add this watch plus the error handling.
> Maybe extending the ZK Controller compatibility functionality in this
> way isn't such a bad option after all?
> 
> 
> Best,
> 
> --
> Igor
> 
