Re: [VOTE] KIP-858: Handle JBOD broker disk failure in KRaft

Colin McCabe Mon, 09 Oct 2023 10:39:19 -0700

On Fri, Oct 6, 2023, at 18:30, Igor Soarez wrote:
> Hi Colin,
>
>> I would call #2 LOST. It was assigned in the past, but we don't know where.
>> (I see that you called this OFFLINE). This is not really normal...
>> it should happen only when we're migrating from ZK mode to KRaft mode,
>> or going from an older KRaft release with multiple directories to a
>> post-JBOD release.
>
> What you refer to as #2 LOST is actually what I named SELECTED,
> as in: a directory has already been _selected_ sometime before,
> we just don't know which one yet.
>
> In the meantime this change has already been merged, but let me know
> if you feel strongly about the naming here as I'm happy to rename
> SELECTED_DIR to LOST_DIR in a new PR.
> https://github.com/apache/kafka/pull/14291
>
>> As for the third state -- I'm not sure why SELECTED_DIR needs to exist.
>
> The third state (actually it is ordered second) - OFFLINE_DIR - conveys
> that a replica is assigned to an unspecified offline directory.
>
> This can be used by the broker in the following way:
>
>   * When catching up with metadata, seeing that one of its partitions
>   is mapped to SELECTED_DIR, and it cannot find that partition in
>   any of the online log directories, and at least one log dir is offline,
>   then the broker sends AssignReplicasToDirs to converge the assignment
>   to OFFLINE_DIR
>
>   * If a log directory failure happens during an intra-broker (across dir)
>   replica movement, after sending AssignReplicasToDirs with the new UUID,
>   and before the future replica catches up again. (there's a section
>   in the KIP about this).
>
> We could just use a random UUID, as if a replica is assigned to a dir
> that is not in the broker's registration online dirs set then it is
> considered offline by controllers and metadata cache, but using a
> reserved UUID feels cleaner.
>


Hi Igor,

Thanks. I remember the third case now. Basically "unassigned" can transition 
either to the actual assigned directory, or to a special reserved directory ID 
that indicates that the broker can't find it. Maybe this is the one we should 
be calling "lost" :)

How do you feel about the following names for special directory IDs?

MIGRATING : during ZK migration or during migration from an older KRaft 
metadata version, ALL directory IDs get set to this initially. The expectation 
is that the replica exists somewhere on the given broker, but due to migration 
we don't know where ... yet.

UNASSIGNED : the replica was just created. Due to the fact that we're in JBOD 
mode and we want the broker itself to choose the directory, we set the 
directory ID to this. (If the broker only has a single active directory, the 
controller just sets that directory ID initially, and skips UNASSIGNED.)

LOST : a replica was in MIGRATING state, but the broker can't find it anywhere. 
The broker then sets it to LOST.

MIGRATING and LOST only get used for migrations; UNASSIGNED is the common one. 
For simplicity we can set MIGRATING to the all-zeros UUID, since UUID fields in 
records will get this by default.
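
To make the naming concrete, here is a rough sketch of the three reserved IDs. 
Only the all-zeros value for MIGRATING comes from the text above; the values 
chosen for UNASSIGNED and LOST, the class name, and the helper methods are 
assumptions for illustration, and java.util.UUID stands in for Kafka's own 
Uuid type.

```java
import java.util.List;
import java.util.UUID;

// Hypothetical holder for the reserved directory IDs discussed above.
final class ReservedDirectoryIds {
    // All-zeros: the value a UUID field in a record gets by default.
    static final UUID MIGRATING = new UUID(0L, 0L);
    // Assumed values; the thread does not specify them.
    static final UUID UNASSIGNED = new UUID(0L, 1L);
    static final UUID LOST = new UUID(0L, 2L);

    // True if the ID is one of the special values rather than a real directory.
    static boolean isReserved(UUID id) {
        return MIGRATING.equals(id) || UNASSIGNED.equals(id) || LOST.equals(id);
    }

    // Directory ID the controller would give a newly created replica:
    // the broker's only directory if it has exactly one, else UNASSIGNED
    // so that the broker picks the directory itself.
    static UUID initialDirFor(List<UUID> onlineDirs) {
        return onlineDirs.size() == 1 ? onlineDirs.get(0) : UNASSIGNED;
    }
}
```

With a single online directory, initialDirFor returns that directory and 
UNASSIGNED is never seen; with several, the replica starts out UNASSIGNED.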

>> I think we need a general mechanism for checking that replicas are
>> in the directories we expect and sending an RPC to the controller
>> if they are not. A mechanism like this will automatically get rid
>> of the LOST replicas just as part of normal operation -- nothing
>> special required.
>
> Thanks for pointing this out, I forgot to put in the notes in my
> previous email that we discussed this too.
>
> The KIP proposes this is done when catching up with metadata,
> but you also suggested we extend the stray replica detection
> mechanism to also check for these inconsistencies. I think
> this is a good idea, and we'll look into that as well.
>

Yes, I think we are on the same page here.

What I was proposing was that there is some piece of code that handles 
reconciling the controller's view of where replicas are with the broker's view. 
Then we could have the broker wait until that code is done with its work before 
unfencing. It's not too different from how we wait for metadata to be caught up 
before requesting unfencing.

Probably the messy thing is handling the interaction between this code and 
ReplicaManager. Maybe it would work best if the interaction was sort of 
one-way: if ReplicaManager sees a discrepancy, it asks this new manager code to 
correct it. After all, we don't want to have ReplicaManager responsible for 
sending RPCs to the controller. It already has enough to do! And there are 
issues like handling retries and so on.
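
Something like the following is what I have in mind, very roughly. All the 
names here are made up for illustration (there is no such class in Kafka as 
far as this thread goes), the RPC is abstracted as a callback that returns 
true on success, and partitions are plain strings to keep the sketch 
self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;
import java.util.function.Predicate;

// Hypothetical manager that owns talking to the controller about
// directory assignments, so ReplicaManager does not have to.
final class DirectoryAssignmentReconciler {
    // Corrections not yet acknowledged by the controller, keyed by
    // partition name. A newer report for a partition replaces the older one.
    private final Map<String, UUID> pending = new LinkedHashMap<>();
    // Stand-in for the RPC to the controller: true means the batch was accepted.
    private final Predicate<Map<String, UUID>> sender;

    DirectoryAssignmentReconciler(Predicate<Map<String, UUID>> sender) {
        this.sender = sender;
    }

    // One-way entry point for ReplicaManager: report and move on.
    synchronized void reportDiscrepancy(String partition, UUID actualDir) {
        pending.put(partition, actualDir);
    }

    // Called periodically. On failure the corrections stay queued,
    // so the next call retries them.
    synchronized boolean trySend() {
        if (pending.isEmpty()) return true;
        if (sender.test(new LinkedHashMap<>(pending))) {
            pending.clear();
            return true;
        }
        return false;
    }

    // The broker would hold off on requesting unfencing until this is true.
    synchronized boolean readyToUnfence() {
        return pending.isEmpty();
    }
}
```

The point of the shape is that retries and controller communication live in 
one place, and ReplicaManager only ever calls reportDiscrepancy.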

With regard to the failure detection "gap" during hybrid mode: the KRaft 
controller sends a full LeaderAndIsrRequest to the brokers that are in hybrid 
mode, right? And there is a per-partition response as well. Right now, we don't 
pay attention to the error codes sent back in the response. But we could. Any 
replica with an error could be transitioned from MIGRATING -> LOST, right? That 
would close the failure detection gap.
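
As a sketch of that transition, assuming the per-partition error codes have 
already been pulled out of the response into a map: the class and method names 
are hypothetical, the LOST value is an assumed placeholder, and 0 is used as 
the "no error" code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical controller-side handling of per-partition response errors.
final class HybridModeFailureDetection {
    static final UUID MIGRATING = new UUID(0L, 0L);
    static final UUID LOST = new UUID(0L, 2L); // assumed value
    static final short NONE = 0;               // "no error" code

    // Returns a copy of the assignments in which every replica that came
    // back with an error, and is still MIGRATING, has been moved to LOST.
    static Map<String, UUID> applyResponse(Map<String, UUID> assignments,
                                           Map<String, Short> errorCodes) {
        Map<String, UUID> updated = new HashMap<>(assignments);
        errorCodes.forEach((partition, code) -> {
            if (code != NONE && MIGRATING.equals(updated.get(partition))) {
                updated.put(partition, LOST);
            }
        });
        return updated;
    }
}
```

Replicas that already have a real directory ID are left alone here; whether 
they should also be affected is a separate question.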

best,
Colin
