Re: [VOTE] KIP-858: Handle JBOD broker disk failure in KRaft

Ron Dagostino Sun, 10 Sep 2023 14:33:35 -0700

Hi Igor.  Thanks for all your work here.  Before I can vote, I have
the following questions/comments about the KIP:

> When multiple log.dirs are configured, a new property will be included
> in meta.properties — directory.id — which will identify each log directory
> with a UUID. The UUID is randomly generated for each log directory.

Since every log directory gets a random UUID assigned, even if just
one log dir is configured in the Broker, I think the above should not
be qualified with the phrase “When multiple log.dirs are configured".

Similarly:

> When multiple log.dirs are configured, a new property — directory.id — will be
> expected in the meta.properties file in each log directory configured under 
> log.dirs.

I think the qualification "When multiple log.dirs are configured"
should be removed.

Colin had mentioned that in PartitionRecord he would prefer a new
(tagged) array for replica UUIDs, rather than creating the
ReplicaAssignment array.  While adding to an RPC is arguably less
intrusive than replacing, I am inclined to disagree with Colin's
suggestion for the following reason.  We have agreed to not specify
the log dir uuid in the case where a replica only has one registered
log dir.  If we were to add a new tagged array, it would become
necessary to specify all replica log dir uuids for all replicas in the
PartitionRecord if any one such replica had more than one log dir
configured.  By creating the ReplicaAssignment array we can just
specify the uuid -- or not -- for each replica based on whether that
replica itself has multiple registered log dirs or not.

The documentation of the new config “log.dir.failure.timeout.ms"
should indicate that shutdown will only occur if the replica is the
leader for at least one replica in the failed log directory.  The text
in the KIP says this later on, so I think the config documentation
should have it as well.

> a logical update to all partitions in that broker takes place,
> assigning the replica's directory to the single directory previously 
> registered –
> i.e. it is assumed that all replicas are still in the same directory, and 
> this transition
> to JBOD avoids creating partition change records.

I would like to confirm that upon writing a snapshot each
PartitionRecord will now contain an explicit directory for such
replicas since there is no longer just 1 possibility.  It might be
good to state it in the KIP for clarity assuming this statement is
correct.

The “Metadata caching” section states that replicas will also be
considered offline if the replica references a log directory UUID that
is not present in the hosting Broker's latest registration EXCEPT FOR
the case when there is just one registered directory and the presence
of any offline log dirs is not indicated (i.e. OfflineLogDirs is
false).  I wonder about the corner case where a broker that previously
had multiple log dirs is restarted with a new config that specifies
just a single log directory.  What would happen here?  If the broker
were not the leader then perhaps it would replicate the data into the
single log directory.  What would happen if it were the leader of a
partition that had been marked as offline?  Would we have data loss
even if other replicas still had data?

Ron

On Tue, Sep 5, 2023 at 7:46 AM ziming deng <dengziming1...@gmail.com> wrote:
>
> Hi, Igor
> I’m +1(binding) for this, looking forward the PR.
>
> --
> Best,
> Ziming
>
> > On Jul 26, 2023, at 01:13, Igor Soarez <soa...@apple.com.INVALID> wrote:
> >
> > Hi everyone,
> >
> > Following a face-to-face discussion with Ron and Colin,
> > I have just made further improvements to this KIP:
> >
> >
> > 1. Every log directory gets a random UUID assigned, even if just one
> >   log dir is configured in the Broker.
> >
> > 2. All online log directories are registered, even if just one if 
> > configured.
> >
> > 3. Partition-to-directory assignments are only performed if more than
> >   one log directory is configured/registered.
> >
> > 4. A successful reply from the Controller to a AssignReplicasToDirsRequest
> >   is taken as a guarantee that the metadata changes are
> >   successfully persisted.
> >
> > 5. Replica assignments that refer log directories pending a failure
> >   notification are prioritized to guarantee the Controller and Broker
> >   agree on the assignments before acting on the failure notification.
> >
> > 6. The transition from one log directory to multiple log directories
> >   relies on a logical update to efficiently update directory assignments
> >   to the previously registered single log directory when that's possible.
> >
> > I have also introduced a configuration for the maximum time the broker
> > will keep trying to send a log directory notification before shutting down.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft
> >
> > Best,
> >
> > --
> > Igor
> >
>

Re: [VOTE] KIP-858: Handle JBOD broker disk failure in KRaft

Reply via email to