Hi Jun,

Thank you for reviewing the KIP. Please find my replies to your comments below.
10. Thanks for pointing out this typo; it has been corrected.

11. I agree that the additional delay in switching to the future replica is undesirable. However, I see a couple of issues if we forward the request to the controller as you describe:

a) If the controller persists the change to the log directory assignment before the future replica has caught up, and there is then a failure in the original log directory, there will be no failover if the broker is the leader for the partition, and the partition will become unavailable. It is not safe to call AssignReplicasToDirectories before the replica exists in the designated log directory.

b) An existing behavior we'd like to maintain if possible is the ability to move partitions between log directories while a broker is offline, as it can be very useful for managing storage space. ZK brokers will load and accept logs in the new location after startup. Maintaining this behavior requires that the broker be able to override/correct assignments seen in the cluster metadata, i.e. the broker is the authority on log directory placement in case of mismatch. If we want to keep this feature and have the controller send log directory reassignments, we'll need a way to distinguish between a mismatch due to offline movement and a mismatch due to a controller-triggered reassignment.

To keep the delay low, instead of sending AssignReplicasToDirectories while holding the lock, the RPC can be queued elsewhere when the future replica first catches up. ReplicaAlterLogDirsThread can keep going and not switch the replicas yet. Once the broker receives confirmation of the metadata change, it can then briefly block appends to the old replica and make the switch. In the unlikely event that the source log directory fails between the moment AssignReplicasToDirectories is acknowledged by the controller and the moment the broker is able to make the switch, the broker can issue AssignReplicasToDirectories for that replica back to the offline log directory, to let the controller know that the replica is actually offline. (I've included a rough sketch of this sequencing in the P.S. below.) What do you think?

12. Indeed, the metadata.log.dir, if explicitly set to a separate directory, should not be included in the directory UUID list sent to the controller in broker registration and heartbeat requests. I have updated the KIP to make this explicit.

13. Thank you for making this suggestion. Let's address the different scenarios you enumerated:

a) When enabling JBOD for an existing KRaft cluster

In this scenario, the broker finds a single log directory configured in `log.dirs`, with an already existing `meta.properties` that is simply missing `directory.id` and `directory.ids`. It is safe to have the broker automatically generate the log dir UUID and update the `meta.properties` file. This removes the need for extra steps when upgrading and enabling this feature, so it is quite useful.

b) Adding a log dir to an existing JBOD KRaft cluster

In this scenario, the broker finds a shorter list of UUIDs in `directory.ids` in `meta.properties` than the directories configured in `log.dirs`. KIP-631 introduced the requirement to run kafka-storage.sh to format storage directories. Currently, if a broker in KRaft mode cannot find `meta.properties` in each log directory, it will fail at startup. KIP-785 proposes removing the need to run the format storage command, but it is still open. If new log directories are being added, the storage command must be run anyway, so I don't think there will be any benefit in this case.
c) Removing a log dir from an existing JBOD KRaft cluster

In this scenario, the broker finds a larger list of UUIDs in `directory.ids` in `meta.properties` than the directories configured in `log.dirs`. The removal of log directories requires an explicit update to the `log.dirs` configuration, so it is also safe to have the broker automatically update `directory.ids` in `meta.properties` to remove the extra UUIDs. It is also useful to drop the requirement to run the storage command after removing log directories from the configuration, as it reduces the operational burden.

d) Upgrading a JBOD ZK cluster to KRaft

In ZooKeeper mode brokers write `meta.properties` with `version=0`, but in KRaft mode brokers require `version=1`. According to the current discussion on KIP-866, the broker will automatically upgrade `meta.properties` from `version=0` to `version=1`. Assuming `meta.properties` gets translated to `version=1` as part of KIP-866, the broker should find `meta.properties` files that are simply missing the two new fields: `directory.id` and `directory.ids`. So it should be OK for the broker to generate UUIDs for each log directory and update the `meta.properties` file.

I have updated the KIP to remove the need to run the storage command in these scenarios - apart from b), where it should already be a requirement AFAICT - and to describe how the broker will automatically update `directory.id` and `directory.ids` in `meta.properties` if required. (A rough sketch of this reconciliation is in the P.S. below.)

14. Thank you for pointing this out. Yes, the controller should simply delegate the log directory choice back to the broker, by assigning `Uuid.ZERO` to all the replicas that were assigned to removed log directories. I have updated the KIP under "Controller" > "Broker registration".

15. Yes. Same as in "Metadata caching", replicas are considered offline if they reference a log directory which is not in the list of online log directories for the broker ID hosting the replica. This means the controller will not assign leadership to those replicas. This is described under "Controller" > "Handling log directory failures". (See also the small sketch for 14 and 15 in the P.S. below.)

16. We'll need to bump metadata.version to gate the use of the new records and RPC changes. I've added a mention of this under the "Compatibility, Deprecation, and Migration Plan" section.

Please let me know if you have any further feedback.

Best,

--
Igor
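P.S. To make some of the above a little more concrete, here are a few rough sketches. These are illustrations only, with made-up class and method names, not proposed implementation and not text from the KIP.

For 11, the sequencing I have in mind looks roughly like this:

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: hypothetical names, not actual broker code.
public class DirectorySwitchSketch {

    // Partitions whose future replica has caught up and which are waiting for the
    // controller to persist the AssignReplicasToDirectories change.
    private final Set<String> awaitingMetadataAck = ConcurrentHashMap.newKeySet();

    // Called when the future replica first catches up. The RPC is queued/sent outside
    // the fetcher lock, and ReplicaAlterLogDirsThread keeps copying in the meantime.
    void onFutureReplicaCaughtUp(String topicPartition) {
        awaitingMetadataAck.add(topicPartition);
        sendAssignReplicasToDirectories(topicPartition); // async, towards the active controller
    }

    // Called once the broker sees the new directory assignment in the metadata log.
    void onAssignmentSeenInMetadata(String topicPartition) {
        if (awaitingMetadataAck.remove(topicPartition)) {
            blockAppends(topicPartition);             // brief pause on the old replica
            try {
                promoteFutureReplica(topicPartition); // make the switch
            } finally {
                unblockAppends(topicPartition);
            }
        }
    }

    // Failure path: the source directory failed after the controller acknowledged the new
    // assignment but before the switch happened, so re-assign the replica back to the (now
    // offline) source directory to let the controller know the replica is actually offline.
    void onSourceDirectoryFailedBeforeSwitch(String topicPartition, UUID offlineDirectoryId) {
        if (awaitingMetadataAck.remove(topicPartition)) {
            sendAssignReplicasToDirectories(topicPartition, offlineDirectoryId);
        }
    }

    // Stubs so the sketch is self-contained.
    private void sendAssignReplicasToDirectories(String tp) { }
    private void sendAssignReplicasToDirectories(String tp, UUID dirId) { }
    private void blockAppends(String tp) { }
    private void unblockAppends(String tp) { }
    private void promoteFutureReplica(String tp) { }
}
```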
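For 13, the broker-side reconciliation of `directory.id`/`directory.ids` against the configured `log.dirs` could look roughly like this (again illustrative names only; the comma-separated encoding is just for the example, the actual format of `directory.ids` is whatever the KIP specifies):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Illustrative only: not the exact meta.properties handling from the KIP.
public final class DirectoryIdReconciliationSketch {

    // existingIdByLogDir: the directory.id already recorded in each configured log dir's
    // meta.properties, or absent where the field is missing.
    static Map<String, UUID> reconcile(Map<String, UUID> existingIdByLogDir,
                                       List<String> configuredLogDirs) {
        Map<String, UUID> idByLogDir = new LinkedHashMap<>();
        for (String dir : configuredLogDirs) {
            UUID existing = existingIdByLogDir.get(dir);
            // Scenarios a) and d): the field is missing, so the broker generates a new UUID.
            idByLogDir.put(dir, existing != null ? existing : UUID.randomUUID());
        }
        // Scenario c): IDs for directories no longer listed in log.dirs are simply not carried
        // over, so the directory.ids written back to meta.properties drops the extra UUIDs.
        return idByLogDir;
    }

    // directory.ids would then be rewritten from the reconciled set; a comma-separated
    // encoding is used here purely for illustration.
    static String directoryIdsValue(Map<String, UUID> idByLogDir) {
        List<String> ids = new ArrayList<>();
        idByLogDir.values().forEach(id -> ids.add(id.toString()));
        return String.join(",", ids);
    }
}
```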
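And for 14 and 15, the controller-side rules, with a made-up `UNASSIGNED` constant standing in for `Uuid.ZERO`:

```java
import java.util.Set;
import java.util.UUID;

// Illustrative only: hypothetical names for the controller-side rules.
public final class ControllerDirectoryRulesSketch {

    // Stands in for Uuid.ZERO: the controller delegates the directory choice back to the broker.
    static final UUID UNASSIGNED = new UUID(0L, 0L);

    // 14: on broker registration, a replica assigned to a log directory the broker no longer
    // reports is reset to UNASSIGNED so the broker picks a directory for it itself.
    static UUID assignmentAfterRegistration(UUID assignedDir, Set<UUID> registeredDirs) {
        return registeredDirs.contains(assignedDir) ? assignedDir : UNASSIGNED;
    }

    // 15: a replica is offline if it references a log directory that is not in the list of
    // online log directories for the broker hosting it; such replicas are not given leadership.
    // (Assumption in this sketch: an unassigned replica is not treated as offline on this basis.)
    static boolean isReplicaOffline(UUID assignedDir, Set<UUID> onlineDirsForBroker) {
        return !UNASSIGNED.equals(assignedDir) && !onlineDirsForBroker.contains(assignedDir);
    }
}
```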