Hi, Igor,

Thanks for the excellent and thorough KIP.

Although it is not directly in scope of the KIP, I have a related
question about potential future work on disk degradation.

Today, what counts as a disk failure in Kafka is an I/O exception
surfaced by the JDK libraries. There are other, softer kinds of
failure where a disk (or the system behind its abstraction) remains
available but is degraded, typically in the form of elevated I/O
latency. Currently, Kafka is not made aware of the “health” of a
disk. It may be useful to let Kafka know about the QoS of its disks so
that it can take actions which could improve availability, e.g. via
leader movements.

The KIP builds upon the existing concepts of online and offline states
for log directories, and the propagation of a disk failure via the
broker heartbeat and registration relies on the list of offline
directories. I wonder if it could make sense to extend the definition
of the state of a log directory beyond online/offline so that it can
also express disk degradation. In that case, the new fields added to
the broker heartbeat and registration requests may be the place where
this additional state could be conveyed. Perhaps the changes to the
RPCs could be designed to accommodate this new kind of semantics in
the future.
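
To make the idea a bit more concrete, here is a rough Java sketch of
what a three-valued directory state could look like. Everything in it
is an assumption made for illustration: the names LogDirState,
DEGRADED, classify and offlineDirs, the latency threshold and the
directory identifiers are all hypothetical, and none of them come from
the KIP or from the Kafka code base. It only shows that a
per-directory state map could still yield the offline-only list that
the KIP relies on.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LogDirHealthSketch {

    /** Hypothetical states; Kafka today only distinguishes online from offline. */
    enum LogDirState { ONLINE, DEGRADED, OFFLINE }

    /** Hypothetical latency threshold above which a directory counts as degraded. */
    static final long DEGRADED_LATENCY_MS = 500;

    /**
     * Classifies a directory from two inputs: whether an I/O exception was
     * seen (today's failure signal) and an observed I/O latency (the soft
     * signal discussed above).
     */
    static LogDirState classify(boolean ioExceptionSeen, long observedLatencyMs) {
        if (ioExceptionSeen) {
            return LogDirState.OFFLINE;
        }
        if (observedLatencyMs > DEGRADED_LATENCY_MS) {
            return LogDirState.DEGRADED;
        }
        return LogDirState.ONLINE;
    }

    /** Derives the offline-only list from the richer per-directory state map. */
    static List<String> offlineDirs(Map<String, LogDirState> states) {
        List<String> offline = new ArrayList<>();
        for (Map.Entry<String, LogDirState> entry : states.entrySet()) {
            if (entry.getValue() == LogDirState.OFFLINE) {
                offline.add(entry.getKey());
            }
        }
        return offline;
    }

    public static void main(String[] args) {
        // Directory identifiers are placeholders, not real log directory IDs.
        Map<String, LogDirState> states = new LinkedHashMap<>();
        states.put("dir-a", classify(false, 12));
        states.put("dir-b", classify(false, 900));
        states.put("dir-c", classify(true, 0));

        System.out.println(states);
        System.out.println(offlineDirs(states));
    }
}
```

A controller that only understands online/offline could keep using
the derived list, while a newer one could also react to DEGRADED, for
example by moving leaders away from that directory.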

What do you think?

Thanks,
Alexandre

On Wed, 26 Apr 2023 at 14:05, Igor Soarez <soa...@apple.com.invalid> wrote:
>
> Thank you for another review, Ziming, much appreciated!
>
> 1. and 2. You are correct, it would be a big and perhaps strange difference.
> Since our last exchange of emails, the proposal has changed and now it
> does follow your suggestion to bump metadata.version.
> The KIP mentions it under "Compatibility, Deprecation, and Migration Plan".
>
> 3. I tried to describe this under "Controller", under the heading
> "Handling replica assignments", but perhaps it could be improved.
> Let me know what you think.
>
> Best,
>
> --
> Igor
>
