Hi Igor,

Thanks for the excellent and thorough KIP.

Although it is not directly in scope of the KIP, I have a related question about potential future work on disk degradation.

Today, what counts as a disk failure in Kafka is an I/O exception surfaced by the JDK libraries. There are other, softer types of failure where a disk (or the system behind its abstraction) remains available but is degraded, typically in the form of elevated I/O latency. Currently, Kafka is not made aware of the “health” of a disk. It may be useful to let Kafka know about the QoS of its disks so that it can take actions which could improve availability, e.g. via leader movements.

The KIP builds upon the existing concepts of online and offline states for log directories, and the propagation of a disk failure via the broker heartbeat and registration relies on the list of offline directories. I wonder if it would make sense to extend the definition of the state of a log directory beyond online/offline so that it can also express disk degradation. In that case, the new fields added to the broker heartbeat and registration requests may be the place where this additional state could be conveyed. Perhaps the changes to the RPCs could be designed to accommodate this kind of semantics in the future.

What do you think?

Thanks,
Alexandre

On Wed, 26 Apr 2023 at 14:05, Igor Soarez <soa...@apple.com.invalid> wrote:
>
> Thank you for another review Ziming, much appreciated!
>
> 1. and 2. You are correct, it would be a big and perhaps strange difference.
> Since our last exchange of emails, the proposal has changed and now it
> does follow your suggestion to bump metadata.version.
> The KIP mentions it under "Compatibility, Deprecation, and Migration Plan".
>
> 3. I tried to describe this under "Controller", under the heading
> "Handling replica assignments", but perhaps it could be improved.
> Let me know what you think.
>
> Best,
>
> --
> Igor
>