Hi Mahsa,

Thanks for the KIP. I agree that this metric would help with debugging the controller's performance and availability.

In the description of the metric you state: "The idle ratio measures the proportion of time the controller thread is not actively processing an event over the last 30 seconds." How are you planning to implement this? Kafka can't control how often the metric gets measured, and we don't want to schedule an event every 30 seconds just to measure a metric. The metric should instead report idleness over the actual measurement interval, that is, the time elapsed since the metric was last read.

The KRaft module already implements this type of metric in org.apache.kafka.raft.internals.TimeRatio, and we should reuse that implementation for the new metric. I see that QuorumControllerMetrics uses KafkaYammerMetrics while TimeRatio uses Kafka Metrics. It should still be possible to share one implementation behind two facades or interfaces, one for each metrics registry.

Thanks,
--
-José
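
P.S. To illustrate what "idleness based on the actual measurement interval" could look like, here is a minimal Java sketch. It is not the actual org.apache.kafka.raft.internals.TimeRatio code; the class name IdleRatioSketch and its methods recordIdle and measure are hypothetical names chosen for this example. The idea is that the event loop accumulates idle time as it goes, and each read of the metric divides that total by the time elapsed since the previous read, so no fixed 30-second schedule is needed.

```java
/**
 * Hypothetical sketch of an idle-ratio accumulator. This is NOT the actual
 * org.apache.kafka.raft.internals.TimeRatio implementation; all names here
 * are assumptions made for illustration.
 */
final class IdleRatioSketch {
    private long windowStartMs;   // when the current measurement window began
    private long idleMsInWindow;  // idle time accumulated since windowStartMs

    IdleRatioSketch(long nowMs) {
        this.windowStartMs = nowMs;
    }

    /** Called by the event loop each time it finishes waiting for an event. */
    synchronized void recordIdle(long idleMs) {
        idleMsInWindow += Math.max(0L, idleMs);
    }

    /**
     * Called whenever the metric is read. Returns idle time divided by the
     * time elapsed since the previous read, then starts a new window.
     */
    synchronized double measure(long nowMs) {
        long elapsedMs = nowMs - windowStartMs;
        if (elapsedMs <= 0L) {
            return 0.0; // no time has passed, so there is nothing to report
        }
        double ratio = Math.min(1.0, (double) idleMsInWindow / elapsedMs);
        windowStartMs = nowMs;
        idleMsInWindow = 0L;
        return ratio;
    }

    public static void main(String[] args) {
        IdleRatioSketch ratio = new IdleRatioSketch(0L);
        ratio.recordIdle(300L);
        ratio.recordIdle(200L);
        System.out.println(ratio.measure(1_000L)); // 500 ms idle over 1000 ms: 0.5
        ratio.recordIdle(4_500L);
        System.out.println(ratio.measure(6_000L)); // 4500 ms idle over 5000 ms: 0.9
    }
}
```

Because the accumulator does not depend on either metrics library, a thin adapter per registry (one for Kafka Metrics, one for Yammer) could call measure from its own gauge or stat type.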