Hi,

We (Heroku) are very excited about this KIP, as we've struggled a bit with controller stability recently. Having these additional metrics would be wonderful.

I'd like to ensure polling these metrics *doesn't* hold any locks etc., because, as noted in https://issues.apache.org/jira/browse/KAFKA-5120, that lock can be held for quite some time. This may no longer be an issue as of KAFKA-5028, though.

Lastly, I'd love to see some metrics around how long the controller spends inside its lock. We've been tracking an issue (https://issues.apache.org/jira/browse/KAFKA-5116) where it can hold the lock for many, many minutes in a zk client listener thread when responding to a single request. I'm not sure how that plays into https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume will land before this metrics patch), but it feels like there will be equivalent problems ("how long does it spend processing any individual message from the queue, broken down by message type").

These are minor improvements though; the addition of more metrics to the controller is already going to be very helpful.

Thanks

Tom Crayford
Heroku Kafka

On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> Hi all,
>
> We've posted "KIP-143: Controller Health Metrics" for discussion:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-143%3A+Controller+Health+Metrics
>
> Please take a look. Your feedback is appreciated.
>
> Thanks,
> Ismael
>
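
To make the "time spent per message type" idea above a bit more concrete, here is a rough sketch of the kind of bookkeeping I have in mind. It is only an illustration: `EventTimings` and the event type names are hypothetical, not existing Kafka classes or the KIP's proposed metric names, and a real patch would presumably report through the broker's existing metrics library. The point is that the reader side only sums counters and never needs the controller's lock.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical helper (not part of Kafka): accumulates handler time per event type.
final class EventTimings {
    // One counter pair per event type. Polling a value only calls LongAdder.sum(),
    // which does not block threads that are recording.
    private final Map<String, LongAdder> totalNanos = new ConcurrentHashMap<>();
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Runs the handler and records its wall-clock duration under eventType,
    // even if the handler throws.
    void time(String eventType, Runnable handler) {
        long start = System.nanoTime();
        try {
            handler.run();
        } finally {
            record(eventType, System.nanoTime() - start);
        }
    }

    // Records an already-measured duration for eventType.
    void record(String eventType, long nanos) {
        totalNanos.computeIfAbsent(eventType, k -> new LongAdder()).add(nanos);
        counts.computeIfAbsent(eventType, k -> new LongAdder()).increment();
    }

    // Number of events of this type processed so far (0 if never seen).
    long count(String eventType) {
        LongAdder c = counts.get(eventType);
        return c == null ? 0L : c.sum();
    }

    // Total nanoseconds spent processing events of this type (0 if never seen).
    long totalNanos(String eventType) {
        LongAdder t = totalNanos.get(eventType);
        return t == null ? 0L : t.sum();
    }
}
```

The thread draining the controller queue would wrap each handler, e.g. `timings.time("BrokerChange", () -> handle(event))` (both the type name and `handle` are made up here), and a metrics reporter could poll `count` and `totalNanos` from any other thread.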