Hi All,

I'd like to start a discussion about exposing gauge metrics for the replica fetcher and log cleaner thread counts. It isn't a long KIP and the motivation is simple: monitoring these thread counts would help with investigating various issues and might help prevent more serious problems when a broker is in a bad state. One scenario we have seen with users is that their disk fills up because the log cleaner died for some reason (such as log corruption) and couldn't recover. In this case an early warning would help with root cause analysis and would make it possible to detect and resolve the problem early on.
The KIP is here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics

I'd be happy to receive any feedback on this.

Regards,
Viktor