Hi All,

I'd like to start a discussion about exposing gauge metrics for the
replica fetcher and log cleaner thread counts. It isn't a long KIP and the
motivation is very simple: monitoring the thread counts in these cases
would help with the investigation of various issues and might help in
preventing more serious issues when a broker is in a bad state. One
scenario we have seen with users is that their disk fills up because the log
cleaner died for some reason (such as log corruption) and couldn't recover. In
this case an early warning would help in the root cause analysis process as
well as enable detecting and resolving the problem early on.

The KIP is here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics

I'd be happy to receive any feedback on this.

Regards,
Viktor
