Hi friends! :)

I believe we currently have a gap in KafkaConsumer error metrics: the KafkaConsumer is complex, and there are many places where things can go wrong. Currently, these failures are logged, and certain ones can be inferred from the existing metrics (e.g. heartbeat-rate).
This KIP seeks to improve monitoring and alerting for the consumer by providing error metrics for the Fetcher class:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-356%3A+Add+KafkaConsumer+fetch-error-rate+and+fetch-error-total+metrics

There are also a few other places in the Fetcher where errors may happen (parsing completed fetches, offset requests, etc.), but it may be more appropriate to monitor those with separate metrics.

Any thoughts?

Thanks!

Regards,
Kevin