[ 
https://issues.apache.org/jira/browse/KAFKA-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Randall Hauch updated KAFKA-6505:
---------------------------------
    Labels: needs-kip  (was: )

> Add simple raw "offset-commit-failures", "offset-commits" and 
> "offset-commit-successes" count metric
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6505
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6505
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 1.0.0
>            Reporter: Per Steffensen
>            Priority: Minor
>              Labels: needs-kip
>
> MBean 
> "kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x" 
> has several attributes. Most of them seem to be avg/max/pct over the entire 
> lifetime of the process. They are not very useful when monitoring a system, 
> where you typically want to see when there have been problems and if there 
> are problems right now.
> E.g. I would like to expose to an administrator when offset-commits have been 
> failing (e.g. timing out), including whether they are failing right now. It is 
> really hard to do that properly using only the attribute 
> "offset-commit-failure-percentage". You can expose a number telling how much 
> the percentage has changed between two consecutive polls of the metric: if 
> it increased, we saw offset-commit failures, and if it decreased (or is 
> stable at 0), we saw offset-commit successes. That works only as long as 
> the system has not been running for so long that a single failing 
> offset-commit no longer changes the percentage noticeably. But it is 
> really odd to do it this way.
> *I would like to just see an attribute "offset-commit-failures" just counting 
> how many offset-commits have failed, as an ever-increasing number. Maybe also 
> attributes "offset-commits" and "offset-commit-successes". Then I can take the 
> delta between the last two metric polls to show how many 
> offset-commit-attempts have failed "very recently". Let this ticket be about 
> that particular added attribute (or the three added attributes).*
> Just a note on metrics IMHO (should probably be posted somewhere else):
> In general consider getting rid of stuff like avg, max, pct over the entire 
> lifetime of the process - current state is what interests people, especially 
> when it comes to failure-related metrics (failure-pct over the lifetime of 
> the process is not very useful). And people will continuously be polling and 
> storing the metrics, so we will have a history of "current state" somewhere 
> else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring 
> tools can do all the avg, max, pct for you based on a time-series of 
> metrics-poll-results - and they can do it for periods of your choice (e.g. 
> average over the last minute or 5 minutes) - have a look at Prometheus PromQL 
> (e.g. used through Grafana). Just expose the raw number and let the 
> average/max/min/pct calculation be done on the collect/presentation side. 
> Only do "advanced" stuff for cases that are very interesting and where it 
> cannot be done based on simple raw numbers (e.g. percentiles), and consider 
> whether doing it for fairly short intervals is better than for the entire 
> lifetime of the process.
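
To illustrate the request in the description above, here is a minimal sketch of what a collector could do with the proposed raw counters. It assumes the counters exist and are polled periodically. `CommitSample`, `counter_delta`, `recent_failures`, `recent_failure_pct` and `lifetime_failure_pct` are hypothetical names made up for this example and are not part of Kafka Connect. The sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitSample:
    """One poll of the proposed counters (hypothetical shape, not a Kafka API)."""
    commits: int   # proposed "offset-commits": attempts since process start
    failures: int  # proposed "offset-commit-failures": failed attempts since start


def counter_delta(previous: int, current: int) -> int:
    """Increase of an ever-increasing counter between two polls."""
    # A drop means the counter restarted from zero (e.g. the worker was
    # restarted), so the whole current value counts as new.
    return current - previous if current >= previous else current


def recent_failures(prev: CommitSample, cur: CommitSample) -> int:
    """Failed offset commits between the last two polls."""
    return counter_delta(prev.failures, cur.failures)


def recent_failure_pct(prev: CommitSample, cur: CommitSample) -> float:
    """Failure percentage over the interval between the last two polls."""
    attempts = counter_delta(prev.commits, cur.commits)
    if attempts == 0:
        return 0.0  # no commit attempts in the interval
    return 100.0 * recent_failures(prev, cur) / attempts


def lifetime_failure_pct(sample: CommitSample) -> float:
    """Failure percentage since process start, for comparison."""
    if sample.commits == 0:
        return 0.0
    return 100.0 * sample.failures / sample.commits


# Invented sample values: a long-running task that just had 3 failed commits.
earlier = CommitSample(commits=10_000, failures=0)
latest = CommitSample(commits=10_010, failures=3)

print(recent_failures(earlier, latest))
print(recent_failure_pct(earlier, latest))
print(round(lifetime_failure_pct(latest), 4))
```

With these made-up numbers, 3 of the last 10 attempts failed, so the interval percentage is 30%, while the percentage since process start is only about 0.03%. That is the effect the description complains about: the longer the process runs, the less a lifetime percentage says about what is happening right now.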



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
