Per Steffensen created KAFKA-6505:
-------------------------------------

             Summary: Add simple raw "offset-commit-failures", "offset-commits" 
and "offset-commit-successes" count metric
                 Key: KAFKA-6505
                 URL: https://issues.apache.org/jira/browse/KAFKA-6505
             Project: Kafka
          Issue Type: Improvement
            Reporter: Per Steffensen


MBean 
"kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x" 
has several attributes. Most of them seem to be avg/max/pct over the entire 
lifetime of the process. They are not very useful when monitoring a system, 
where you typically want to see when there have been problems and whether 
there are problems right now.

E.g. I would like to show an administrator when offset-commits have been 
failing (e.g. timing out), including whether they are failing right now. That 
is really hard to do properly using only the attribute 
"offset-commit-failure-percentage". You can expose a number telling how much 
the percentage changed between two consecutive polls of the metric: if it 
increased, we saw offset-commit failures, and if it decreased (or is stable at 
0), we saw offset-commit successes. That only works as long as the system has 
not been running for so long that a single failing offset-commit no longer 
changes the percentage noticeably. But it is a really odd way to do it.
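
To illustrate why the percentage stops reacting, here is a small sketch. It 
assumes, as described above, that the percentage is computed over all commits 
since process start; the function names are made up for this example and are 
not part of Kafka.

```python
def lifetime_failure_pct(failures: int, total: int) -> float:
    """Failure percentage over all commits since process start (assumed model)."""
    return 100.0 * failures / total if total else 0.0


def pct_change_from_one_failure(prior_total: int, prior_failures: int = 0) -> float:
    """How much the lifetime percentage moves when one more commit fails."""
    before = lifetime_failure_pct(prior_failures, prior_total)
    after = lifetime_failure_pct(prior_failures + 1, prior_total + 1)
    return after - before


# The longer the process has been running, the smaller the visible change.
for prior_commits in (9, 999, 999_999):
    print(prior_commits, pct_change_from_one_failure(prior_commits))
```

After roughly a million successful commits, a single failure moves the 
percentage by about one ten-thousandth of a percentage point, which is easy to 
lose to rounding or display precision on the monitoring side.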

*I would like an attribute "offset-commit-failures" that simply counts how 
many offset-commits have failed, as an ever-increasing number. Maybe also 
attributes "offset-commits" and "offset-commit-successes". Then I can take the 
delta between the two most recent metric polls to show how many offset-commit 
attempts have failed "very recently". Let this ticket be about that particular 
added attribute (or the three added attributes).*
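
A minimal sketch of the delta computation on the polling side, assuming the 
proposed "offset-commit-failures" attribute exists and has already been read 
into a plain integer per poll. The helper name and the reset handling (treating 
a decreasing value as a process restart) are assumptions for illustration only.

```python
from typing import Optional


def counter_delta(previous: Optional[int], current: int) -> int:
    """Increase of an ever-increasing counter between two consecutive polls.

    If the value went down, assume the process restarted and the counter
    began again at zero, so everything counted so far is new.
    """
    if previous is None:
        return 0  # first poll: nothing to compare against yet
    if current < previous:
        return current  # assumed restart
    return current - previous


# Hypothetical polled values of "offset-commit-failures".
polls = [3, 3, 5, 1]
previous = None
deltas = []
for value in polls:
    deltas.append(counter_delta(previous, value))
    previous = value

print(deltas)  # [0, 0, 2, 1]
```

A non-zero delta on the latest poll means offset-commits failed "very 
recently", independent of how long the process has been running.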



Just a note on metrics IMHO (should probably be posted somewhere else):

In general consider getting rid of stuff like avg, max, pct over the entire 
lifetime of the process - current state is what interests people, especially 
when it comes to failure-related metrics (failure-pct over the lifetime of the 
process is not very useful). And people will continuously be polling and 
storing the metrics, so we will have a history of "current state" somewhere 
else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring tools 
can do all the avg, max, pct for you based on a time-series of 
metrics-poll-results - and they can do it for periods of your choice (e.g. 
average over the last minute or 5 minutes) - have a look at Prometheus PromQL 
(e.g. used through Grafana). Just expose the raw number and let the 
average/max/min/pct calculation be done on the collect/presentation side. Only 
do "advanced" stuff for cases that are very interesting and where it cannot be 
done based on simple raw numbers (e.g. percentiles), and consider whether doing 
it for fairly short intervals is better than for the entire lifetime of the 
process.
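
As a rough sketch of what "done on the collect/presentation side" can look 
like, the following computes a failure percentage over a window of your choice 
from stored raw counts. The sample layout, the function name, and the 
simplifications (samples sorted by time, no counter resets inside the window) 
are assumptions for this example; a tool like Prometheus does this kind of 
calculation for you.

```python
from typing import List, Optional, Tuple

# (poll time in seconds, offset-commits so far, offset-commit-failures so far)
Sample = Tuple[float, int, int]


def failure_pct_over_window(
    samples: List[Sample], now: float, window_s: float
) -> Optional[float]:
    """Failure percentage among commits attempted within the last window_s seconds.

    Returns None when the window holds fewer than two samples or no commits
    were attempted in it.
    """
    in_window = [s for s in samples if now - window_s <= s[0] <= now]
    if len(in_window) < 2:
        return None
    _, first_total, first_failed = in_window[0]
    _, last_total, last_failed = in_window[-1]
    attempts = last_total - first_total
    if attempts <= 0:
        return None
    return 100.0 * (last_failed - first_failed) / attempts


# Hypothetical history: five commits failed between t=60 and t=120.
history = [(0, 100, 0), (60, 110, 0), (120, 120, 5), (180, 130, 5)]

print(failure_pct_over_window(history, now=180, window_s=60))
print(failure_pct_over_window(history, now=180, window_s=120))
```

With the same stored counts, the last minute shows no failures while the last 
two minutes show that a quarter of the attempts failed, which is the kind of 
"period of your choice" view a lifetime percentage cannot give.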



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
