Per Steffensen created KAFKA-6505:
-------------------------------------
Summary: Add simple raw "offset-commit-failures", "offset-commits"
and "offset-commit-successes" count metric
Key: KAFKA-6505
URL: https://issues.apache.org/jira/browse/KAFKA-6505
Project: Kafka
Issue Type: Improvement
Reporter: Per Steffensen
MBean
"kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x"
has several attributes. Most of them seem to be avg/max/pct over the entire
lifetime of the process. They are not very useful when monitoring a system,
where you typically want to see when there have been problems and whether
there are problems right now.
E.g. I would like to show an administrator when offset-commits have been
failing (e.g. timing out), including whether they are failing right now. It is
really hard to do that properly using only the attribute
"offset-commit-failure-percentage". You can expose a number telling how much
the percentage has changed between two consecutive polls of the metric: if it
increased, we saw offset-commit failures, and if it decreased (or is stable at
0) we saw offset-commit successes. That only works as long as the system has
not been running for so long that a single failing offset-commit no longer
changes the percentage noticeably. But it is a really odd way to do it.
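
To illustrate why the percentage-delta workaround stops working, here is a
minimal sketch. It assumes the percentage is computed as failures divided by
total commits over the lifetime of the process, which is how the attribute
appears to behave; the function names are hypothetical and not part of Kafka.

```python
def failure_pct(failures, total):
    """Lifetime failure percentage (assumed formula: failures / total * 100)."""
    return 100.0 * failures / total if total else 0.0


def pct_change_after_one_failure(failures, total):
    """How much the lifetime percentage moves when one more commit fails."""
    return failure_pct(failures + 1, total + 1) - failure_pct(failures, total)


# Early in the process lifetime a single failure is clearly visible ...
early = pct_change_after_one_failure(0, 9)
# ... but after many successful commits the same failure barely registers.
late = pct_change_after_one_failure(0, 999_999)
print(early, late)
```

With 9 earlier commits one failure moves the percentage by 10 points; with
999,999 earlier commits it moves it by about 0.0001 points, which is easy to
lose to rounding or to miss on a dashboard.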
*I would like to simply see an attribute "offset-commit-failures" that counts
how many offset-commits have failed, as an ever-increasing number. Maybe also
attributes "offset-commits" and "offset-commit-successes". Then I can take the
delta between the two latest metric polls to show how many offset-commit
attempts have failed "very recently". Let this ticket be about that particular
added attribute (or the three added attributes).*
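
A minimal sketch of the delta calculation on the monitoring side, assuming the
proposed "offset-commit-failures" attribute exists and is polled periodically.
The function name and the counter-reset handling (treating a drop in the value
as a process restart) are assumptions for illustration only.

```python
def failures_since_last_poll(previous, current):
    """Failures that happened between two polls of a cumulative counter."""
    if current < previous:
        # Assumption: the counter only decreases when the process restarted,
        # so everything counted since the restart is new.
        return current
    return current - previous


# Hypothetical values of "offset-commit-failures" from five consecutive polls.
polls = [0, 0, 3, 3, 7]
deltas = [failures_since_last_poll(a, b) for a, b in zip(polls, polls[1:])]
print(deltas)  # [0, 3, 0, 4]
```

A non-zero delta means commits failed during the latest poll interval, no
matter how long the process has been running.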
Just a note on metrics IMHO (should probably be posted somewhere else):
In general, consider getting rid of stuff like avg, max, pct over the entire
lifetime of the process. Current state is what interests people, especially
when it comes to failure-related metrics (failure-pct over the lifetime of the
process is not very useful). People will be polling and storing the metrics
continuously, so a history of "current state" will exist somewhere else (e.g.
in Prometheus). Just give us the raw counts. Modern monitoring tools can do all
the avg, max, pct for you based on a time series of metric-poll results, and
they can do it for periods of your choice (e.g. average over the last minute or
5 minutes). Have a look at Prometheus PromQL (e.g. used through Grafana). Just
expose the raw number and let the average/max/min/pct calculation be done on
the collection/presentation side. Only do "advanced" stuff for cases that are
very interesting and where it cannot be derived from simple raw numbers (e.g.
percentiles), and consider whether doing it for fairly short intervals is
better than for the entire lifetime of the process.
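
As a rough sketch of what the collection side can do with raw counts, the
following computes a failure percentage over a window of your choice from
stored poll results. It is similar in spirit to what PromQL's increase() gives
you over a range, but it is not Prometheus code. The sample layout and the
function name are assumptions, and counter resets are ignored for brevity.

```python
def windowed_failure_pct(samples, window_seconds):
    """Failure percentage over the most recent window.

    samples: list of (timestamp_seconds, total_commits, failed_commits),
    oldest first, where both counts are cumulative raw counters.
    Returns None when the window does not contain enough data.
    """
    if len(samples) < 2:
        return None
    latest_ts, latest_total, latest_failed = samples[-1]
    # Keep only the samples that fall inside the requested window.
    in_window = [s for s in samples if latest_ts - s[0] <= window_seconds]
    if len(in_window) < 2:
        return None
    _, first_total, first_failed = in_window[0]
    attempts = latest_total - first_total
    if attempts <= 0:
        return None  # no commits in the window (or a counter reset)
    return 100.0 * (latest_failed - first_failed) / attempts


# Hypothetical poll results, one per minute.
samples = [(0, 0, 0), (60, 10, 0), (120, 20, 5), (180, 30, 5)]
last_minute = windowed_failure_pct(samples, 60)
last_two_minutes = windowed_failure_pct(samples, 120)
print(last_minute, last_two_minutes)
```

The same stored counts answer both "is it failing right now?" (last minute) and
"was it failing recently?" (last two minutes), which a lifetime percentage
cannot do.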