[ https://issues.apache.org/jira/browse/KAFKA-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Randall Hauch updated KAFKA-6505:
---------------------------------
    Labels: needs-kip  (was: )

> Add simple raw "offset-commit-failures", "offset-commits" and "offset-commit-successes" count metric
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6505
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6505
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 1.0.0
>            Reporter: Per Steffensen
>            Priority: Minor
>              Labels: needs-kip
>
> MBean
> "kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x"
> has several attributes. Most of them seem to be avg/max/pct over the entire
> lifetime of the process. They are not very useful when monitoring a system,
> where you typically want to see when there have been problems and whether
> there are problems right now.
>
> E.g. I would like to show an administrator when offset-commits have been
> failing (e.g. timing out), including whether they are failing right now. It
> is really hard to do that properly using only the attribute
> "offset-commit-failure-percentage". You can expose a number telling how much
> the percentage has changed between two consecutive polls of the metric: if
> it increased, we saw offset-commit failures, and if it decreased (or is
> stable at 0), we saw offset-commit successes. That only works as long as the
> system has not been running for so long that a single failing offset-commit
> no longer changes the percentage. And it is a really odd way to do it.
>
> *I would like to just see an attribute "offset-commit-failures" counting
> how many offset-commits have failed, as an ever-increasing number. Maybe also
> attributes "offset-commits" and "offset-commit-successes". Then I can take
> the delta between the two latest metric polls to show how many
> offset-commit attempts have failed "very recently". Let this ticket be about
> that particular added attribute (or the three added attributes).*
>
> Just a note on metrics, IMHO (should probably be posted somewhere else):
> In general, consider getting rid of stuff like avg, max, pct over the entire
> lifetime of the process. Current state is what interests people, especially
> when it comes to failure-related metrics (failure-pct over the lifetime of
> the process is not very useful). People will continuously be polling and
> storing the metrics, so there will be a history of "current state" somewhere
> else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring
> tools can do all the avg, max, pct for you based on a time series of
> metric-poll results, and they can do it for periods of your choice (e.g.
> average over the last minute or 5 minutes). Have a look at Prometheus PromQL
> (e.g. used through Grafana). Just expose the raw number and let the
> average/max/min/pct calculation be done on the collection/presentation side.
> Only do "advanced" stuff for cases that are very interesting and where it
> cannot be done based on a simple raw number (e.g. percentiles), and consider
> whether doing it for fairly short intervals is better than for the entire
> lifetime of the process.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
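
For illustration, the "delta between the two latest metric polls" idea from the ticket could be sketched roughly as below. This is not Kafka Connect code. The helper name `counter_deltas` and the sample values are hypothetical, the "offset-commit-failures" attribute is only the one proposed above, and the sketch assumes that a reading lower than the previous one means the process restarted and the counter began again from zero.

```python
from typing import Iterable, List


def counter_deltas(samples: Iterable[int]) -> List[int]:
    """Return the per-interval increase of an ever-increasing counter.

    Assumption: if a sample is lower than the previous one, the process
    restarted and the counter began again from zero, so the new sample
    itself is taken as the increase for that interval.
    """
    deltas: List[int] = []
    previous = None
    for current in samples:
        if previous is not None:
            if current >= previous:
                deltas.append(current - previous)
            else:
                # Counter went backwards: treat as a restart.
                deltas.append(current)
        previous = current
    return deltas


# Hypothetical readings of the proposed "offset-commit-failures" attribute,
# one per metric poll. The last reading drops, simulating a restart.
polled_failures = [0, 0, 3, 3, 7, 1]
recent_failures = counter_deltas(polled_failures)
```

Each entry of `recent_failures` is the number of failures seen since the previous poll, so a non-zero latest entry would mean offset-commits are failing "right now", no matter how long the process has been running.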