[ https://issues.apache.org/jira/browse/KAFKA-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372470#comment-16372470 ]
Ewen Cheslack-Postava commented on KAFKA-6505:
----------------------------------------------
[~steff1193] A KIP is technically required, but for anything fairly obvious it
can be mostly a formality (though the value of having the process is that
seemingly simple improvements frequently have important details and nuances
that are not immediately recognized).
I didn't notice a KIP for this yet, but for simple stuff like this the KIP
overhead is pretty minimal – basically just write up some notes on the change
so that people have a chance to evaluate it, see any important compatibility
notes, etc.
If any guidance on the KIP process would help, [~rhauch], [~wushujames], and I
(and I'm sure others) would be happy to provide it. I have only skimmed these
changes, but they seem straightforward, so I assume the KIP would mostly just
breeze through review.
> Add simple raw "offset-commit-failures", "offset-commits" and
> "offset-commit-successes" count metric
> ----------------------------------------------------------------------------------------------------
>
> Key: KAFKA-6505
> URL: https://issues.apache.org/jira/browse/KAFKA-6505
> Project: Kafka
> Issue Type: Improvement
> Components: KafkaConnect
> Affects Versions: 1.0.0
> Reporter: Per Steffensen
> Priority: Minor
> Labels: needs-kip
>
> MBean
> "kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x"
> has several attributes. Most of them seem to be avg/max/pct over the entire
> lifetime of the process. They are not very useful when monitoring a system,
> where you typically want to see when there have been problems and whether
> there are problems right now.
> E.g. I would like to expose to an administrator when offset-commits have been
> failing (e.g. timing out) including if they are failing right now. It is
> really hard to do that properly using only the attribute
> "offset-commit-failure-percentage". You can expose a number telling how much
> the percentage has changed between two consecutive polls of the metric: if
> it increased, we saw offset-commit failures, and if it decreased (or is
> stable at 0), we saw offset-commit successes, at least as long as the system
> has not been running for so long that a single failing offset-commit no
> longer changes the percentage. But it is really odd to do it this way.
> *I would like to see an attribute "offset-commit-failures" that simply counts
> how many offset-commits have failed, as an ever-increasing number. Maybe also
> attributes "offset-commits" and "offset-commit-successes". Then I can take the
> delta between the last two metric polls to show how many
> offset-commit attempts have failed "very recently". Let this ticket be about
> that particular added attribute (or the three added attributes).*
> Just a note on metrics IMHO (should probably be posted somewhere else):
> In general consider getting rid of stuff like avg, max, pct over the entire
> lifetime of the process - current state is what interests people, especially
> when it comes to failure-related metrics (failure-pct over the lifetime of
> the process is not very useful). And people will continuously be polling and
> storing the metrics, so we will have a history of "current state" somewhere
> else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring
> tools can do all the avg, max, pct for you based on a time-series of
> metrics-poll-results - and they can do it for periods of your choice (e.g.
> average over the last minute or 5 minutes) - have a look at Prometheus PromQL
> (e.g. used through Grafana). Just expose the raw number and let the
> average/max/min/pct calculation be done on the collect/presentation side.
> Only do "advanced" stuff for cases that are very interesting and where it
> cannot be done based on simple raw numbers (e.g. percentiles), and consider
> whether doing it for fairly short intervals is better than for the entire
> lifetime of the process.
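
The delta-between-polls idea in the description can be sketched as follows. This
is only an illustration, not Kafka Connect code: the `CommitCounts` class, both
helper functions, and the sample numbers are hypothetical, and the counter
attributes are the ones proposed in the ticket, not existing ones. It also shows,
under the description's assumption that the percentage is computed over the
whole process lifetime, why that percentage barely moves on a long-running
worker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommitCounts:
    """One poll of the proposed ever-increasing counters (hypothetical)."""
    commits: int   # proposed "offset-commits" attribute
    failures: int  # proposed "offset-commit-failures" attribute


def recent_failures(previous: CommitCounts, current: CommitCounts) -> int:
    """Failures that happened between two consecutive polls."""
    delta = current.failures - previous.failures
    # A negative delta means the counter went backwards, e.g. after a
    # process restart; then everything counted so far is "recent".
    return delta if delta >= 0 else current.failures


def lifetime_failure_pct(counts: CommitCounts) -> float:
    """Failure percentage over all commits seen so far (assumed semantics)."""
    return 100.0 * counts.failures / counts.commits if counts.commits else 0.0


# Made-up sample polls from a worker that has been running for a long time.
before = CommitCounts(commits=1_000_000, failures=10)
after = CommitCounts(commits=1_000_060, failures=13)

failed_recently = recent_failures(before, after)
pct_change = lifetime_failure_pct(after) - lifetime_failure_pct(before)
print("failures since last poll:", failed_recently)
print("change in lifetime failure percentage:", pct_change)
```

With these sample values the raw counter reports three new failures directly,
while the lifetime percentage changes by well under a thousandth of a
percentage point, which is easy to miss on a dashboard.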