[jira] [Commented] (KAFKA-6505) Add simple raw "offset-commit-failures", "offset-commits" and "offset-commit-successes" count metric

Ewen Cheslack-Postava (JIRA) Wed, 21 Feb 2018 22:33:34 -0800

    [ https://issues.apache.org/jira/browse/KAFKA-6505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16372470#comment-16372470 ]


Ewen Cheslack-Postava commented on KAFKA-6505:
----------------------------------------------

[~steff1193] A KIP is technically required, but for anything fairly obvious it 
can be mostly a formality (though the value of having the process is that 
seemingly simple improvements frequently have important details and nuances 
that are not immediately recognized).

I didn't notice a KIP for this yet, but for simple stuff like this the KIP 
overhead is pretty minimal – basically just write up some notes on the change 
such that people have a chance to evaluate it, see any important compatibility 
notes, etc.

If any guidance on the KIP process would help, [~rhauch], [~wushujames], and I 
(and I'm sure others) would be happy to provide it. Having only skimmed them, 
these changes seem straightforward, so I assume the KIP would mostly just 
breeze through review.

 

> Add simple raw "offset-commit-failures", "offset-commits" and 
> "offset-commit-successes" count metric
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6505
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6505
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>    Affects Versions: 1.0.0
>            Reporter: Per Steffensen
>            Priority: Minor
>              Labels: needs-kip
>
> MBean 
> "kafka.connect:type=connector-task-metrics,connector=<connector-name>,task=x" 
> has several attributes. Most of them seem to be avg/max/pct over the entire 
> lifetime of the process. They are not very useful when monitoring a system, 
> where you typically want to see when there have been problems and whether 
> there are problems right now.
> E.g. I would like to expose to an administrator when offset-commits have been 
> failing (e.g. timing out) including if they are failing right now. It is 
> really hard to do that properly, just using attribute 
> "offset-commit-failure-percentage". You can expose a number telling how much 
> the percentage has changed between two consecutive polls of the metric - if 
> it changed to the positive side, we saw offset-commit failures, and if it 
> changed to the negative side (or is stable at 0) we saw offset-commit success 
> - at least as long as the system has not been running for so long that a 
> single failing offset-commit does not even change the percentage. But it is 
> really odd to do it this way.
> *I would like to just see an attribute "offset-commit-failures" just counting 
> how many offset-commits have failed, as an ever-increasing number. Maybe also 
> attributes "offset-commits" and "offset-commit-successes". Then I can do a 
> delta between the two last metric-polls to show how many 
> offset-commit-attempts have failed "very recently". Let this ticket be about 
> that particular added attribute (or the three added attributes).*
> Just a note on metrics IMHO (should probably be posted somewhere else):
> In general consider getting rid of stuff like avg, max, pct over the entire 
> lifetime of the process - current state is what interests people, especially 
> when it comes to failure-related metrics (failure-pct over the lifetime of 
> the process is not very useful). And people will continuously be polling and 
> storing the metrics, so we will have a history of "current state" somewhere 
> else (e.g. in Prometheus). Just give us the raw counts. Modern monitoring 
> tools can do all the avg, max, pct for you based on a time-series of 
> metrics-poll-results - and they can do it for periods of your choice (e.g. 
> average over the last minute or 5 minutes) - have a look at Prometheus PromQL 
> (e.g. used through Grafana). Just expose the raw number and let the 
> average/max/min/pct calculation be done on the collect/presentation side. 
> Only do "advanced" stuff for cases that are very interesting and where it 
> cannot be done based on a simple raw number (e.g. percentiles), and consider 
> whether doing it for fairly short intervals is better than for the entire 
> lifetime of the process.
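
For reference, the delta-between-polls idea from the description above could 
look roughly like the sketch below. This is only an illustration, not Kafka 
Connect code: the function names and the sample poll values are made up, and 
treating a decreasing counter as a process restart is an assumption of this 
sketch rather than anything specified in the ticket.

```python
def recent_failures(previous_count, current_count):
    """Failures seen between two polls of an ever-increasing counter."""
    if current_count < previous_count:
        # Assumption: the counter went backwards because the process
        # restarted and the counter started again from zero.
        return current_count
    return current_count - previous_count


def lifetime_failure_pct(failures, commits):
    """Failure percentage over the whole lifetime of the process."""
    return 100.0 * failures / commits if commits else 0.0


# Hypothetical consecutive polls: (total offset-commits, total failures).
previous_poll = (100_000, 10)
current_poll = (100_010, 15)

# The raw counter shows directly that 5 commits failed since the last poll.
delta = recent_failures(previous_poll[1], current_poll[1])

# The lifetime percentage barely moves for the same 5 failures.
pct_change = (lifetime_failure_pct(current_poll[1], current_poll[0])
              - lifetime_failure_pct(previous_poll[1], previous_poll[0]))

print(f"failures since last poll: {delta}")
print(f"change in lifetime failure percentage: {pct_change:.4f}")
```

With these made-up numbers, half of the last ten commits failed, yet the 
lifetime percentage moves by well under a hundredth of a percentage point, 
which is the problem the description points out.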



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
