Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

Ismael Juma Tue, 29 Aug 2017 02:35:08 -0700

Hi Rajini,

There are a few other metrics that could potentially be useful. I'd be
interested in what you and the community thinks:


1. The KIP currently includes `FetchDownConversionsPerSec`, which is
useful. In the common case, one would want to avoid down conversion by
using the lower message format supported by most of the consumers. However,
there are good reasons to use a newer message format even if there are some
legacy consumers around. It would be good to quantify the cost of these
consumers a bit more clearly. Looking at the request metric `LocalTimeMs`
provides a hint, but it may be useful to have a dedicated
`FetchDownConversionsMs` metric.

2. Large messages can cause GC issues (it's particularly bad if fetch down
conversion takes place). One can currently configure the max message batch
size per topic to keep this under control, but that is the size after
compression. However, we decompress the batch to validate produce requests
and we decompress and recompress during fetch downconversion). It would be
helpful to have topic metrics for the produce message batch size after
decompression (and perhaps compressed as well as that would help understand
the compression ratio).

3. Authentication success/failures per second. This is helpful to
understand if some clients are misconfigured or if bad actors are trying to
authenticate.

Thoughts?

Ismael



On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao <j...@confluent.io> wrote:

> Hi, Rajini,
>
> Yes, if those error metrics are registered dynamically, we could worry
> about expiration later.
>
> Thanks,
>
> Jun
>
> On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram <rajinisiva...@gmail.com>
> wrote:
>
> > Perhaps we could register dynamically for now. If we find that the cost
> of
> > retaining these is high, we can add the code to expire them later. Is
> that
> > ok?
> >
> > Regards,
> >
> > Rajini
> >
> >
> >
> > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Can we quantify the cost of having these metrics around if they are
> > > dynamically registered? Given that the maximum is bounded at
> development
> > > time, is it really worth all the extra code?
> > >
> > > Ismael
> > >
> > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram <
> rajinisiva...@gmail.com
> > >
> > > wrote:
> > >
> > > > Jun,
> > > >
> > > > It feels more consistent to add errors as yammer metrics similar to
> > other
> > > > request metrics. Perhaps we could add some code to track and remove
> > these
> > > > if unused? It is a bit more work, but it would keep the externals
> > > > consistent.
> > > >
> > > > Ismael/Manikumar,
> > > >
> > > > Agree that version as a String attribute makes more sense.
> > Unfortunately,
> > > > the whole KafkaMetric implementation is written around a single
> > "double"
> > > > type, so introducing a new type is a big change. But I suppose it can
> > be
> > > > done. I have updated the KIP.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > >
> > > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar <
> manikumar.re...@gmail.com>
> > > > wrote:
> > > >
> > > > > I agree it will be good if we can add  "commit id/version" as an
> > > > > attribute value.
> > > > > It will be easy to parse. But as of now, KafkaMetric supports only
> > > > > numerical values.
> > > > >
> > > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma <ism...@juma.me.uk>
> > > wrote:
> > > > >
> > > > > > Hi Rajini,
> > > > > >
> > > > > > About the gauges, I was thinking that the attribute would be the
> > > value
> > > > > > (i.e. commit id or version). I understand that Kafka Metrics
> > doesn't
> > > > > > support this (unlike Yammer Metrics), but would it make sense to
> > add?
> > > > > >
> > > > > > Ismael
> > > > > >
> > > > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Ismael,
> > > > > > >
> > > > > > > Thank you for the review.
> > > > > > >
> > > > > > > 1. Agree on keeping it simple with dynamic registration and no
> > > > expiry.
> > > > > > Will
> > > > > > > wait for Jun's feedback before updating KIP.
> > > > > > > 2. I have switched to two metrics for commit-id and version
> (not
> > > sure
> > > > > if
> > > > > > it
> > > > > > > matches what you meant). I also added the client-id tag which
> is
> > > used
> > > > > in
> > > > > > > all metrics from clients.
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Rajini
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma <
> ism...@juma.me.uk
> > >
> > > > > wrote:
> > > > > > >
> > > > > > > > Thanks for the KIP, Rajini. I think this is helpful too. A
> few
> > > > minor
> > > > > > > > comments.
> > > > > > > >
> > > > > > > > 1. About the number of metrics and expiration, if we
> > dynamically
> > > > > > register
> > > > > > > > metrics for the error codes, the number is likely to be much
> > > lower
> > > > > than
> > > > > > > > 30*30, probably less than 100. If we were using Kafka Metrics
> > for
> > > > > this,
> > > > > > > we
> > > > > > > > could easily add a long expiration period to be conservative,
> > > but I
> > > > > am
> > > > > > > not
> > > > > > > > sure this is supported by Yammer Metrics. If it is not,
> there's
> > > an
> > > > > > > argument
> > > > > > > > for keeping it simple.
> > > > > > > >
> > > > > > > > 2. Would it make sense to use 2 gauges for the version and
> > commit
> > > > id?
> > > > > > It
> > > > > > > > seems more intuitive than having those values as tags.
> > > > > > > >
> > > > > > > > Ismael
> > > > > > > >
> > > > > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> > > > > > > rajinisiva...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi Jun,
> > > > > > > > >
> > > > > > > > > Thank you for the review.
> > > > > > > > >
> > > > > > > > > 1. Makes sense. I have updated the KIP.
> > > > > > > > > 2. Moved to a new group ZooKeeperClient
> > > > > > > > > 3. It is a gauge, so it will have a single attribute called
> > > Value
> > > > > > with
> > > > > > > a
> > > > > > > > > constant value of 1.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Rajini
> > > > > > > > >
> > > > > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao <j...@confluent.io
> >
> > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Rajini,
> > > > > > > > > >
> > > > > > > > > > Thanks for the KIP. A few comments.
> > > > > > > > > >
> > > > > > > > > > 1. We have 30+ requests and 30+ error code and growing.
> So,
> > > the
> > > > > > > > > combination
> > > > > > > > > > can be large. Perhaps it's useful to expire an error
> metric
> > > if
> > > > > it's
> > > > > > > no
> > > > > > > > > > longer updated after some time? We did something similar
> > for
> > > > the
> > > > > > > quota
> > > > > > > > > > metric.
> > > > > > > > > >
> > > > > > > > > > 2. It's a bit weird to put the ZK latency metric under
> > > > > > > > > > type=SessionExpireListener.
> > > > > > > > > > Perhaps it's more intuitive to put it in a separate type.
> > > > > > > > > >
> > > > > > > > > > 3. For the client version metric, since we representing
> > > > commit_id
> > > > > > and
> > > > > > > > > > version as tags in the metric name. So the mbean will
> have
> > no
> > > > > > > > attributes?
> > > > > > > > > >
> > > > > > > > > > Jun
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover <
> > > > > > > roger.hoo...@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I think it would useful to make clear somewhere for
> each
> > > > > metric,
> > > > > > > the
> > > > > > > > > > level
> > > > > > > > > > > at which it's counted.  I don't know all the details of
> > the
> > > > > Kafka
> > > > > > > > > > protocol
> > > > > > > > > > > but it might be something like
> > > > > > > > > > >
> > > > > > > > > > > ProduceRequest, Fetch Request - counted at
> per-partition
> > > > level
> > > > > > > > > > > All other requests are 1:1 with client requests?
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > >
> > > > > > > > > > > Roger
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover <
> > > > > > > > roger.hoo...@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Rajini,
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for the KIP.  These are very helpful
> > additions.
> > > > > One
> > > > > > > > > question
> > > > > > > > > > > on
> > > > > > > > > > > > the error code metrics:
> > > > > > > > > > > >
> > > > > > > > > > > > Will the total error counting happen at the the level
> > of
> > > > > topic
> > > > > > > > > > partition?
> > > > > > > > > > > > For example, if a single ProduceRequest contains
> > messages
> > > > to
> > > > > > > append
> > > > > > > > > to
> > > > > > > > > > 3
> > > > > > > > > > > > partitions and say all 3 appends are successful, the
> > > > counter
> > > > > > > > > > > > for kafka.network:type=RequestMetrics,name=
> > > > > > ErrorsPerSec,request=
> > > > > > > > > > > ProduceRequest,error=0
> > > > > > > > > > > > will be incremented by 3?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > >
> > > > > > > > > > > > Roger
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini Sivaram <
> > > > > > > > > > > rajinisiva...@gmail.com>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> I have created a KIP to add some additional metrics
> to
> > > > > support
> > > > > > > > > health
> > > > > > > > > > > >> checks:
> > > > > > > > > > > >>
> > > > > > > > > > > >> https://cwiki.apache.org/
> > confluence/display/KAFKA/KIP-
> > > > > 188+-+
> > > > > > > > > > > >> Add+new+metrics+to+support+health+checks
> > > > > > > > > > > >>
> > > > > > > > > > > >> Feedback and suggestions are welcome.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Regards,
> > > > > > > > > > > >>
> > > > > > > > > > > >> Rajini
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

Reply via email to