Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

Jun Rao Thu, 31 Aug 2017 09:35:53 -0700

Hi, Rajini,

Thanks for the updated KIP. Should be also track the conversion time for
the producer? If so, the name can just be MessageConversionTimeMs and only
the produce and the fetch request may have such a component.


Jun

On Wed, Aug 30, 2017 at 1:37 PM, Rajini Sivaram <rajinisiva...@gmail.com>
wrote:

> Jun,
>
> Thank you, your suggestions sound good. I have updated the KIP.
>
> Regards,
>
> Rajini
>
> On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Rajini,
> >
> > Thanks for the updated KIP. I agree that those additional metrics can be
> > useful. I was thinking what would an admin do if the value of one of
> > those metrics is abnormal. An admin probably want to determine which
> client
> > causes the abnormaly. So,  a couple of more comments below.
> >
> > 10. About FetchDownConversionsMs. Should we model it at the topic level
> or
> > at the request level as a new latency component like
> messageConversionTime
> > in addition to the existing localTime, requestQueueTime, etc? The benefit
> > of the latter is that we can include it in the request log and use it to
> > figure out which client is doing the conversion. Also, should we also
> track
> > the conversion time in the producer?
> >
> > 11. About ProduceBatchSize. Currently, the largest chunk of memory is
> > allocated at the request level (mostly produce request). So, instead
> > of ProduceBatchSize
> > , perhaps we can add a request size metric for each type of request and
> > also include it in the request log. This way, if there is a memory issue,
> > we can trace it back to a particular client. Similarly, for
> > ProduceUncompressedBatchSize,
> > would it be better to track it at the request level as something like
> > temporaryMemorySize and include it in the request log?
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover <roger.hoo...@gmail.com>
> > wrote:
> >
> > > Great suggestions, Ismael and thanks for incorporating them, Rajini.
> > >
> > > Tracking authentication success and failures (#3) across listeners
> seems
> > > very useful for cluster administrators to identify misconfigured client
> > or
> > > bad actors, especially until all clients implement KIP-152 which will
> add
> > > an explicit error code for authentication failures.  Currently, clients
> > > just get disconnected so it's hard to distinguish authentication
> failures
> > > from any other error that can cause disconnect.  This broker-side
> metric
> > is
> > > useful regardless but can help fill this gap until all clients support
> > KIP
> > > 152.
> > >
> > > Just to be clear, the ones called `successful-authentication-rate` and
> > > `failed-authentication-rate` will also have failed-authentication-count
> > > and successful-authentication-count to match KIP 187?
> > >
> > > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram <
> rajinisiva...@gmail.com
> > >
> > > wrote:
> > >
> > > > Hi Ismael,
> > > >
> > > > Thank you for the suggestions. The additional metrics sound very
> useful
> > > and
> > > > I have added them to the KIP.
> > > >
> > > > Regards,
> > > >
> > > > Rajini
> > > >
> > > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma <ism...@juma.me.uk>
> > wrote:
> > > >
> > > > > Hi Rajini,
> > > > >
> > > > > There are a few other metrics that could potentially be useful. I'd
> > be
> > > > > interested in what you and the community thinks:
> > > > >
> > > > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which
> is
> > > > > useful. In the common case, one would want to avoid down conversion
> > by
> > > > > using the lower message format supported by most of the consumers.
> > > > However,
> > > > > there are good reasons to use a newer message format even if there
> > are
> > > > some
> > > > > legacy consumers around. It would be good to quantify the cost of
> > these
> > > > > consumers a bit more clearly. Looking at the request metric
> > > `LocalTimeMs`
> > > > > provides a hint, but it may be useful to have a dedicated
> > > > > `FetchDownConversionsMs` metric.
> > > > >
> > > > > 2. Large messages can cause GC issues (it's particularly bad if
> fetch
> > > > down
> > > > > conversion takes place). One can currently configure the max
> message
> > > > batch
> > > > > size per topic to keep this under control, but that is the size
> after
> > > > > compression. However, we decompress the batch to validate produce
> > > > requests
> > > > > and we decompress and recompress during fetch downconversion). It
> > would
> > > > be
> > > > > helpful to have topic metrics for the produce message batch size
> > after
> > > > > decompression (and perhaps compressed as well as that would help
> > > > understand
> > > > > the compression ratio).
> > > > >
> > > > > 3. Authentication success/failures per second. This is helpful to
> > > > > understand if some clients are misconfigured or if bad actors are
> > > trying
> > > > to
> > > > > authenticate.
> > > > >
> > > > > Thoughts?
> > > > >
> > > > > Ismael
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > > Hi, Rajini,
> > > > > >
> > > > > > Yes, if those error metrics are registered dynamically, we could
> > > worry
> > > > > > about expiration later.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram <
> > > > rajinisiva...@gmail.com
> > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Perhaps we could register dynamically for now. If we find that
> > the
> > > > cost
> > > > > > of
> > > > > > > retaining these is high, we can add the code to expire them
> > later.
> > > Is
> > > > > > that
> > > > > > > ok?
> > > > > > >
> > > > > > > Regards,
> > > > > > >
> > > > > > > Rajini
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma <
> ism...@juma.me.uk>
> > > > > wrote:
> > > > > > >
> > > > > > > > Can we quantify the cost of having these metrics around if
> they
> > > are
> > > > > > > > dynamically registered? Given that the maximum is bounded at
> > > > > > development
> > > > > > > > time, is it really worth all the extra code?
> > > > > > > >
> > > > > > > > Ismael
> > > > > > > >
> > > > > > > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram <
> > > > > > rajinisiva...@gmail.com
> > > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Jun,
> > > > > > > > >
> > > > > > > > > It feels more consistent to add errors as yammer metrics
> > > similar
> > > > to
> > > > > > > other
> > > > > > > > > request metrics. Perhaps we could add some code to track
> and
> > > > remove
> > > > > > > these
> > > > > > > > > if unused? It is a bit more work, but it would keep the
> > > externals
> > > > > > > > > consistent.
> > > > > > > > >
> > > > > > > > > Ismael/Manikumar,
> > > > > > > > >
> > > > > > > > > Agree that version as a String attribute makes more sense.
> > > > > > > Unfortunately,
> > > > > > > > > the whole KafkaMetric implementation is written around a
> > single
> > > > > > > "double"
> > > > > > > > > type, so introducing a new type is a big change. But I
> > suppose
> > > it
> > > > > can
> > > > > > > be
> > > > > > > > > done. I have updated the KIP.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Rajini
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar <
> > > > > > manikumar.re...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > I agree it will be good if we can add  "commit
> id/version"
> > as
> > > > an
> > > > > > > > > > attribute value.
> > > > > > > > > > It will be easy to parse. But as of now, KafkaMetric
> > supports
> > > > > only
> > > > > > > > > > numerical values.
> > > > > > > > > >
> > > > > > > > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma <
> > > > ism...@juma.me.uk>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi Rajini,
> > > > > > > > > > >
> > > > > > > > > > > About the gauges, I was thinking that the attribute
> would
> > > be
> > > > > the
> > > > > > > > value
> > > > > > > > > > > (i.e. commit id or version). I understand that Kafka
> > > Metrics
> > > > > > > doesn't
> > > > > > > > > > > support this (unlike Yammer Metrics), but would it make
> > > sense
> > > > > to
> > > > > > > add?
> > > > > > > > > > >
> > > > > > > > > > > Ismael
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram <
> > > > > > > > > rajinisiva...@gmail.com
> > > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Ismael,
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you for the review.
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Agree on keeping it simple with dynamic
> registration
> > > and
> > > > > no
> > > > > > > > > expiry.
> > > > > > > > > > > Will
> > > > > > > > > > > > wait for Jun's feedback before updating KIP.
> > > > > > > > > > > > 2. I have switched to two metrics for commit-id and
> > > version
> > > > > > (not
> > > > > > > > sure
> > > > > > > > > > if
> > > > > > > > > > > it
> > > > > > > > > > > > matches what you meant). I also added the client-id
> tag
> > > > which
> > > > > > is
> > > > > > > > used
> > > > > > > > > > in
> > > > > > > > > > > > all metrics from clients.
> > > > > > > > > > > >
> > > > > > > > > > > > Regards,
> > > > > > > > > > > >
> > > > > > > > > > > > Rajini
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma <
> > > > > > ism...@juma.me.uk
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for the KIP, Rajini. I think this is helpful
> > > too.
> > > > A
> > > > > > few
> > > > > > > > > minor
> > > > > > > > > > > > > comments.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. About the number of metrics and expiration, if
> we
> > > > > > > dynamically
> > > > > > > > > > > register
> > > > > > > > > > > > > metrics for the error codes, the number is likely
> to
> > be
> > > > > much
> > > > > > > > lower
> > > > > > > > > > than
> > > > > > > > > > > > > 30*30, probably less than 100. If we were using
> Kafka
> > > > > Metrics
> > > > > > > for
> > > > > > > > > > this,
> > > > > > > > > > > > we
> > > > > > > > > > > > > could easily add a long expiration period to be
> > > > > conservative,
> > > > > > > > but I
> > > > > > > > > > am
> > > > > > > > > > > > not
> > > > > > > > > > > > > sure this is supported by Yammer Metrics. If it is
> > not,
> > > > > > there's
> > > > > > > > an
> > > > > > > > > > > > argument
> > > > > > > > > > > > > for keeping it simple.
> > > > > > > > > > > > >
> > > > > > > > > > > > > 2. Would it make sense to use 2 gauges for the
> > version
> > > > and
> > > > > > > commit
> > > > > > > > > id?
> > > > > > > > > > > It
> > > > > > > > > > > > > seems more intuitive than having those values as
> > tags.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ismael
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram <
> > > > > > > > > > > > rajinisiva...@gmail.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Jun,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thank you for the review.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. Makes sense. I have updated the KIP.
> > > > > > > > > > > > > > 2. Moved to a new group ZooKeeperClient
> > > > > > > > > > > > > > 3. It is a gauge, so it will have a single
> > attribute
> > > > > called
> > > > > > > > Value
> > > > > > > > > > > with
> > > > > > > > > > > > a
> > > > > > > > > > > > > > constant value of 1.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Regards,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Rajini
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao <
> > > > > j...@confluent.io
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi, Rajini,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks for the KIP. A few comments.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. We have 30+ requests and 30+ error code and
> > > > growing.
> > > > > > So,
> > > > > > > > the
> > > > > > > > > > > > > > combination
> > > > > > > > > > > > > > > can be large. Perhaps it's useful to expire an
> > > error
> > > > > > metric
> > > > > > > > if
> > > > > > > > > > it's
> > > > > > > > > > > > no
> > > > > > > > > > > > > > > longer updated after some time? We did
> something
> > > > > similar
> > > > > > > for
> > > > > > > > > the
> > > > > > > > > > > > quota
> > > > > > > > > > > > > > > metric.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. It's a bit weird to put the ZK latency
> metric
> > > > under
> > > > > > > > > > > > > > > type=SessionExpireListener.
> > > > > > > > > > > > > > > Perhaps it's more intuitive to put it in a
> > separate
> > > > > type.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. For the client version metric, since we
> > > > representing
> > > > > > > > > commit_id
> > > > > > > > > > > and
> > > > > > > > > > > > > > > version as tags in the metric name. So the
> mbean
> > > will
> > > > > > have
> > > > > > > no
> > > > > > > > > > > > > attributes?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Jun
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover <
> > > > > > > > > > > > roger.hoo...@gmail.com>
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I think it would useful to make clear
> somewhere
> > > for
> > > > > > each
> > > > > > > > > > metric,
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > level
> > > > > > > > > > > > > > > > at which it's counted.  I don't know all the
> > > > details
> > > > > of
> > > > > > > the
> > > > > > > > > > Kafka
> > > > > > > > > > > > > > > protocol
> > > > > > > > > > > > > > > > but it might be something like
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ProduceRequest, Fetch Request - counted at
> > > > > > per-partition
> > > > > > > > > level
> > > > > > > > > > > > > > > > All other requests are 1:1 with client
> > requests?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Roger
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger
> Hoover <
> > > > > > > > > > > > > roger.hoo...@gmail.com>
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Rajini,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thank you for the KIP.  These are very
> > helpful
> > > > > > > additions.
> > > > > > > > > > One
> > > > > > > > > > > > > > question
> > > > > > > > > > > > > > > > on
> > > > > > > > > > > > > > > > > the error code metrics:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Will the total error counting happen at the
> > the
> > > > > level
> > > > > > > of
> > > > > > > > > > topic
> > > > > > > > > > > > > > > partition?
> > > > > > > > > > > > > > > > > For example, if a single ProduceRequest
> > > contains
> > > > > > > messages
> > > > > > > > > to
> > > > > > > > > > > > append
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > 3
> > > > > > > > > > > > > > > > > partitions and say all 3 appends are
> > > successful,
> > > > > the
> > > > > > > > > counter
> > > > > > > > > > > > > > > > > for kafka.network:type=
> RequestMetrics,name=
> > > > > > > > > > > ErrorsPerSec,request=
> > > > > > > > > > > > > > > > ProduceRequest,error=0
> > > > > > > > > > > > > > > > > will be incremented by 3?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Roger
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini
> > > Sivaram
> > > > <
> > > > > > > > > > > > > > > > rajinisiva...@gmail.com>
> > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >> I have created a KIP to add some
> additional
> > > > > metrics
> > > > > > to
> > > > > > > > > > support
> > > > > > > > > > > > > > health
> > > > > > > > > > > > > > > > >> checks:
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> https://cwiki.apache.org/
> > > > > > > confluence/display/KAFKA/KIP-
> > > > > > > > > > 188+-+
> > > > > > > > > > > > > > > > >> Add+new+metrics+to+support+health+checks
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Feedback and suggestions are welcome.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Regards,
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> Rajini
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-188 - Add new metrics to support health checks

Reply via email to