Re: [DISCUSS] KIP-714: Client metrics and observability

Magnus Edenhill Tue, 31 May 2022 05:02:45 -0700

Den fre 20 maj 2022 kl 01:23 skrev Jun Rao <j...@confluent.io.invalid>:


> Hi, Magus,
>
> Thanks for the reply.
>
> 50. Sounds good.
>
> 51. I miss-understood the proposal in the KIP then. The proposal is to
> define a set of common metric names that every client should implement. The
> problem is that every client already has its own set of metrics with its
> own names. I am not sure that we could easily agree upon a common set of
> metrics that work with all clients. There are likely to be some metrics
> that are client specific. Translating between the common name and client
> specific name is probably going to add more confusion. As mentioned in the
> KIP, similar metrics from different clients could have subtle
> semantic differences. Could we just let each client use its own set of
> metric names?
>

We identified a common set of metrics that should be relevant for most
client implementations,
they're the ones listed in the KIP.
A supporting client does not have to implement all those metrics, only the
ones that makes sense
based on that client implementation, and a client may implement other
metrics that are not listed
in the KIP under its own namespace.
This approach has two benefits:
 - there will be a common set of metrics that most/all clients implement,
which makes monitoring
  and troubleshooting easier across fleets with multiple Kafka client
languages/implementations.
 - client-specific metrics are still possible, so if there is no suitable
standard metric a client can still
   provide what special metrics it has.


Thanks,
Magnus


On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <mag...@edenhill.se> wrote:
>
> > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao <j...@confluent.io.invalid>:
> >
> > > Hi, Magnus,
> > >
> >
> > Hi Jun
> >
> >
> > >
> > > Thanks for the updated KIP. Just a couple of more comments.
> > >
> > > 50. To troubleshoot a particular client issue, I imagine that the
> client
> > > needs to identify its client_instance_id. How does the client find this
> > > out? Do we plan to include client_instance_id in the client log, expose
> > it
> > > as a metric or something else?
> > >
> >
> > The KIP suggests that client implementations emit an informative log
> > message
> > with the assigned client-instance-id once it is retrieved (once per
> client
> > instance lifetime).
> > There's also a clientInstanceId() method that an application can use to
> > retrieve
> > the client instance id and emit through whatever side channels makes
> sense.
> >
> >
> >
> > > 51. The KIP lists a bunch of metrics that need to be collected at the
> > > client side. However, it seems quite a few useful java client metrics
> > like
> > > the following are missing.
> > >     buffer-total-bytes
> > >     buffer-available-bytes
> > >
> >
> > These are covered by client.producer.record.queue.bytes and
> > client.producer.record.queue.max.bytes.
> >
> >
> > >     bufferpool-wait-time
> > >
> >
> > Missing, but somewhat implementation specific.
> > If it was up to me we would add this later if there's a need.
> >
> >
> >
> > >     batch-size-avg
> > >     batch-size-max
> > >
> >
> > These are missing and would be suitably represented as a histogram. I'll
> > add them.
> >
> >
> >
> > >     io-wait-ratio
> > >     io-ratio
> > >
> >
> > There's client.io.wait.time which should cover io-wait-ratio.
> > We could add a client.io.time as well, now or in a later KIP.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Hi, Xavier,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 28. It does seem that we have started using KafkaMetrics on the
> broker
> > > > side. Then, my only concern is on the usage of Histogram in
> > KafkaMetrics.
> > > > Histogram in KafkaMetrics statically divides the value space into a
> > fixed
> > > > number of buckets and only returns values on the bucket boundary. So,
> > the
> > > > returned histogram value may never show up in a recorded value.
> Yammer
> > > > Histogram, on the other hand, uses reservoir sampling. The reported
> > value
> > > > is always one of the recorded values. So, I am not sure that
> Histogram
> > in
> > > > KafkaMetrics is as good as Yammer Histogram.
> > > ClientMetricsPluginExportTime
> > > > uses Histogram.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > <xav...@confluent.io.invalid>
> > > > wrote:
> > > >
> > > >> >
> > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> metrics
> > > >> that
> > > >> > depend on Kafka metric features (e.g., quota), we use the Kafka
> > > metric.
> > > >> > Yammer metrics have 4 types: gauge, meter, histogram and timer.
> > meter
> > > >> > calculates a rate, but also exposes an accumulated value.
> > > >> >
> > > >>
> > > >> I don't see a good reason we should limit ourselves to Yammer
> metrics
> > on
> > > >> the broker. KafkaMetrics was written
> > > >> to replace Yammer metrics and is used for all new components
> (clients,
> > > >> streams, connect, etc.)
> > > >> My understanding is that the original goal was to retire Yammer
> > metrics
> > > in
> > > >> the broker in favor of KafkaMetrics.
> > > >> We just haven't done so out of backwards compatibility concerns.
> > > >> There are other broker metrics such as group coordinator,
> transaction
> > > >> state
> > > >> manager, and various socket server metrics
> > > >> already using KafkaMetrics that don't need specific Kafka metric
> > > features,
> > > >> so I don't see why we should refrain from using
> > > >> Kafka metrics on the broker unless there are real compatibility
> > concerns
> > > >> or
> > > >> where implementation specifics could lead to confusion when
> comparing
> > > >> metrics using different implementations.
> > > >>
> > > >> In my opinion we should encourage people to use KafkaMetrics going
> > > forward
> > > >> on the broker as well, for two reasons:
> > > >> a) yammer metrics is long deprecated and no longer maintained
> > > >> b) yammer metrics are much less expressive
> > > >> c) we don't have a proper API to expose yammer metrics outside of
> JMX
> > > >> (MetricsReporter only exposes KafkaMetrics)
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to