Hey Jun,

I've clarified the scope of the standard metrics in the KIP, but basically:

 * We define a standard set of generic metrics that should be relevant to
most client implementations, e.g., each producer implementation probably
has some sort of per-partition message queue.
 * A client implementation should strive to implement as many of the
standard metrics as possible, but only the ones that make sense.
 * For metrics that are not in the standard set, a client maintainer can
choose to either submit a KIP to add additional standard metrics - if
they're generally relevant - or go ahead and add custom metrics that are
specific to that client implementation. These custom metrics will have a
prefix specific to that client implementation, as opposed to the standard
metric set that resides under "org.apache.kafka...", e.g.,
"se.edenhill.librdkafka" or whatever (see the illustration after this list).
 * Existing non-KIP-714 metrics should remain untouched. In some cases we
might be able to use the same meter, provided it is compatible with the
standard metric set definition; in other cases a semi-duplicate meter may
be needed. Thus this will not affect the metrics exposed through JMX, or
vice versa.
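
To illustrate the split (the names below are just hypothetical examples;
the authoritative metric definitions live in the KIP), a fleet-wide
dashboard might show:

  org.apache.kafka.client.producer.record.queue.bytes   <- standard metric, same name in every client
  se.edenhill.librdkafka.producer.batch.retries         <- custom metric, specific to librdkafka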

Thanks,
Magnus



On Wed, June 1, 2022 at 18:55, Jun Rao <j...@confluent.io.invalid> wrote:

> Hi, Magnus,
>
> 51. Just to clarify my question.  (1) Are standard metrics required for
> every client for this KIP to function?  (2) Are we converting existing java
> metrics to the standard metrics and deprecating the old ones? If so, could
> we list all existing java metrics that need to be renamed and the
> corresponding new names?
>
> Thanks,
>
> Jun
>
> On Tue, May 31, 2022 at 3:29 PM Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Magnus,
> >
> > Thanks for the reply.
> >
> > 51. I think it's fine to have a list of recommended metrics for every
> > client to implement. I am just not sure that standardizing on the metric
> > names across all clients is practical. The list of common metrics in the
> > KIP has completely different names from the java metric names. Some of
> > them have different types. For example, some of the common metrics have a
> > type of histogram, but the java client metrics don't use histogram in
> > general. Requiring the operator to translate those names and understand
> > the subtle differences across clients seems to cause more confusion
> > during troubleshooting.
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <mag...@edenhill.se>
> > wrote:
> >
> >> On Fri, May 20, 2022 at 01:23, Jun Rao <j...@confluent.io.invalid> wrote:
> >>
> >> > Hi, Magnus,
> >> >
> >> > Thanks for the reply.
> >> >
> >> > 50. Sounds good.
> >> >
> >> > 51. I misunderstood the proposal in the KIP then. The proposal is to
> >> > define a set of common metric names that every client should
> implement.
> >> The
> >> > problem is that every client already has its own set of metrics with
> its
> >> > own names. I am not sure that we could easily agree upon a common set
> of
> >> > metrics that work with all clients. There are likely to be some
> metrics
> >> > that are client specific. Translating between the common name and
> client
> >> > specific name is probably going to add more confusion. As mentioned in
> >> the
> >> > KIP, similar metrics from different clients could have subtle
> >> > semantic differences. Could we just let each client use its own set of
> >> > metric names?
> >> >
> >>
> >> We identified a common set of metrics that should be relevant for most
> >> client implementations; they're the ones listed in the KIP.
> >> A supporting client does not have to implement all those metrics, only
> >> the ones that make sense for that client implementation, and a client
> >> may implement other metrics that are not listed in the KIP under its
> >> own namespace.
> >> This approach has two benefits:
> >>  - there will be a common set of metrics that most/all clients
> >>    implement, which makes monitoring and troubleshooting easier across
> >>    fleets with multiple Kafka client languages/implementations.
> >>  - client-specific metrics are still possible, so if there is no
> >>    suitable standard metric a client can still provide whatever
> >>    specialized metrics it has.
> >>
> >>
> >> Thanks,
> >> Magnus
> >>
> >>
> >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <mag...@edenhill.se>
> >> wrote:
> >> >
> >> > > On Wed, May 18, 2022 at 19:57, Jun Rao <j...@confluent.io.invalid> wrote:
> >> > >
> >> > > > Hi, Magnus,
> >> > > >
> >> > >
> >> > > Hi Jun
> >> > >
> >> > >
> >> > > >
> >> > > > Thanks for the updated KIP. Just a couple of more comments.
> >> > > >
> >> > > > 50. To troubleshoot a particular client issue, I imagine that the
> >> > client
> >> > > > needs to identify its client_instance_id. How does the client find
> >> this
> >> > > > out? Do we plan to include client_instance_id in the client log,
> >> expose
> >> > > it
> >> > > > as a metric or something else?
> >> > > >
> >> > >
> >> > > The KIP suggests that client implementations emit an informative log
> >> > > message with the assigned client-instance-id once it is retrieved
> >> > > (once per client instance lifetime).
> >> > > There's also a clientInstanceId() method that an application can use
> >> > > to retrieve the client instance id and emit it through whatever side
> >> > > channel makes sense.
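> >> > >
> >> > > To make that concrete, a minimal sketch from the application side
> >> > > (assuming the clientInstanceId(Duration) accessor as currently
> >> > > proposed in the KIP; the exact signature may still change):
> >> > >
> >> > >   // Blocks for up to the timeout while the id is acquired from the
> >> > >   // broker, then returns the broker-assigned client instance id.
> >> > >   Uuid id = producer.clientInstanceId(Duration.ofSeconds(5));
> >> > >   // Emit it through whatever side channel suits the application,
> >> > >   // e.g. its own startup log, so operators can correlate it with
> >> > >   // broker-side telemetry.
> >> > >   log.info("Kafka producer client instance id: {}", id);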
> >> > >
> >> > >
> >> > >
> >> > > > 51. The KIP lists a bunch of metrics that need to be collected at
> >> the
> >> > > > client side. However, it seems quite a few useful java client
> >> metrics
> >> > > like
> >> > > > the following are missing.
> >> > > >     buffer-total-bytes
> >> > > >     buffer-available-bytes
> >> > > >
> >> > >
> >> > > These are covered by client.producer.record.queue.bytes and
> >> > > client.producer.record.queue.max.bytes.
> >> > >
> >> > >
> >> > > >     bufferpool-wait-time
> >> > > >
> >> > >
> >> > > Missing, but somewhat implementation-specific.
> >> > > If it were up to me, we would add this later if there's a need.
> >> > >
> >> > >
> >> > >
> >> > > >     batch-size-avg
> >> > > >     batch-size-max
> >> > > >
> >> > >
> >> > > These are missing and would be suitably represented as a histogram.
> >> I'll
> >> > > add them.
> >> > >
> >> > >
> >> > >
> >> > > >     io-wait-ratio
> >> > > >     io-ratio
> >> > > >
> >> > >
> >> > > There's client.io.wait.time which should cover io-wait-ratio.
> >> > > We could add a client.io.time as well, now or in a later KIP.
> >> > >
> >> > > Thanks,
> >> > > Magnus
> >> > >
> >> > >
> >> > >
> >> > >
> >> > > >
> >> > > > Thanks,
> >> > > >
> >> > > > Jun
> >> > > >
> >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <j...@confluent.io> wrote:
> >> > > >
> >> > > > > Hi, Xavier,
> >> > > > >
> >> > > > > Thanks for the reply.
> >> > > > >
> >> > > > > 28. It does seem that we have started using KafkaMetrics on the
> >> > > > > broker side. Then, my only concern is the usage of Histogram in
> >> > > > > KafkaMetrics. Histogram in KafkaMetrics statically divides the
> >> > > > > value space into a fixed number of buckets and only returns values
> >> > > > > on the bucket boundaries, so the reported histogram value may not
> >> > > > > match any actually recorded value. Yammer Histogram, on the other
> >> > > > > hand, uses reservoir sampling; the reported value is always one of
> >> > > > > the recorded values. So, I am not sure that Histogram in
> >> > > > > KafkaMetrics is as good as Yammer Histogram.
> >> > > > > ClientMetricsPluginExportTime uses Histogram.
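> >> > > > >
> >> > > > > (For reference, a rough sketch of how such a bucketed histogram is
> >> > > > > set up with KafkaMetrics today; the sensor name and parameters
> >> > > > > below are purely illustrative:)
> >> > > > >
> >> > > > >   Sensor sensor = metrics.sensor("example-latency");
> >> > > > >   // The value range [0, 30000] is divided into a fixed number of
> >> > > > >   // buckets derived from the memory budget (4000 bytes); reported
> >> > > > >   // percentiles land on bucket boundaries, not necessarily on
> >> > > > >   // recorded values.
> >> > > > >   sensor.add(new Percentiles(4000, 30000.0,
> >> > > > >       Percentiles.BucketSizing.LINEAR,
> >> > > > >       new Percentile(metrics.metricName("latency.p50", "example"), 50),
> >> > > > >       new Percentile(metrics.metricName("latency.p99", "example"), 99)));
> >> > > > >   sensor.record(123.0);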
> >> > > > >
> >> > > > > Thanks,
> >> > > > >
> >> > > > > Jun
> >> > > > >
> >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> >> > > > <xav...@confluent.io.invalid>
> >> > > > > wrote:
> >> > > > >
> >> > > > >> >
> >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only for
> >> > metrics
> >> > > > >> that
> >> > > > >> > depend on Kafka metric features (e.g., quota), we use the
> Kafka
> >> > > > metric.
> >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and
> timer.
> >> > > meter
> >> > > > >> > calculates a rate, but also exposes an accumulated value.
> >> > > > >> >
> >> > > > >>
> >> > > > >> I don't see a good reason we should limit ourselves to Yammer
> >> > metrics
> >> > > on
> >> > > > >> the broker. KafkaMetrics was written
> >> > > > >> to replace Yammer metrics and is used for all new components
> >> > (clients,
> >> > > > >> streams, connect, etc.)
> >> > > > >> My understanding is that the original goal was to retire Yammer
> >> > > metrics
> >> > > > in
> >> > > > >> the broker in favor of KafkaMetrics.
> >> > > > >> We just haven't done so out of backwards compatibility
> concerns.
> >> > > > >> There are other broker metrics such as group coordinator,
> >> > transaction
> >> > > > >> state
> >> > > > >> manager, and various socket server metrics
> >> > > > >> already using KafkaMetrics that don't need specific Kafka
> metric
> >> > > > features,
> >> > > > >> so I don't see why we should refrain from using
> >> > > > >> Kafka metrics on the broker unless there are real compatibility
> >> > > concerns
> >> > > > >> or
> >> > > > >> where implementation specifics could lead to confusion when
> >> > comparing
> >> > > > >> metrics using different implementations.
> >> > > > >>
> >> > > > >> In my opinion we should encourage people to use KafkaMetrics
> >> > > > >> going forward on the broker as well, for three reasons:
> >> > > > >> a) Yammer metrics are long deprecated and no longer maintained
> >> > > > >> b) Yammer metrics are much less expressive
> >> > > > >> c) we don't have a proper API to expose Yammer metrics outside of
> >> > > > >> JMX (MetricsReporter only exposes KafkaMetrics)
> >> > > > >>
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
