Re: [DISCUSS] KIP-714: Client metrics and observability

Kirk True Thu, 16 Jun 2022 11:59:48 -0700

Hi Jun,

I'll try to answer the questions posed...


On Tue, Jun 7, 2022, at 4:32 PM, Jun Rao wrote:
> Hi, Magnus,
> 
> Thanks for the reply.
> 
> So, the standard set of generic metrics is just a recommendation and not a
> requirement? This sounds good to me since it makes the adoption of the KIP
> easier.

I believe that was the intent, yes.

> Regarding the metric names, I have two concerns.

(I'm splitting these two up for readability...)

> (1) If a client already
> has an existing metric similar to the standard one, duplicating the metric
> seems to be confusing.

Agreed. I'm dealing with that situation as I write the Java client 
implementation.

The existing Java client exposes a set of metrics via JMX. The updated Java 
client will introduce a second set of metrics, which instead are exposed via 
sending them to the broker. There is substantial overlap with the two set of 
metrics and in a few places in the code under development, there are 
essentially two separate calls to update metrics: one for the JMX-bound metrics 
and one for the broker-bound metrics.

To be candid, I have gone back-and-forth on that design. From one perspective, 
it could be argued that the set of client metrics should be standardized across 
a given client, regardless of how those metrics are exposed for consumption. 
Another perspective is that these two sets of metrics serve different purposes 
and/or have different audiences, which argues that they should maintain their 
individuality and purpose. Your inputs/suggestions are certainly welcome! 

> (2) If a client needs to implement a standard metric
> that doesn't exist yet, using a naming convention (e.g., using dash vs dot)
> different from other existing metrics also seems a bit confusing. It seems
> that the main benefit of having standard metric names across clients is for
> better server side monitoring. Could we do the standardization in the
> plugin on the server?

I think the expectation is that the plugin implementation will perform 
transformation of metric names, if needed, to fit in with an organization's 
monitoring naming standards. Perhaps we need to call that out in the KIP itself.

Thanks,
Kirk

> 
> Thanks,
> 
> Jun
> 
> 
> 
> On Tue, Jun 7, 2022 at 6:53 AM Magnus Edenhill <[email protected]> wrote:
> 
> > Hey Jun,
> >
> > I've clarified the scope of the standard metrics in the KIP, but basically:
> >
> >  * We define a standard set of generic metrics that should be relevant to
> > most client implementations, e.g., each producer implementation probably
> > has some sort of per-partition message queue.
> >  * A client implementation should strive to implement as many of the
> > standard metrics as possible, but only the ones that make sense.
> >  * For metrics that are not in the standard set, a client maintainer can
> > choose to either submit a KIP to add additional standard metrics - if
> > they're relevant, or go ahead and add custom metrics that are specific to
> > that client implementation. These custom metrics will have a prefix
> > specific to that client implementation, as opposed to the standard metric
> > set that resides under "org.apache.kafka...". E.g.,
> > "se.edenhill.librdkafka" or whatever.
> >  * Existing non-KIP-714 metrics should remain untouched. In some cases we
> > might be able to use the same meter given it is compatible with the
> > standard metric set definition, in other cases a semi-duplicate meter may
> > be needed. Thus this will not affect the metrics exposed through JMX, or
> > vice versa.
> >
> > Thanks,
> > Magnus
> >
> >
> >
> > Den ons 1 juni 2022 kl 18:55 skrev Jun Rao <[email protected]>:
> >
> > > Hi, Magnus,
> > >
> > > 51. Just to clarify my question.  (1) Are standard metrics required for
> > > every client for this KIP to function?  (2) Are we converting existing
> > java
> > > metrics to the standard metrics and deprecating the old ones? If so,
> > could
> > > we list all existing java metrics that need to be renamed and the
> > > corresponding new name?
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Tue, May 31, 2022 at 3:29 PM Jun Rao <[email protected]> wrote:
> > >
> > > > Hi, Magnus,
> > > >
> > > > Thanks for the reply.
> > > >
> > > > 51. I think it's fine to have a list of recommended metrics for every
> > > > client to implement. I am just not sure that standardizing on the
> > metric
> > > > names across all clients is practical. The list of common metrics in
> > the
> > > > KIP have completely different names from the java metric names. Some of
> > > > them have different types. For example, some of the common metrics
> > have a
> > > > type of histogram, but the java client metrics don't use histogram in
> > > > general. Requiring the operator to translate those names and understand
> > > the
> > > > subtle differences across clients seem to cause more confusion during
> > > > troubleshooting.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Tue, May 31, 2022 at 5:02 AM Magnus Edenhill <[email protected]>
> > > > wrote:
> > > >
> > > >> Den fre 20 maj 2022 kl 01:23 skrev Jun Rao <[email protected]
> > >:
> > > >>
> > > >> > Hi, Magus,
> > > >> >
> > > >> > Thanks for the reply.
> > > >> >
> > > >> > 50. Sounds good.
> > > >> >
> > > >> > 51. I miss-understood the proposal in the KIP then. The proposal is
> > to
> > > >> > define a set of common metric names that every client should
> > > implement.
> > > >> The
> > > >> > problem is that every client already has its own set of metrics with
> > > its
> > > >> > own names. I am not sure that we could easily agree upon a common
> > set
> > > of
> > > >> > metrics that work with all clients. There are likely to be some
> > > metrics
> > > >> > that are client specific. Translating between the common name and
> > > client
> > > >> > specific name is probably going to add more confusion. As mentioned
> > in
> > > >> the
> > > >> > KIP, similar metrics from different clients could have subtle
> > > >> > semantic differences. Could we just let each client use its own set
> > of
> > > >> > metric names?
> > > >> >
> > > >>
> > > >> We identified a common set of metrics that should be relevant for most
> > > >> client implementations,
> > > >> they're the ones listed in the KIP.
> > > >> A supporting client does not have to implement all those metrics, only
> > > the
> > > >> ones that makes sense
> > > >> based on that client implementation, and a client may implement other
> > > >> metrics that are not listed
> > > >> in the KIP under its own namespace.
> > > >> This approach has two benefits:
> > > >>  - there will be a common set of metrics that most/all clients
> > > implement,
> > > >> which makes monitoring
> > > >>   and troubleshooting easier across fleets with multiple Kafka client
> > > >> languages/implementations.
> > > >>  - client-specific metrics are still possible, so if there is no
> > > suitable
> > > >> standard metric a client can still
> > > >>    provide what special metrics it has.
> > > >>
> > > >>
> > > >> Thanks,
> > > >> Magnus
> > > >>
> > > >>
> > > >> On Thu, May 19, 2022 at 10:39 AM Magnus Edenhill <[email protected]>
> > > >> wrote:
> > > >> >
> > > >> > > Den ons 18 maj 2022 kl 19:57 skrev Jun Rao
> > <[email protected]
> > > >> >:
> > > >> > >
> > > >> > > > Hi, Magnus,
> > > >> > > >
> > > >> > >
> > > >> > > Hi Jun
> > > >> > >
> > > >> > >
> > > >> > > >
> > > >> > > > Thanks for the updated KIP. Just a couple of more comments.
> > > >> > > >
> > > >> > > > 50. To troubleshoot a particular client issue, I imagine that
> > the
> > > >> > client
> > > >> > > > needs to identify its client_instance_id. How does the client
> > find
> > > >> this
> > > >> > > > out? Do we plan to include client_instance_id in the client log,
> > > >> expose
> > > >> > > it
> > > >> > > > as a metric or something else?
> > > >> > > >
> > > >> > >
> > > >> > > The KIP suggests that client implementations emit an informative
> > log
> > > >> > > message
> > > >> > > with the assigned client-instance-id once it is retrieved (once
> > per
> > > >> > client
> > > >> > > instance lifetime).
> > > >> > > There's also a clientInstanceId() method that an application can
> > use
> > > >> to
> > > >> > > retrieve
> > > >> > > the client instance id and emit through whatever side channels
> > makes
> > > >> > sense.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > > 51. The KIP lists a bunch of metrics that need to be collected
> > at
> > > >> the
> > > >> > > > client side. However, it seems quite a few useful java client
> > > >> metrics
> > > >> > > like
> > > >> > > > the following are missing.
> > > >> > > >     buffer-total-bytes
> > > >> > > >     buffer-available-bytes
> > > >> > > >
> > > >> > >
> > > >> > > These are covered by client.producer.record.queue.bytes and
> > > >> > > client.producer.record.queue.max.bytes.
> > > >> > >
> > > >> > >
> > > >> > > >     bufferpool-wait-time
> > > >> > > >
> > > >> > >
> > > >> > > Missing, but somewhat implementation specific.
> > > >> > > If it was up to me we would add this later if there's a need.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >     batch-size-avg
> > > >> > > >     batch-size-max
> > > >> > > >
> > > >> > >
> > > >> > > These are missing and would be suitably represented as a
> > histogram.
> > > >> I'll
> > > >> > > add them.
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >     io-wait-ratio
> > > >> > > >     io-ratio
> > > >> > > >
> > > >> > >
> > > >> > > There's client.io.wait.time which should cover io-wait-ratio.
> > > >> > > We could add a client.io.time as well, now or in a later KIP.
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Magnus
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > >
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Jun
> > > >> > > >
> > > >> > > > On Mon, Apr 4, 2022 at 10:01 AM Jun Rao <[email protected]>
> > wrote:
> > > >> > > >
> > > >> > > > > Hi, Xavier,
> > > >> > > > >
> > > >> > > > > Thanks for the reply.
> > > >> > > > >
> > > >> > > > > 28. It does seem that we have started using KafkaMetrics on
> > the
> > > >> > broker
> > > >> > > > > side. Then, my only concern is on the usage of Histogram in
> > > >> > > KafkaMetrics.
> > > >> > > > > Histogram in KafkaMetrics statically divides the value space
> > > into
> > > >> a
> > > >> > > fixed
> > > >> > > > > number of buckets and only returns values on the bucket
> > > boundary.
> > > >> So,
> > > >> > > the
> > > >> > > > > returned histogram value may never show up in a recorded
> > value.
> > > >> > Yammer
> > > >> > > > > Histogram, on the other hand, uses reservoir sampling. The
> > > >> reported
> > > >> > > value
> > > >> > > > > is always one of the recorded values. So, I am not sure that
> > > >> > Histogram
> > > >> > > in
> > > >> > > > > KafkaMetrics is as good as Yammer Histogram.
> > > >> > > > ClientMetricsPluginExportTime
> > > >> > > > > uses Histogram.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > >
> > > >> > > > > Jun
> > > >> > > > >
> > > >> > > > > On Thu, Mar 31, 2022 at 5:21 PM Xavier Léauté
> > > >> > > > <[email protected]>
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > >> >
> > > >> > > > >> > 28. On the broker, we typically use Yammer metrics. Only
> > for
> > > >> > metrics
> > > >> > > > >> that
> > > >> > > > >> > depend on Kafka metric features (e.g., quota), we use the
> > > Kafka
> > > >> > > > metric.
> > > >> > > > >> > Yammer metrics have 4 types: gauge, meter, histogram and
> > > timer.
> > > >> > > meter
> > > >> > > > >> > calculates a rate, but also exposes an accumulated value.
> > > >> > > > >> >
> > > >> > > > >>
> > > >> > > > >> I don't see a good reason we should limit ourselves to Yammer
> > > >> > metrics
> > > >> > > on
> > > >> > > > >> the broker. KafkaMetrics was written
> > > >> > > > >> to replace Yammer metrics and is used for all new components
> > > >> > (clients,
> > > >> > > > >> streams, connect, etc.)
> > > >> > > > >> My understanding is that the original goal was to retire
> > Yammer
> > > >> > > metrics
> > > >> > > > in
> > > >> > > > >> the broker in favor of KafkaMetrics.
> > > >> > > > >> We just haven't done so out of backwards compatibility
> > > concerns.
> > > >> > > > >> There are other broker metrics such as group coordinator,
> > > >> > transaction
> > > >> > > > >> state
> > > >> > > > >> manager, and various socket server metrics
> > > >> > > > >> already using KafkaMetrics that don't need specific Kafka
> > > metric
> > > >> > > > features,
> > > >> > > > >> so I don't see why we should refrain from using
> > > >> > > > >> Kafka metrics on the broker unless there are real
> > compatibility
> > > >> > > concerns
> > > >> > > > >> or
> > > >> > > > >> where implementation specifics could lead to confusion when
> > > >> > comparing
> > > >> > > > >> metrics using different implementations.
> > > >> > > > >>
> > > >> > > > >> In my opinion we should encourage people to use KafkaMetrics
> > > >> going
> > > >> > > > forward
> > > >> > > > >> on the broker as well, for two reasons:
> > > >> > > > >> a) yammer metrics is long deprecated and no longer maintained
> > > >> > > > >> b) yammer metrics are much less expressive
> > > >> > > > >> c) we don't have a proper API to expose yammer metrics
> > outside
> > > of
> > > >> > JMX
> > > >> > > > >> (MetricsReporter only exposes KafkaMetrics)
> > > >> > > > >>
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to