Re: [DISCUSS] KIP-714: Client metrics and observability

Magnus Edenhill Wed, 16 Jun 2021 08:11:55 -0700

Thanks for your feedback, Travis!

I believe there are different audiences and uses for application (business
logic) and client (infrastructure) metrics. Kafka clients are part of the
infrastructure, not the business logic, and should be monitored as such by
the organization, sub-organization, or team that knows Kafka best and
already does Kafka monitoring: the Kafka operators.



So to be clear, this KIP does not cover application metrics, only Kafka
client metrics.
It in no way replaces or changes the way application metrics are collected;
application metrics are not relevant to the intended use.

An analogy from the telco space is CPE (customer premises equipment),
e.g. an ADSL router in the customer's home. The network owner (the
infrastructure operator) monitors the ADSL router's metrics for queue
pressure, latencies, error rates, etc., which allows the operator to
effectively troubleshoot customer issues, scale the network, and foresee
issues, all without any intervention by the end user.
This is what we want to achieve with this KIP: extending the monitoring
abilities of the infrastructure operator (here, the Kafka cluster operator)
to allow for end-to-end troubleshooting and observability.


The collection model in the KIP is subscription-based; no metrics are
collected by default.
Two things need to happen before anything is collected:
 - A metrics plugin needs to be configured on the brokers. This is a custom
   plugin that serves whatever needs the operator has for the metrics.
 - Client metric subscriptions need to be configured through the Kafka
   Admin API to select which metrics to collect. The subscription defines
   which metrics to collect and at what interval; this effectively puts
   filtering at the edge (the client) to spare central resources.
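
To illustrate the "filtering at the edge" idea, here is a rough sketch in
Python. It is not taken from the KIP: the Subscription class, the
select_metrics function, and the metric names are all made up for
illustration, and prefix matching is only assumed as one plausible way a
subscription could select metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """Hypothetical subscription handed to a client (illustrative names only)."""
    metric_prefixes: tuple  # collect metrics whose names start with any of these
    interval_ms: int        # how often the client is asked to push


def select_metrics(all_metrics, subscription):
    """Return only the metrics the subscription asks for.

    The filtering happens in the client, so unwanted metrics never
    leave the process.
    """
    if subscription is None:
        return {}  # no subscription: nothing is collected
    return {
        name: value
        for name, value in all_metrics.items()
        # str.startswith accepts a tuple of candidate prefixes
        if name.startswith(subscription.metric_prefixes)
    }


# Made-up metric names, purely for the example.
client_metrics = {
    "client.producer.queue.count": 12,
    "client.producer.request.rtt": 4.2,
    "client.consumer.poll.interval": 250,
}
sub = Subscription(metric_prefixes=("client.producer.",), interval_ms=60_000)
selected = select_metrics(client_metrics, sub)
print(selected)
```

With the producer-only subscription above, the consumer metric is dropped
in the client and is never sent anywhere.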

This functionality is thus opt-in on the cluster side and opt-out on the
client side, and great care is taken not to expose any sensitive
information in the metrics.
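
Put as a truth table, the opt-in/opt-out rule might look like the sketch
below. The function and parameter names are hypothetical and only restate
the conditions described in this mail; they are not configuration names
from the KIP.

```python
def should_push_metrics(broker_plugin_configured, subscription_exists,
                        client_opted_out):
    """Return True only if the cluster opted in and the client did not opt out."""
    # Cluster side is opt-in: both the broker plugin and a subscription
    # must be in place before anything is collected.
    cluster_opted_in = broker_plugin_configured and subscription_exists
    # Client side is opt-out: the client pushes unless it explicitly declines.
    return cluster_opted_in and not client_opted_out


# Default cluster (no plugin, no subscription): nothing is collected.
print(should_push_metrics(False, False, False))
# Cluster opted in, client left at its default: metrics are pushed.
print(should_push_metrics(True, True, False))
# Cluster opted in, but the client opted out: nothing is pushed.
print(should_push_metrics(True, True, True))
```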


As for what a supporting client needs to implement: it does not need to
implement all the defined metrics. Each client maintainer may choose
her own subset that makes sense for that given client implementation, and
it is fine to add metrics not listed in the KIP as long as they're in the
client's namespace.
But there's obviously value in having a shared set of common metrics that
all clients provide.
The goal is for all client implementations to support this.
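
A small sketch of the "subset plus own namespace" rule, again in Python.
The metric names, the namespace strings, and both helper functions are
invented for this example and are not the names defined in the KIP.

```python
# Hypothetical shared set of common metrics (placeholder names).
STANDARD_METRICS = {
    "standard.connection.count",
    "standard.request.rtt",
    "standard.request.errors",
}


def is_allowed(metric_name, client_namespace):
    """A metric is fine if it is a shared metric or lives in the client's namespace."""
    return (metric_name in STANDARD_METRICS
            or metric_name.startswith(client_namespace + "."))


def supported_standard_metrics(implemented):
    """Which of the shared metrics a given client chose to implement (a subset is OK)."""
    return sorted(STANDARD_METRICS & set(implemented))


# A made-up client that implements two shared metrics and one of its own.
implemented = [
    "standard.connection.count",
    "standard.request.rtt",
    "myclient.buffer.bytes",
]
print(all(is_allowed(m, "myclient") for m in implemented))
print(supported_standard_metrics(implemented))
# A metric in someone else's namespace would not be allowed for this client.
print(is_allowed("otherclient.buffer.bytes", "myclient"))
```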


Regards,
Magnus

On Mon, 14 Jun 2021 at 16:24, Travis Bischel <travis.bisc...@gmail.com> wrote:

> Hi! I have a few thoughts on this KIP. First, I'd like to thank you for
> the writeup; clearly a lot of thought has gone into it and it is very
> thorough. However, I'm not convinced it's the right approach at a
> fundamental level.
>
> Fundamentally, this KIP seems like somewhat of a solution to an
> organizational problem. Metrics are organizational concerns, not Kafka
> operator concerns. Clients should make it easy to plug in metrics (this is
> the approach I take in my own client), and organizations should have
> processes such that all clients gather and ship metrics how that
> organization desires. If an organization is set up correctly, there is no
> reason for metrics to be forwarded through Kafka. This feels like a
> solution to an organization not properly setting up how processes ship
> metrics, and in some ways, it's an overbroad solution, and in other ways,
> it doesn't cover the entire problem.
>
> From the perspective of Kafka operators, it is easy to see that this KIP is
> nice in that it just dictates what clients should support for metrics and
> that the metrics should ship through Kafka. But, from the perspective of an
> observability team, this workflow is basically hijacking the standard flow
> that organizations may have. I would rather have applications collect
> metrics and ship them the same way every other application does. I'd rather
> not have to configure additional plugins within Kafka to take metrics and
> forward them.
>
> More importantly, this KIP prescribes cardinality problems, requires that
> to officially support the KIP a client must support all relevant metrics
> within the KIP, and requires that a client cannot support other metrics
> unless those other metrics also go through a KIP process. It is difficult
> to imagine all of these metrics being relevant to every organization, and
> there is no way for an organization to filter what is relevant within the
> client. Instead, the filtering is pushed downwards, meaning more network IO
> and more CPU costs to filter what is irrelevant and aggregate what needs to
> be aggregated, and more time for an organization to set up whatever it is
> that will be doing this filtering and aggregating. Contrast this with a
> client that enables hooking in to capture numbers that are relevant within
> an org itself: the org can gather what they want, ship only what they want,
> and ship directly to the observability system they have already set up. As
> an aside, it may also be wise to avoid shipping metrics through Kafka about
> client interaction with Kafka, because if Kafka is having problems, then
> orgs lose insight into those problems. This would be like statuspage using
> itself for status on its own systems.
>
> Another downside is that by dictating the important metrics, this KIP has
> two choices: either try to choose what is important to every org, and
> inevitably leave out something important to somebody else, or just add
> everything and let the orgs filter. This KIP mostly looks to go with the
> latter approach, meaning orgs will be shipping & filtering. With hooks, an
> org would be able to gather exactly what they want.
>
> As well, I expect that org applications have metrics on the state of the
> applications outside of the Kafka client. Applications are already sending
> non-Kafka-client related metrics outbound to observability systems. If a
> Kafka client provided hooks, then users could just gather the additional
> relevant Kafka client metrics and ship those metrics the same way they do
> all of their other metrics. It feels a bit odd for a Kafka client to have
> its own separate way of forwarding metrics. Another benefit of hooks in
> clients is that organizations do not _have_ to set up additional plugins to
> forward metrics from Kafka. Hooks avoid extra organizational work.
>
> The option that the KIP provides for users of clients to opt out of
> metrics may avoid some of the above issues (by just disabling things at the
> user level), but that's not really great from the perspective of client
> authors, because the existence of this KIP forces authors to either just
> not implement the KIP, or increase complexity within the KIP. Further, from
> an operator perspective, if I would prefer clients to ship metrics through
> the systems they already have in place, now I have to expect that anything
> that uses librdkafka or the official Java client will be shipping me
> metrics that I have to deal with (since the KIP is default enabled).
>
> Lastly, I'm a little wary that this KIP may stem from a product goal of
> Confluent: since most everything uses librdkafka or the Java client, then
> by defaulting clients sending metrics, Confluent gets an easy way to
> provide metric panels for a nice cloud UI. If any client does not want to
> support these metrics, and then a user wonders why these hypothetical
> panels have no metrics, then Confluent can just reply "use a supported
> client". Even if this (potentially unlikely) scenario is true, then hooks
> would still be a great alternative, because then Confluent could provide
> drop-in hooks for any client and the end result of easy-panels would be the
> same.
>
> In summary,
>
> - Metrics are more of an organizational concern, not specifically a broker
>   operator concern.
>
> - The proposal seems to hijack how metrics are gathered within
>   organizations.
>
> - I don't think KIPs should dictate which metrics should be gathered and
>   which should not. Clients instead should make it easy for users to gather
>   anything they could be interested in, and ignore anything they are not.
>
> - I think hooks are more extensible, more exact, and fit better into
>   organizational workflows.
>
> On 2021/06/02 12:45:45, Magnus Edenhill <mag...@edenhill.se> wrote:
> > Hey all,
> >
> > I'm proposing KIP-714 to add remote Client metrics and observability.
> > This functionality will allow centralized monitoring and troubleshooting
> > of clients and their internals.
> >
> > Please see
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >
> > Looking forward to your feedback!
> >
> > Regards,
> > Magnus
> >
>
