Hi Magnus,

Thanks for the KIP.
1. Did you consider using a `default ClientTelemetryReceiver clientReceiver() { return null; }` method on the existing MetricsReporter interface, avoiding the need for the ClientTelemetry trait? (A rough sketch of what I mean follows at the end of this list.)

2. On the metrics naming and format, I wasn't really clear about what's being proposed. I assume we're taking a subset of the existing client metrics and representing them as OpenTelemetry metrics, but it didn't really explain how the existing metric names would be mapped to meter and instrument names. Or did I misunderstand?

3. In the client behaviour section it doesn't explicitly say whether the client uses a dedicated thread for this work (I assume it does).

4. The description of the FunctionalityNotEnabled error code suggests that PushTelemetryRequest would only be included in an ApiVersions response if the broker was configured with a plugin. I think the ApiVersionsResponse is normally a constant response (not dependent on broker config), so I wonder whether this is really a precedent we want to set here? Surely a broker without a plugin configured could just return an empty set of RequestedMetrics and a maxint NextPushMs in the PushTelemetryResponse?

5. Maybe the AcceptedContentTypes should be documented to be in priority order. That would simplify the action for UnsupportedCompressionType.

6. """As the client will not know the broker id of its bootstrap servers the broker_id label should be set to “bootstrap”.""" Maybe use the same convention as the NetworkClient, where bootstrap servers are given the negative of their index in the list as their id?

7. Maybe call it "client.process.rss.bytes" rather than "client.process.memory.bytes", to be explicit?

8. It's a little confusing that the --id option to kafka-client-metrics.sh can be a prefix or an exact match. Perhaps --id and --id-prefix would be clearer.

9. Maybe I missed it, but does the client continue to push metrics to the same broker as it randomly picked initially? If it gets disconnected from that broker, what happens? Does it just randomly pick another?

10. To subscribe to all metrics I assume I can just do `kafka-client-metrics.sh ... --metric ''`? It might be worth saying this explicitly. AFAICS this is the only way to find out all the metrics supported by a client if you don't already know from the client's software version.
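Regarding point 1, here is a rough sketch of the shape I have in mind. The interface names are the ones in the KIP draft; the existing MetricsReporter methods are elided and ClientTelemetryReceiver is stubbed out just so the fragment stands alone, so please treat this as illustrative only:

    // Sketch only: existing MetricsReporter members elided, receiver stubbed.
    interface ClientTelemetryReceiver {
        // exportMetrics(...) as described in the KIP; details elided here
    }

    interface MetricsReporter {
        // ... existing methods: configure(), init(), metricChange(),
        // metricRemoval(), close() ...

        /**
         * A reporter that wants pushed client telemetry overrides this; every
         * other reporter inherits the null default and needs no code change,
         * which avoids the separate ClientTelemetry marker interface.
         */
        default ClientTelemetryReceiver clientReceiver() {
            return null;
        }
    }

The broker-side plugin path could then simply call clientReceiver() on each configured reporter and skip the ones that return null.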
Kind regards,

Tom

On Fri, Jun 18, 2021 at 9:39 PM Travis Bischel <travis.bisc...@gmail.com> wrote:

> Hi Colin (and Magnus),
>
> Thanks for the replies!
>
> I think the biggest concern I have is the cardinality bits. I'm sympathetic to the aspect of this making it easier for Kafka brokers to understand *every* aspect of the Kafka ecosystem. I am not sure this will 100% solve the need there, though: if a client is unable to connect to a broker, visibility disappears immediately, no?
>
> I do still think that the problem of difficulty of monitoring within an organization results from issues within organizations themselves: orgs should have proper processes in place such that anything talking to Kafka has the org's plug-in monitoring libraries. Kafka operators can define those libraries, such that all clients in the org have the libraries the operators require. This satisfies the same goals this KIP aims to provide, albeit with the increased org cost of not just having something defined to be plugged in.
>
> If Kafka operators themselves can choose which metrics they want, so that the broker can tell the client "only send these metrics", then my biggest concern is removed.
>
> I do still think that hooks can be a cleaner abstraction to this same goal, and then pre-provided libraries (say, "this library provides X, Y, Z and sends to Prometheus from your client") could exist that more exactly satisfy what this KIP aims to provide. This would also avoid the kitchen-sink vs. not-comprehensive-enough issue I brought up previously. This would also avoid requiring KIPs for any supported metrics.
>
> On 2021/06/16 22:27:55, "Colin McCabe" <cmcc...@apache.org> wrote:
> > On Sun, Jun 13, 2021, at 21:51, Travis Bischel wrote:
> > > Hi! I have a few thoughts on this KIP. First, I'd like to thank you for the writeup; clearly a lot of thought has gone into it and it is very thorough. However, I'm not convinced it's the right approach from a fundamental level.
> > >
> > > Fundamentally, this KIP seems like somewhat of a solution to an organizational problem. Metrics are organizational concerns, not Kafka operator concerns.
> >
> > Hi Travis,
> >
> > Metrics are certainly Kafka operator concerns. It is very important for cluster operators to know things like how many clients there are, what the clients are doing, and so forth. This information is needed to administer Kafka. Therefore it certainly falls in the domain of the Kafka operations team (and the Kafka development team).
> >
> > We have added many metrics in the past to make it easier to monitor clients. I think this is just another step in that direction.
> >
> > > Clients should make it easy to plug in metrics (this is the approach I take in my own client), and organizations should have processes such that all clients gather and ship metrics how that organization desires.
> > >
> > > If an organization is set up correctly, there is no reason for metrics to be forwarded through Kafka. This feels like a solution to an organization not properly setting up how processes ship metrics, and in some ways, it's an overbroad solution, and in other ways, it doesn't cover the entire problem.
> >
> > I think the reason was explained pretty clearly: many admins find it difficult to set up monitoring for every client in the organization. In general the team which maintains a Kafka cluster is often separate from the teams that use the cluster. Therefore rolling out monitoring for clients can be very difficult to coordinate.
> >
> > No metrics will ever cover every possible use-case, but the set proposed here does seem useful.
> >
> > > From the perspective of Kafka operators, it is easy to see that this KIP is nice in that it just dictates what clients should support for metrics and that the metrics should ship through Kafka. But, from the perspective of an observability team, this workflow is basically hijacking the standard flow that organizations may have. I would rather have applications collect metrics and ship them the same way every other application does. I'd rather not have to configure additional plugins within Kafka to take metrics and forward them.
> >
> > This change doesn't remove any functionality. If you don't want to use KIP-714 metrics collection, you can simply turn it off and continue collecting metrics the way you always have.
> > > More importantly, this KIP prescribes cardinality problems, requires that to officially support the KIP a client must support all relevant metrics within the KIP, and requires that a client cannot support other metrics unless those other metrics also go through a KIP process. It is difficult to imagine all of these metrics being relevant to every organization, and there is no way for an organization to filter what is relevant within the client. Instead, the filtering is pushed downwards, meaning more network IO and more CPU costs to filter what is irrelevant and aggregate what needs to be aggregated, and more time for an organization to set up whatever it is that will be doing this filtering and aggregating. Contrast this with a client that enables hooking in to capture numbers that are relevant within an org itself: the org can gather what they want, ship only what they want, and ship directly to the observability system they have already set up. As an aside, it may also be wise to avoid shipping metrics through Kafka about client interaction with Kafka, because if Kafka is having problems, then orgs lose insight into those problems. This would be like statuspage using itself for status on its own systems.
> > >
> > > Another downside is that by dictating the important metrics, this KIP either has two choices: try to choose what is important to every org, and inevitably leave out something important to somebody else, or just add everything and let the orgs filter. This KIP mostly looks to go with the latter approach, meaning orgs will be shipping & filtering. With hooks, an org would be able to gather exactly what they want.
> >
> > I actually do agree with this criticism to some extent. It would be good if the broker could specify what metrics it wants, and the clients would send only those metrics.
> >
> > More generally, I'd like to see this split up into several RPCs rather than one mega-RPC.
> >
> > Maybe something like
> > 1. RegisterClient{Request,Response}
> > 2. ClientMetricsReport{Request,Response}
> > 3. UnregisterClient{Request,Response}
> >
> > Then the broker can communicate which metrics it wants in RegisterClientResponse. It can also assign a client instance ID (which I think should be a UUID, not another string).
> >
> > > As well, I expect that org applications have metrics on the state of the applications outside of the Kafka client. Applications are already sending non-Kafka-client related metrics outbound to observability systems. If a Kafka client provided hooks, then users could just gather the additional relevant Kafka client metrics and ship those metrics the same way they do all of their other metrics. It feels a bit odd for a Kafka client to have its own separate way of forwarding metrics. Another benefit of hooks in clients is that organizations do not _have_ to set up additional plugins to forward metrics from Kafka. Hooks avoid extra organizational work.
> >
> > Again, if you want to continue collecting metrics directly from clients, you can simply do that. Nothing has to change for you as a result of this KIP.
> > > The option that the KIP provides for users of clients to opt out of metrics may avoid some of the above issues (by just disabling things at the user level), but that's not really great from the perspective of client authors, because the existence of this KIP forces authors to either just not implement the KIP, or increase complexity within the KIP. Further, from an operator perspective, if I would prefer clients to ship metrics through the systems they already have in place, now I have to expect that anything that uses librdkafka or the official Java client will be shipping me metrics that I have to deal with (since the KIP is default enabled).
> >
> > It is clear that we want to avoid unnecessary complexity. However, the ability to easily gather metrics from clients without doing a lot of client-side configuration is not an ability that we currently have, and this KIP changes that. So I think it's very well worth it.
> >
> > > Lastly, I'm a little wary that this KIP may stem from a product goal of Confluent: since most everything uses librdkafka or the Java client, then by defaulting clients sending metrics, Confluent gets an easy way to provide metric panels for a nice cloud UI. If any client does not want to support these metrics, and then a user wonders why these hypothetical panels have no metrics, then Confluent can just reply "use a supported client". Even if this (potentially unlikely) scenario is true, then hooks would still be a great alternative, because then Confluent could provide drop-in hooks for any client and the end result of easy panels would be the same.
> >
> > In general, if a feature provides a benefit to users and operators, that's a reason to put it in, not a reason to leave it out. We want to make Kafka better, which includes making it better for vendors. There is nothing Confluent-specific in the proposal.
> >
> > > In summary,
> > >
> > > - Metrics are more of an organizational concern, not specifically a broker operator concern.
> > >
> > > - The proposal seems to hijack how metrics are gathered within organizations.
> > >
> > > - I don't think KIPs should dictate which metrics should be gathered and which should not. Clients instead should make it easy for users to gather anything they could be interested in, and ignore anything they are not.
> >
> > KIPs have always dictated which metrics are gathered and which are not. This is very intentionally part of the KIP process (since metrics are public API).
> >
> > best,
> > Colin
> >
> > > - I think hooks are more extensible, more exact, and fit better into organizational workflows.
> > >
> > > On 2021/06/02 12:45:45, Magnus Edenhill <mag...@edenhill.se> wrote:
> > > > Hey all,
> > > >
> > > > I'm proposing KIP-714 to add remote Client metrics and observability. This functionality will allow centralized monitoring and troubleshooting of clients and their internals.
> > > >
> > > > Please see
> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> > > >
> > > > Looking forward to your feedback!
> > > >
> > > > Regards,
> > > > Magnus
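For anyone skimming the quoted thread, the "hooks" alternative Travis describes is roughly the idea sketched below. Nothing here is an existing API in any Kafka client; the interface and method names are purely illustrative of a client exposing callbacks that an org-provided library could implement and wire into whatever observability pipeline the org already runs.

    import java.util.Map;

    // Hypothetical hook interface, illustrative only. The client would invoke
    // these callbacks internally; an org-supplied implementation decides what
    // to record and where to ship it.
    interface ClientMetricsHook {
        /** Called when a request/response round trip to a broker completes. */
        void onRequestComplete(String apiName, int brokerId, long latencyNanos, boolean failed);

        /** Called periodically with a snapshot of the client's internal counters. */
        void onMetricsSnapshot(Map<String, Long> counters);
    }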
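To make Colin's suggested mega-RPC split concrete, the three exchanges could look roughly like the sketch below. The field names are guesses on my part, not anything from the KIP's schemas; the point is only that the broker hands back a UUID client instance id and the metrics it wants at registration time, and subsequent reports carry that id.

    import java.util.List;
    import java.util.UUID;

    // Hypothetical request/response shapes, illustrative only.
    record RegisterClientRequest(String clientSoftwareName, String clientSoftwareVersion) {}

    record RegisterClientResponse(UUID clientInstanceId,         // assigned by the broker
                                  List<String> requestedMetrics, // broker-selected subset
                                  int pushIntervalMs) {}

    record ClientMetricsReportRequest(UUID clientInstanceId, byte[] metricsPayload) {}

    record ClientMetricsReportResponse(short errorCode) {}

    record UnregisterClientRequest(UUID clientInstanceId) {}

    record UnregisterClientResponse(short errorCode) {}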