Hi David, Thanks for your interest in KIP-714. Because this KIP is under development at the same time as KIP-848, it will need to support both the existing KafkaConsumer code and the refactored code being worked on under KIP-848. I’ve updated the Threading section accordingly.
Thanks, Andrew > On 30 Sep 2023, at 01:45, David Jacot <dja...@confluent.io.INVALID> wrote: > > Hi Andrew, > > Thanks for driving this one. I haven't read all the KIP yet but I already > have an initial question. In the Threading section, it is written > "KafkaConsumer: the "background" thread (based on the consumer threading > refactor which is underway)". If I understand this correctly, it means > that KIP-714 won't work if the "old consumer" is used. Am I correct? > > Cheers, > David > > > On Fri, Sep 22, 2023 at 12:18 PM Andrew Schofield < > andrew_schofield_j...@outlook.com> wrote: > >> Hi Philip, >> No, I do not think it should actively search for a broker that supports >> the new >> RPCs. In general, either all of the brokers or none of the brokers will >> support it. >> In the window, where the cluster is being upgraded or client telemetry is >> being >> enabled, there might be a mixed situation. I wouldn’t put too much effort >> into >> this mixed scenario. As the client finds brokers which support the new >> RPCs, >> it can begin to follow the KIP-714 mechanism. >> >> Thanks, >> Andrew >> >>> On 22 Sep 2023, at 20:01, Philip Nee <philip...@gmail.com> wrote: >>> >>> Hi Andrew - >>> >>> Question on top of your answers: Do you think the client should actively >>> search for a broker that supports this RPC? As previously mentioned, the >>> broker uses the leastLoadedNode to find its first connection (am >>> I correct?), and what if that broker doesn't support the metric push? >>> >>> P >>> >>> On Fri, Sep 22, 2023 at 10:20 AM Andrew Schofield < >>> andrew_schofield_j...@outlook.com> wrote: >>> >>>> Hi Kirk, >>>> Thanks for your question. You are correct that the presence or absence >> of >>>> the new RPCs in the >>>> ApiVersionsResponse tells the client whether to request the telemetry >>>> subscriptions and push >>>> metrics. >>>> >>>> This is of course tricky in practice. It would be conceivable, as a >>>> cluster is upgraded to AK 3.7 >>>> or as a client metrics receiver plugin is deployed across the cluster, >>>> that a client connects to some >>>> brokers that support the new RPCs and some that do not. >>>> >>>> Here’s my suggestion: >>>> * If a client is not connected to any brokers that support in the new >>>> RPCs, it cannot push metrics. >>>> * If a client is only connected to brokers that support the new RPCs, it >>>> will use the new RPCs in >>>> accordance with the KIP. >>>> * If a client is connected to some brokers that support the new RPCs and >>>> some that do not, it will >>>> use the new RPCs with the supporting subset of brokers in accordance >> with >>>> the KIP. >>>> >>>> Comments? >>>> >>>> Thanks, >>>> Andrew >>>> >>>>> On 22 Sep 2023, at 16:01, Kirk True <k...@kirktrue.pro> wrote: >>>>> >>>>> Hi Andrew/Jun, >>>>> >>>>> I want to make sure I understand question/comment #119… In the case >>>> where a cluster without a metrics client receiver is later reconfigured >> and >>>> restarted to include a metrics client receiver, do we want the client to >>>> thereafter begin pushing metrics to the cluster? From Andrew’s response >> to >>>> question #119, it sounds like we’re using the presence/absence of the >>>> relevant RPCs in ApiVersionsResponse as the to-push-or-not-to-push >>>> indicator. Do I have that correct? >>>>> >>>>> Thanks, >>>>> Kirk >>>>> >>>>>> On Sep 21, 2023, at 7:42 AM, Andrew Schofield < >>>> andrew_schofield_j...@outlook.com> wrote: >>>>>> >>>>>> Hi Jun, >>>>>> Thanks for your comments. I’ve updated the KIP to clarify where >>>> necessary. >>>>>> >>>>>> 110. Yes, agree. The motivation section mentions this. >>>>>> >>>>>> 111. The replacement of ‘-‘ with ‘.’ for metric names and the >>>> replacement of >>>>>> ‘-‘ with ‘_’ for attribute keys is following the OTLP guidelines. I >>>> think it’s a bit >>>>>> of a debatable point. OTLP makes a distinction between a namespace >> and a >>>>>> multi-word component. If it was “client.id” then “client” would be a >>>> namespace with >>>>>> an attribute key “id”. But “client_id” is just a key. So, it was >>>> intentional, but debatable. >>>>>> >>>>>> 112. Thanks. The link target moved. Fixed. >>>>>> >>>>>> 113. Thanks. Fixed. >>>>>> >>>>>> 114.1. If a standard metric makes sense for a client, it should use >> the >>>> exact same >>>>>> name. If a standard metric doesn’t make sense for a client, then it >> can >>>> omit that metric. >>>>>> >>>>>> For a required metric, the situation is stronger. All clients must >>>> implement these >>>>>> metrics with these names in order to implement the KIP. But the >>>> required metrics >>>>>> are essentially the number of connections and the request latency, >>>> which do not >>>>>> reference the underlying implementation of the client (which >>>> producer.record.queue.time.max >>>>>> of course does). >>>>>> >>>>>> I suppose someone might build a producer-only client that didn’t have >>>> consumer metrics. >>>>>> In this case, the consumer metrics would conceptually have the value 0 >>>> and would not >>>>>> need to be sent to the broker. >>>>>> >>>>>> 114.2. If a client does not implement some metrics, they will not be >>>> available for >>>>>> analysis and troubleshooting. It just makes the ability to combine >>>> metrics from lots >>>>>> different clients less complete. >>>>>> >>>>>> 115. I think it was probably a mistake to be so specific about >>>> threading in this KIP. >>>>>> When the consumer threading refactor is complete, of course, it would >>>> do the appropriate >>>>>> equivalent. I’ve added a clarification and massively simplified this >>>> section. >>>>>> >>>>>> 116. I removed “client.terminating”. >>>>>> >>>>>> 117. Yes. Horrid. Fixed. >>>>>> >>>>>> 118. The Terminating flag just indicates that this is the final >>>> PushTelemetryRequest >>>>>> from this client. Any subsequent request will be rejected. I think >> this >>>> flag should remain. >>>>>> >>>>>> 119. Good catch. This was actually contradicting another part of the >>>> KIP. The current behaviour >>>>>> is indeed preserved. If the broker doesn’t have a client metrics >>>> receiver plugin, the new RPCs >>>>>> in this KIP are “turned off” and not reported in ApiVersionsResponse. >>>> The client will not >>>>>> attempt to push metrics. >>>>>> >>>>>> 120. The error handling table lists the error codes for >>>> PushTelemetryResponse. I’ve added one >>>>>> but it looked good to me. GetTelemetrySubscriptions doesn’t have any >>>> error codes, since the >>>>>> situation in which the client telemetry is not supported is handled by >>>> the RPCs not being offered >>>>>> by the broker. >>>>>> >>>>>> 121. Again, I think it’s probably a mistake to be specific about >>>> threading. Removed. >>>>>> >>>>>> 122. Good catch. For DescribeConfigs, the ACL operation should be >>>>>> “DESCRIBE_CONFIGS”. For AlterConfigs, the ACL operation should be >>>>>> “ALTER” (not “WRITE” as it said). The checks are made on the CLUSTER >>>>>> resource. >>>>>> >>>>>> Thanks for the detailed review. >>>>>> >>>>>> Thanks, >>>>>> Andrew >>>>>> >>>>>>> >>>>>>> 110. Another potential motivation is the multiple clients support. >>>> Some of >>>>>>> the places may not have good monitoring support for non-java clients. >>>>>>> >>>>>>> 111. OpenTelemetry Naming: We replace '-' with '.' for metric name >> and >>>>>>> replace '-' with '_' for attributes. Why is the inconsistency? >>>>>>> >>>>>>> 112. OTLP specification: Page is not found from the link. >>>>>>> >>>>>>> 113. "Defining standard and required metrics makes the monitoring and >>>>>>> troubleshooting of clients from various client types ": Incomplete >>>> sentence. >>>>>>> >>>>>>> 114. standard/required metrics >>>>>>> 114.1 Do other clients need to implement those metrics with the exact >>>> same >>>>>>> names? >>>>>>> 114.2 What happens if some of those metrics are missing from a >> client? >>>>>>> >>>>>>> 115. "KafkaConsumer: both the "heart beat" and application threads": >> We >>>>>>> have an ongoing effort to refactor the consumer threading model ( >>>>>>> >>>> >> https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design >>>> ). >>>>>>> Once this is done, PRC requests will only be made from the background >>>>>>> thread. Should this KIP follow the new model only? >>>>>>> >>>>>>> 116. 'The metrics should contain the reason for the client >> termination >>>> by >>>>>>> including the client.terminating metric with the label “reason” ...'. >>>> Hmm, >>>>>>> are we introducing a new metric client.terminating? If so, that needs >>>> to be >>>>>>> explicitly listed. >>>>>>> >>>>>>> 117. "As the metrics plugin may need to add additional metrics on top >>>> of >>>>>>> this the generic metrics receiver in the broker will not add these >>>> labels >>>>>>> but rely on the plugins to do so," The sentence doesn't read well. >>>>>>> >>>>>>> 118. "it is possible for the client to send at most one accepted >>>>>>> out-of-profile per connection before the rate-limiter kicks in": If >> we >>>> do >>>>>>> this, do we still need the Terminating flag in >> PushTelemetryRequestV0? >>>>>>> >>>>>>> 119. "If there is no client metrics receiver plugin configured on the >>>>>>> broker, it will respond to GetTelemetrySubscriptionsRequest with >>>>>>> RequestedMetrics set to Null and a -1 SubscriptionId. The client >> should >>>>>>> send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs >>>> has >>>>>>> expired. This allows the metrics receiver to be enabled or disabled >>>> without >>>>>>> having to restart the broker or reset the client connection." >>>>>>> "no client metrics receiver plugin configured" is defined by no >> metric >>>>>>> reporter implementing the ClientTelemetry interface, right? In that >>>> case, >>>>>>> it would be useful to avoid the clients sending >>>>>>> GetTelemetrySubscriptionsRequest periodically to preserve the current >>>>>>> behavior. >>>>>>> >>>>>>> 120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: >>>> Could >>>>>>> we list error codes for each? >>>>>>> >>>>>>> 121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may >>>> be >>>>>>> called from the request handling thread": Where else can this method >> be >>>>>>> called? >>>>>>> >>>>>>> 122. DescribeConfigs/AlterConfigs already exist. Are we changing the >>>> ACL? >>>>>>> >>>>>>> Thanks, >>>>>>> >>>>>>> Jun >>>>>>> >>>>>>> On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield < >>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>> >>>>>>>> Hi Milind, >>>>>>>> Thanks for your question. >>>>>>>> >>>>>>>> On reflection, I agree that INVALID_RECORD is most likely to be >>>> caused by a >>>>>>>> problem in the serialization in the client. I have changed the >> client >>>>>>>> action in this case >>>>>>>> to “Log an error and stop pushing metrics”. >>>>>>>> >>>>>>>> I have updated the KIP text accordingly. >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Andrew >>>>>>>> >>>>>>>>> On 31 Jul 2023, at 12:09, Milind Luthra >>>> <milut...@confluent.io.INVALID> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Andrew, >>>>>>>>> Thanks for the clarifications. >>>>>>>>> >>>>>>>>> About 2b: >>>>>>>>> In case a client has a bug while serializing, it might be difficult >>>> for >>>>>>>> the >>>>>>>>> client to recover from that without code changes. In that, it might >>>> be >>>>>>>> good >>>>>>>>> to just log the INVALID_RECORD as an error, and treat the error as >>>> fatal >>>>>>>>> for the client (only fatal in terms of sending the metrics, the >>>> client >>>>>>>> can >>>>>>>>> keep functioning otherwise). What do you think? >>>>>>>>> >>>>>>>>> Thanks >>>>>>>>> Milind >>>>>>>>> >>>>>>>>> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < >>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Milind, >>>>>>>>>> Thanks for your questions about the KIP. >>>>>>>>>> >>>>>>>>>> 1) I did some archaeology and looked at historical versions of the >>>> KIP. >>>>>>>> I >>>>>>>>>> think this is >>>>>>>>>> just a mistake. 5 minutes is the default metric push interval. 30 >>>>>>>> minutes >>>>>>>>>> is a mystery >>>>>>>>>> to me. I’ve updated the KIP. >>>>>>>>>> >>>>>>>>>> 2) I think there are two situations in which INVALID_RECORD might >>>> occur. >>>>>>>>>> a) The client might perhaps be using a content-type that the >> broker >>>> does >>>>>>>>>> not support. >>>>>>>>>> The KIP mentions content-type as a future extension, but there’s >>>> only >>>>>>>> one >>>>>>>>>> supported >>>>>>>>>> to start with. Until we have multiple content-types, this seems >> out >>>> of >>>>>>>>>> scope. I think a >>>>>>>>>> future KIP would add another error code for this. >>>>>>>>>> b) The client might perhaps have a bug which means the metrics >>>> payload >>>>>>>> is >>>>>>>>>> malformed. >>>>>>>>>> Logging a warning and attempting the next metrics push on the push >>>>>>>>>> interval seems >>>>>>>>>> appropriate. >>>>>>>>>> >>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an >>>> immediate >>>>>>>>>> GetTelemetrySubscriptionsRequest. >>>>>>>>>> >>>>>>>>>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or >>>> perhaps >>>>>>>>>> a situation in which a broker sends a compression type in a >>>>>>>>>> GetTelemetrySubscriptionsResponse >>>>>>>>>> which is subsequently not supported when its used with a >>>>>>>>>> PushTelemetryRequest. >>>>>>>>>> We do want the client to have the opportunity to get an up-to-date >>>> list >>>>>>>> of >>>>>>>>>> supported >>>>>>>>>> compression types. I think an immediate >>>> GetTelemetrySubscriptionsRequest >>>>>>>>>> is appropriate. >>>>>>>>>> >>>>>>>>>> 3) If a client attempts a subsequent handshake with a Null >>>>>>>>>> ClientInstanceId, the >>>>>>>>>> receiving broker may not already know the client's existing >>>>>>>>>> ClientInstanceId. If the >>>>>>>>>> receiving broker knows the existing ClientInstanceId, it simply >>>> responds >>>>>>>>>> the existing >>>>>>>>>> value back to the client. If it does not know the existing >>>>>>>>>> ClientInstanceId, it will create >>>>>>>>>> a new client instance ID and respond with that. >>>>>>>>>> >>>>>>>>>> I will update the KIP with these clarifications. >>>>>>>>>> >>>>>>>>>> Thanks, >>>>>>>>>> Andrew >>>>>>>>>> >>>>>>>>>>> On 17 Jul 2023, at 14:21, Milind Luthra >>>> <milut...@confluent.io.INVALID >>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Andrew, thanks for this KIP. >>>>>>>>>>> >>>>>>>>>>> I had a few questions regarding the "Error handling" section. >>>>>>>>>>> >>>>>>>>>>> 1. It mentions that "The 5 and 30 minute retries are to >> eventually >>>>>>>>>> trigger >>>>>>>>>>> a retry and avoid having to restart clients if the cluster >> metrics >>>>>>>>>>> configuration is disabled temporarily, e.g., by operator error, >>>> rolling >>>>>>>>>>> upgrades, etc." >>>>>>>>>>> But this 30 min interval isn't mentioned anywhere else. What is >> it >>>>>>>>>>> referring to? >>>>>>>>>>> >>>>>>>>>>> 2. For the actual errors: >>>>>>>>>>> INVALID_RECORD : The action required is to "Log a warning to the >>>>>>>>>>> application and schedule the next >> GetTelemetrySubscriptionsRequest >>>> to 5 >>>>>>>>>>> minutes". Why is this 5 minutes, and not something like >>>> PushIntervalMs? >>>>>>>>>> And >>>>>>>>>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in >>>> this >>>>>>>>>>> case, if the serialization is broken? >>>>>>>>>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to >>>>>>>> confirm, >>>>>>>>>>> the GetTelemetrySubscriptionsRequest needs to be scheduled >>>> immediately >>>>>>>>>>> after the PushTelemetry response, correct? >>>>>>>>>>> >>>>>>>>>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include >>>> the >>>>>>>>>>> ClientInstanceId returned in the first response, regardless of >>>> broker": >>>>>>>>>>> Will a broker error be returned in case some implementation of >>>> this KIP >>>>>>>>>>> violates this accidentally and sends a request with >>>> ClientInstanceId = >>>>>>>>>> Null >>>>>>>>>>> even when it's been obtained already? Or will a new >>>> ClientInstanceId be >>>>>>>>>>> returned without an error? >>>>>>>>>>> >>>>>>>>>>> Thanks! >>>>>>>>>>> >>>>>>>>>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < >>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> I would like to start a new discussion thread on KIP-714: Client >>>>>>>> metrics >>>>>>>>>>>> and observability. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability >>>>>>>>>>>> >>>>>>>>>>>> I have edited the proposal significantly to reduce the scope. >> The >>>>>>>>>> overall >>>>>>>>>>>> mechanism for client metric subscriptions is unchanged, but the >>>>>>>>>>>> KIP is now based on the existing client metrics, rather than >>>>>>>> introducing >>>>>>>>>>>> new metrics. The purpose remains helping cluster operators >>>>>>>>>>>> investigate performance problems experienced by clients without >>>>>>>>>> requiring >>>>>>>>>>>> changes to the client application code or configuration. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> Andrew >> >> >>