Hi Milind, Thanks for your question. On reflection, I agree that INVALID_RECORD is most likely to be caused by a problem in the serialization in the client. I have changed the client action in this case to “Log an error and stop pushing metrics”.
I have updated the KIP text accordingly. Thanks, Andrew > On 31 Jul 2023, at 12:09, Milind Luthra <milut...@confluent.io.INVALID> wrote: > > Hi Andrew, > Thanks for the clarifications. > > About 2b: > In case a client has a bug while serializing, it might be difficult for the > client to recover from that without code changes. In that, it might be good > to just log the INVALID_RECORD as an error, and treat the error as fatal > for the client (only fatal in terms of sending the metrics, the client can > keep functioning otherwise). What do you think? > > Thanks > Milind > > On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield < > andrew_schofield_j...@outlook.com> wrote: > >> Hi Milind, >> Thanks for your questions about the KIP. >> >> 1) I did some archaeology and looked at historical versions of the KIP. I >> think this is >> just a mistake. 5 minutes is the default metric push interval. 30 minutes >> is a mystery >> to me. I’ve updated the KIP. >> >> 2) I think there are two situations in which INVALID_RECORD might occur. >> a) The client might perhaps be using a content-type that the broker does >> not support. >> The KIP mentions content-type as a future extension, but there’s only one >> supported >> to start with. Until we have multiple content-types, this seems out of >> scope. I think a >> future KIP would add another error code for this. >> b) The client might perhaps have a bug which means the metrics payload is >> malformed. >> Logging a warning and attempting the next metrics push on the push >> interval seems >> appropriate. >> >> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate >> GetTelemetrySubscriptionsRequest. >> >> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps >> a situation in which a broker sends a compression type in a >> GetTelemetrySubscriptionsResponse >> which is subsequently not supported when its used with a >> PushTelemetryRequest. >> We do want the client to have the opportunity to get an up-to-date list of >> supported >> compression types. I think an immediate GetTelemetrySubscriptionsRequest >> is appropriate. >> >> 3) If a client attempts a subsequent handshake with a Null >> ClientInstanceId, the >> receiving broker may not already know the client's existing >> ClientInstanceId. If the >> receiving broker knows the existing ClientInstanceId, it simply responds >> the existing >> value back to the client. If it does not know the existing >> ClientInstanceId, it will create >> a new client instance ID and respond with that. >> >> I will update the KIP with these clarifications. >> >> Thanks, >> Andrew >> >>> On 17 Jul 2023, at 14:21, Milind Luthra <milut...@confluent.io.INVALID> >> wrote: >>> >>> Hi Andrew, thanks for this KIP. >>> >>> I had a few questions regarding the "Error handling" section. >>> >>> 1. It mentions that "The 5 and 30 minute retries are to eventually >> trigger >>> a retry and avoid having to restart clients if the cluster metrics >>> configuration is disabled temporarily, e.g., by operator error, rolling >>> upgrades, etc." >>> But this 30 min interval isn't mentioned anywhere else. What is it >>> referring to? >>> >>> 2. For the actual errors: >>> INVALID_RECORD : The action required is to "Log a warning to the >>> application and schedule the next GetTelemetrySubscriptionsRequest to 5 >>> minutes". Why is this 5 minutes, and not something like PushIntervalMs? >> And >>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this >>> case, if the serialization is broken? >>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm, >>> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately >>> after the PushTelemetry response, correct? >>> >>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the >>> ClientInstanceId returned in the first response, regardless of broker": >>> Will a broker error be returned in case some implementation of this KIP >>> violates this accidentally and sends a request with ClientInstanceId = >> Null >>> even when it's been obtained already? Or will a new ClientInstanceId be >>> returned without an error? >>> >>> Thanks! >>> >>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield < >>> andrew_schofield_j...@outlook.com> wrote: >>> >>>> Hi, >>>> I would like to start a new discussion thread on KIP-714: Client metrics >>>> and observability. >>>> >>>> >>>> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability >>>> >>>> I have edited the proposal significantly to reduce the scope. The >> overall >>>> mechanism for client metric subscriptions is unchanged, but the >>>> KIP is now based on the existing client metrics, rather than introducing >>>> new metrics. The purpose remains helping cluster operators >>>> investigate performance problems experienced by clients without >> requiring >>>> changes to the client application code or configuration. >>>> >>>> Thanks, >>>> Andrew