Re: [DISCUSS] KIP-714: Client metrics and observability

Andrew Schofield Mon, 31 Jul 2023 04:32:47 -0700

Hi Milind,
Thanks for your question.

On reflection, I agree that INVALID_RECORD is most likely to be caused by a
problem in the serialization in the client. I have changed the client action in 
this case
to “Log an error and stop pushing metrics”.


I have updated the KIP text accordingly.

Thanks,
Andrew

> On 31 Jul 2023, at 12:09, Milind Luthra <milut...@confluent.io.INVALID> wrote:
>
> Hi Andrew,
> Thanks for the clarifications.
>
> About 2b:
> In case a client has a bug while serializing, it might be difficult for the
> client to recover from that without code changes. In that, it might be good
> to just log the INVALID_RECORD as an error, and treat the error as fatal
> for the client (only fatal in terms of sending the metrics, the client can
> keep functioning otherwise). What do you think?
>
> Thanks
> Milind
>
> On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> andrew_schofield_j...@outlook.com> wrote:
>
>> Hi Milind,
>> Thanks for your questions about the KIP.
>>
>> 1) I did some archaeology and looked at historical versions of the KIP. I
>> think this is
>> just a mistake. 5 minutes is the default metric push interval. 30 minutes
>> is a mystery
>> to me. I’ve updated the KIP.
>>
>> 2) I think there are two situations in which INVALID_RECORD might occur.
>> a) The client might perhaps be using a content-type that the broker does
>> not support.
>> The KIP mentions content-type as a future extension, but there’s only one
>> supported
>> to start with. Until we have multiple content-types, this seems out of
>> scope. I think a
>> future KIP would add another error code for this.
>> b) The client might perhaps have a bug which means the metrics payload is
>> malformed.
>> Logging a warning and attempting the next metrics push on the push
>> interval seems
>> appropriate.
>>
>> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
>> GetTelemetrySubscriptionsRequest.
>>
>> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
>> a situation in which a broker sends a compression type in a
>> GetTelemetrySubscriptionsResponse
>> which is subsequently not supported when its used with a
>> PushTelemetryRequest.
>> We do want the client to have the opportunity to get an up-to-date list of
>> supported
>> compression types. I think an immediate GetTelemetrySubscriptionsRequest
>> is appropriate.
>>
>> 3) If a client attempts a subsequent handshake with a Null
>> ClientInstanceId, the
>> receiving broker may not already know the client's existing
>> ClientInstanceId. If the
>> receiving broker knows the existing ClientInstanceId, it simply responds
>> the existing
>> value back to the client. If it does not know the existing
>> ClientInstanceId, it will create
>> a new client instance ID and respond with that.
>>
>> I will update the KIP with these clarifications.
>>
>> Thanks,
>> Andrew
>>
>>> On 17 Jul 2023, at 14:21, Milind Luthra <milut...@confluent.io.INVALID>
>> wrote:
>>>
>>> Hi Andrew, thanks for this KIP.
>>>
>>> I had a few questions regarding the "Error handling" section.
>>>
>>> 1. It mentions that "The 5 and 30 minute retries are to eventually
>> trigger
>>> a retry and avoid having to restart clients if the cluster metrics
>>> configuration is disabled temporarily, e.g., by operator error, rolling
>>> upgrades, etc."
>>> But this 30 min interval isn't mentioned anywhere else. What is it
>>> referring to?
>>>
>>> 2. For the actual errors:
>>> INVALID_RECORD : The action required is to "Log a warning to the
>>> application and schedule the next GetTelemetrySubscriptionsRequest to 5
>>> minutes". Why is this 5 minutes, and not something like PushIntervalMs?
>> And
>>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
>>> case, if the serialization is broken?
>>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to confirm,
>>> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
>>> after the PushTelemetry response, correct?
>>>
>>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
>>> ClientInstanceId returned in the first response, regardless of broker":
>>> Will a broker error be returned in case some implementation of this KIP
>>> violates this accidentally and sends a request with ClientInstanceId =
>> Null
>>> even when it's been obtained already? Or will a new ClientInstanceId be
>>> returned without an error?
>>>
>>> Thanks!
>>>
>>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
>>> andrew_schofield_j...@outlook.com> wrote:
>>>
>>>> Hi,
>>>> I would like to start a new discussion thread on KIP-714: Client metrics
>>>> and observability.
>>>>
>>>>
>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
>>>>
>>>> I have edited the proposal significantly to reduce the scope. The
>> overall
>>>> mechanism for client metric subscriptions is unchanged, but the
>>>> KIP is now based on the existing client metrics, rather than introducing
>>>> new metrics. The purpose remains helping cluster operators
>>>> investigate performance problems experienced by clients without
>> requiring
>>>> changes to the client application code or configuration.
>>>>
>>>> Thanks,
>>>> Andrew

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to