Re: [DISCUSS] KIP-714: Client metrics and observability

Jun Rao Wed, 20 Sep 2023 12:17:23 -0700

Hi, Andrew,

Thanks for updating the KIP and sorry for the late response. A few more
comments.

110. Another potential motivation is the multiple clients support. Some of
the places may not have good monitoring support for non-java clients.

111. OpenTelemetry Naming: We replace '-' with '.' for metric name and
replace '-' with '_' for attributes. Why is the inconsistency?

112. OTLP specification: Page is not found from the link.

113. "Defining standard and required metrics makes the monitoring and
troubleshooting of clients from various client types ": Incomplete sentence.

114. standard/required metrics
114.1 Do other clients need to implement those metrics with the exact same
names?
114.2 What happens if some of those metrics are missing from a client?

115. "KafkaConsumer: both the "heart beat" and application threads": We
have an ongoing effort to refactor the consumer threading model (
https://cwiki.apache.org/confluence/display/KAFKA/Consumer+threading+refactor+design).
Once this is done, PRC requests will only be made from the background
thread. Should this KIP follow the new model only?

116. 'The metrics should contain the reason for the client termination by
including the client.terminating metric with the label “reason” ...'. Hmm,
are we introducing a new metric client.terminating? If so, that needs to be
explicitly listed.

117. "As the metrics plugin may need to add additional metrics on top of
this the generic metrics receiver in the broker will not add these labels
but rely on the plugins to do so," The sentence doesn't read well.

118. "it is possible for the client to send at most one accepted
out-of-profile per connection before the rate-limiter kicks in": If we do
this, do we still need the Terminating flag in PushTelemetryRequestV0?

119. "If there is no client metrics receiver plugin configured on the
broker, it will respond to GetTelemetrySubscriptionsRequest with
RequestedMetrics set to Null and a -1 SubscriptionId. The client should
send a new GetTelemetrySubscriptionsRequest after the PushIntervalMs has
expired. This allows the metrics receiver to be enabled or disabled without
having to restart the broker or reset the client connection."
"no client metrics receiver plugin configured" is defined by no metric
reporter implementing the ClientTelemetry interface, right? In that case,
it would be useful to avoid the clients sending
GetTelemetrySubscriptionsRequest periodically to preserve the current
behavior.

120. GetTelemetrySubscriptionsResponseV0 and PushTelemetryRequestV0: Could
we list error codes for each?

121. "ClientTelemetryReceiver.ClientTelemetryReceiver This method may be
called from the request handling thread": Where else can this method be
called?

122. DescribeConfigs/AlterConfigs already exist. Are we changing the ACL?

Thanks,

Jun

On Mon, Jul 31, 2023 at 4:33 AM Andrew Schofield <
andrew_schofield_j...@outlook.com> wrote:

> Hi Milind,
> Thanks for your question.
>
> On reflection, I agree that INVALID_RECORD is most likely to be caused by a
> problem in the serialization in the client. I have changed the client
> action in this case
> to “Log an error and stop pushing metrics”.
>
> I have updated the KIP text accordingly.
>
> Thanks,
> Andrew
>
> > On 31 Jul 2023, at 12:09, Milind Luthra <milut...@confluent.io.INVALID>
> wrote:
> >
> > Hi Andrew,
> > Thanks for the clarifications.
> >
> > About 2b:
> > In case a client has a bug while serializing, it might be difficult for
> the
> > client to recover from that without code changes. In that, it might be
> good
> > to just log the INVALID_RECORD as an error, and treat the error as fatal
> > for the client (only fatal in terms of sending the metrics, the client
> can
> > keep functioning otherwise). What do you think?
> >
> > Thanks
> > Milind
> >
> > On Mon, Jul 24, 2023 at 8:18 PM Andrew Schofield <
> > andrew_schofield_j...@outlook.com> wrote:
> >
> >> Hi Milind,
> >> Thanks for your questions about the KIP.
> >>
> >> 1) I did some archaeology and looked at historical versions of the KIP.
> I
> >> think this is
> >> just a mistake. 5 minutes is the default metric push interval. 30
> minutes
> >> is a mystery
> >> to me. I’ve updated the KIP.
> >>
> >> 2) I think there are two situations in which INVALID_RECORD might occur.
> >> a) The client might perhaps be using a content-type that the broker does
> >> not support.
> >> The KIP mentions content-type as a future extension, but there’s only
> one
> >> supported
> >> to start with. Until we have multiple content-types, this seems out of
> >> scope. I think a
> >> future KIP would add another error code for this.
> >> b) The client might perhaps have a bug which means the metrics payload
> is
> >> malformed.
> >> Logging a warning and attempting the next metrics push on the push
> >> interval seems
> >> appropriate.
> >>
> >> UNKNOWN_SUBSCRIPTION_ID would indeed be handled by making an immediate
> >> GetTelemetrySubscriptionsRequest.
> >>
> >> UNSUPPORTED_COMPRESSION_TYPE seems like either a client bug or perhaps
> >> a situation in which a broker sends a compression type in a
> >> GetTelemetrySubscriptionsResponse
> >> which is subsequently not supported when its used with a
> >> PushTelemetryRequest.
> >> We do want the client to have the opportunity to get an up-to-date list
> of
> >> supported
> >> compression types. I think an immediate GetTelemetrySubscriptionsRequest
> >> is appropriate.
> >>
> >> 3) If a client attempts a subsequent handshake with a Null
> >> ClientInstanceId, the
> >> receiving broker may not already know the client's existing
> >> ClientInstanceId. If the
> >> receiving broker knows the existing ClientInstanceId, it simply responds
> >> the existing
> >> value back to the client. If it does not know the existing
> >> ClientInstanceId, it will create
> >> a new client instance ID and respond with that.
> >>
> >> I will update the KIP with these clarifications.
> >>
> >> Thanks,
> >> Andrew
> >>
> >>> On 17 Jul 2023, at 14:21, Milind Luthra <milut...@confluent.io.INVALID
> >
> >> wrote:
> >>>
> >>> Hi Andrew, thanks for this KIP.
> >>>
> >>> I had a few questions regarding the "Error handling" section.
> >>>
> >>> 1. It mentions that "The 5 and 30 minute retries are to eventually
> >> trigger
> >>> a retry and avoid having to restart clients if the cluster metrics
> >>> configuration is disabled temporarily, e.g., by operator error, rolling
> >>> upgrades, etc."
> >>> But this 30 min interval isn't mentioned anywhere else. What is it
> >>> referring to?
> >>>
> >>> 2. For the actual errors:
> >>> INVALID_RECORD : The action required is to "Log a warning to the
> >>> application and schedule the next GetTelemetrySubscriptionsRequest to 5
> >>> minutes". Why is this 5 minutes, and not something like PushIntervalMs?
> >> And
> >>> also, why are we scheduling a GetTelemetrySubscriptionsRequest in this
> >>> case, if the serialization is broken?
> >>> UNKNOWN_SUBSCRIPTION_ID , UNSUPPORTED_COMPRESSION_TYPE : just to
> confirm,
> >>> the GetTelemetrySubscriptionsRequest needs to be scheduled immediately
> >>> after the PushTelemetry response, correct?
> >>>
> >>> 3. For "Subsequent GetTelemetrySubscriptionsRequests must include the
> >>> ClientInstanceId returned in the first response, regardless of broker":
> >>> Will a broker error be returned in case some implementation of this KIP
> >>> violates this accidentally and sends a request with ClientInstanceId =
> >> Null
> >>> even when it's been obtained already? Or will a new ClientInstanceId be
> >>> returned without an error?
> >>>
> >>> Thanks!
> >>>
> >>> On Tue, Jun 13, 2023 at 8:38 PM Andrew Schofield <
> >>> andrew_schofield_j...@outlook.com> wrote:
> >>>
> >>>> Hi,
> >>>> I would like to start a new discussion thread on KIP-714: Client
> metrics
> >>>> and observability.
> >>>>
> >>>>
> >>>>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-714%3A+Client+metrics+and+observability
> >>>>
> >>>> I have edited the proposal significantly to reduce the scope. The
> >> overall
> >>>> mechanism for client metric subscriptions is unchanged, but the
> >>>> KIP is now based on the existing client metrics, rather than
> introducing
> >>>> new metrics. The purpose remains helping cluster operators
> >>>> investigate performance problems experienced by clients without
> >> requiring
> >>>> changes to the client application code or configuration.
> >>>>
> >>>> Thanks,
> >>>> Andrew
>
>
>

Re: [DISCUSS] KIP-714: Client metrics and observability

Reply via email to