Hi, Apoorv, Thanks for the updated KIP. A few more minor comments.
10. Should InFlightMessageCount be Gauge to match InFlightBatchCount? 11. It seems that FetchLockTimeMs should be Histogram since we care about the distribution over time instead of just the current value. Also, it might be useful to have another metric like FetchLockRatio that estimates the fraction of the time a partition is being locked. 12. Sould AcquisitionLockTimeoutCount be AcquisitionLockTimeoutPerSec since it's a Meter? Jun On Wed, Nov 6, 2024 at 6:00 AM Apoorv Mittal <apoorvmitta...@gmail.com> wrote: > Hi All, > Please let me know if there is any other feedback. Else will start a voting > thread tomorrow. > > Regards, > Apoorv Mittal > > > On Tue, Nov 5, 2024 at 12:21 PM Apoorv Mittal <apoorvmitta...@gmail.com> > wrote: > > > Thanks for the feedback Andrew. > > > > AS9: Yes, you are correct it's topic name and not topic id. > > > > AS10: Thanks, I have updated the description. > > > > Regards, > > Apoorv Mittal > > > > > > On Tue, Nov 5, 2024 at 10:56 AM Andrew Schofield < > > andrew_schofield_j...@outlook.com> wrote: > > > >> Hi Apoorv, > >> A couple of additional comments. > >> > >> AS9: The tag share-partition : {group-topic-partition} confirm the > >> format. I suppose it's > >> one tag `group-id:topic-name:partition`, and not topic ID. Topic ID is > >> clearly more > >> authoritative, but I don't think it's usable in metrics. > >> > >> AS10: The description of the AcquisitionLockTimeoutMs would be better as > >> "Tracks the number of acquisition locks for records which are not > >> acknowledged within the timeout." > >> > >> Thanks, > >> Andrew > >> > >> ________________________________________ > >> From: Apoorv Mittal <apoorvmitta...@gmail.com> > >> Sent: 05 November 2024 08:59 > >> To: dev@kafka.apache.org <dev@kafka.apache.org> > >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for cooperative > >> consumption > >> > >> Hi All, > >> Please let me know if there is any other feedback or suggestions. > >> > >> Regards, > >> Apoorv Mittal > >> > >> > >> On Fri, Nov 1, 2024 at 5:30 PM Apoorv Mittal <apoorvmitta...@gmail.com> > >> wrote: > >> > >> > Thanks Jun and Andrew for reviewing and feedback. > >> > > >> > J1: I was thinking of having metrics as per the APIs in > >> > SharePartitionManager but I think you are right. This is something not > >> > beneficial to track as the release should only happen when a client > >> sends a > >> > leave group request. I have removed this metric. > >> > > >> > J2: Moved to Histogram. > >> > > >> > J3: Moved to Meter. I have removed > RequestTopicPartitionsFetchEmptyCount > >> > from KIP (as mentioned by Andrew - AS8) as > >> RequestTopicPartitionsFetchRatio > >> > moved to Histogram. Also I have marked InFlightMessageCount as Meter, > so > >> > shall capture the rate better. > >> > > >> > AS0: Done, specified specifically in the Motivation section. > >> > > >> > AS1: Hmmm, I thought about that earlier and was thinking about keeping > >> the > >> > abstracted share fetch APIs metrics in ShareGroupMetrics itself. > >> However I > >> > think the Share Fetch and Share Acknowledge can be part of > >> BrokerTopicMetrics > >> > as these operations happen by Share Consumer. Do you think we should > >> also > >> > move the RequestTopicPartitionsFetchRatio metric? I think we shouldn't > >> and > >> > it should be in ShareGroupMetrics itself, please let me know what you > >> think. > >> > > >> > AS2: Yes you are right, it can support all topic rate as well. I have > >> > added the suggestion. > >> > > >> > AS3: I have removed this metric as part of J1 comment. > >> > > >> > AS4,AS6: We already have a share-acknowledgement Meter metric defined > in > >> > KIP-932. I am of the opinion to move the KIP-932 share-acknowledgement > >> > metric to TotalShareAcknowledgementRequestsPerSec. Hence I have added > >> > corresponding FailedShareAcknowledgementRequestsPerSec in this KIP. > >> > > >> > AS5: It counts the failure of a share acknowledge request. Similar to > >> > multi topic partition fetch, acknowledge can also be multi topic > >> partition > >> > but a single acknowledge request might either fail completely or at > >> topic > >> > partition level. Hence this metric tracks the failure as similar to > >> > FailedFetchRequestsPerSec or FailedShareFetchRequestsPerSec. > >> > > >> > AS7: I have added some examples and another metric FetchLockTimeMs, > >> which > >> > should help in optimizing behaviour. > >> > > >> > AS8: Yes, we might be able to work with low percentiles to investigate > >> > further hence I have removed RequestTopicPartitionsFetchEmptyCount. > >> > However Histogram will give us percentiles but not the raw bucketed > >> > value, in case we think we need such data explicitly then might > consider > >> > adding it but for now it's removed. > >> > > >> > > >> > Regards, > >> > Apoorv Mittal > >> > > >> > > >> > On Thu, Oct 31, 2024 at 1:03 PM Andrew Schofield < > >> > andrew_schofield_j...@outlook.com> wrote: > >> > > >> >> Hi Apoorv, > >> >> Thanks for the KIP. I have some comments. > >> >> > >> >> AS0: These are broker metrics and not client metrics. I don't think > the > >> >> KIP states that and it probably should. > >> >> > >> >> AS1: TotalShareFetchRequestsPerSec is inspired by the existing metric > >> >> TotalFetchRequestsPerSec. That one is in the BrokerTopicMetrics group > >> >> so I would expect the new metric to be in the same group. This > comment > >> >> applies to all of the ShareGroupMetrics broker metrics scoped to a > >> topic. > >> >> > >> >> AS2: Similarly to TotalFetchRequestPerSec, does the new metric > support > >> >> omitting the topic tag in order to obtain the all-topic rate? I > >> believe so > >> >> because it follows the existing metrics. I guess the tag is actually > >> >> `topic=([-.\w]+)` to copy the existing notation. > >> >> > >> >> AS3: What's a share release? Does it count releases which are made > >> >> by ShareFetch/Acknowledge requests as well as implicit ones as a > >> >> result of share session expiration or closure? > >> >> > >> >> AS4: I see there is FailedShareAcknowledgementRequestsPerSec. > >> >> If this corresponds specifically to the ShareAcknowledge API, then > >> >> the name should be FailedShareAcknowledgeRequestsPerSec. > >> >> > >> >> AS5: What does FailedShareAcknowledgementRequestsPerSec > >> >> actually count? This is a per-topic metric and a ShareAcknowledge > >> >> can mention several topics, but actually it can also operate on many > >> >> acquired records specifying any of the different acknowledgement > >> >> types that KIP-932 defines. > >> >> > >> >> AS6: Why no TotalShareAcknowledgementRequestsPerSec? > >> >> > >> >> AS7: I think I like the concept behind > >> RequestTopicPartitionsFetchRatio, > >> >> but it would really help to include some examples of how it works. > You > >> >> mention 4 potential causes for low ratios, so it would be nice to see > >> how > >> >> each of them would be evident from the metrics and configurations. > >> >> This is going to be valuable information for operating this in > >> production. > >> >> > >> >> AS8: As Jun mentioned, maybe a histogram would be appropriate > >> >> for RequestTopicPartitionsFetchRatio. Then I wonder whether it's > >> >> worth having a separate RequestTopicPartitionsFetchEmptyCount since > >> >> those are requests for which the fetch ratio was 0. > >> >> > >> >> Thanks, > >> >> Andrew > >> >> ________________________________________ > >> >> From: Jun Rao <j...@confluent.io.INVALID> > >> >> Sent: 30 October 2024 23:07 > >> >> To: dev@kafka.apache.org <dev@kafka.apache.org> > >> >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for cooperative > >> >> consumption > >> >> > >> >> Hi, Apoorv, > >> >> > >> >> Thanks for the KIP. A few comments. > >> >> > >> >> 1. TotalShareReleaseRequestsPerSec: This is a bit weird since there > is > >> >> no ShareReleaseRequest. > >> >> > >> >> 2. RequestTopicPartitionsFetchRatio and InFlightBatchMessageCount: It > >> >> seems > >> >> Histogram is more appropriate for them. > >> >> > >> >> 3. RequestTopicPartitionsFetchEmptyCount and > >> AcquisitionLockTimeoutCount: > >> >> The problem with using gauge is that if the value changes quickly, we > >> may > >> >> not be able to capture the occurrence of empty fetch or lock timeout. > >> >> Meter > >> >> captures those rare events better. > >> >> > >> >> Jun > >> >> > >> >> > >> >> On Tue, Oct 29, 2024 at 8:48 AM Apoorv Mittal < > >> apoorvmitta...@gmail.com> > >> >> wrote: > >> >> > >> >> > Hi Everyone, > >> >> > I would like to start a discussion on KIP-1103: > >> >> > > >> >> > > >> >> > >> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1103%3A+Additional+metrics+for+cooperative+consumption > >> >> > > >> >> > This KIP extends KIP-932 to provide additional metrics for > >> >> > Queues/Cooperative consumption. > >> >> > > >> >> > Regards, > >> >> > Apoorv Mittal > >> >> > > >> >> > >> > > >> > > >