Thanks for the update, Apoorv. The KIP looks to me.
Jun On Wed, Nov 13, 2024 at 11:36 AM Apoorv Mittal <apoorvmitta...@gmail.com> wrote: > Sure Jun, > My thought process was regarding if there exists a subset of fetch requests > which takes considerably higher fetch time than others, then Meter will not > reflect that but Histogram can. However I agree on the aggregation part and > moved the metric to Meter. > > Regards, > Apoorv Mittal > > > On Wed, Nov 13, 2024 at 7:23 PM Jun Rao <j...@confluent.io.invalid> wrote: > > > Hi, Apoorv, > > > > There is some difference between the two. For > > RequestTopicPartitionsFetchRatio, we measure a ratio per request and it > > makes sense to aggregate them as a histogram at the group level. > > For FetchLockRatio, if a request locks a partition for a certain amount > of > > time, its impact is directly reflected in FetchLockRatio. So, in some > > sense, FetchLockRatio is already the result of aggregation across > different > > requests and there is no need for further aggregation. > > > > Thanks, > > > > Jun > > > > On Wed, Nov 13, 2024 at 10:41 AM Apoorv Mittal <apoorvmitta...@gmail.com > > > > wrote: > > > > > HI Jun, > > > I think it would be worth to see the distribution as well for > > > FetchLockRatio as we have the distribution captured for > > > RequestTopicPartitionsFetchRatio. The latter has been already marked as > > > Histogram, per your earlier suggestion. > > > > > > Regards, > > > Apoorv Mittal > > > > > > > > > On Wed, Nov 13, 2024 at 5:53 PM Jun Rao <j...@confluent.io.invalid> > > wrote: > > > > > > > Hi, Apoorv, > > > > > > > > Thanks for the updated KIP. Just one minor comment. > > > > > > > > Why is FetchLockRatio a Histogram? Currently, all ratios are measured > > as > > > > Meter. > > > > > > > > Jun > > > > > > > > On Thu, Nov 7, 2024 at 4:05 PM Apoorv Mittal < > apoorvmitta...@gmail.com > > > > > > > wrote: > > > > > > > > > Hi Jun, > > > > > > > > > > 11. Got it, agree. I was thinking either to have FetchLockRatio or > > > > > FetchLockIdleRatio, though both of them should convey the same > > > > > information. I have added FetchLockRatio as per your suggestion. > > > > > Also I have added TopicPartitionsAcquireTimeMs which can track the > > time > > > > > spent by any share fetch request to acquire any share partition. > It's > > > > hard > > > > > to track the amount of time spent on waiting for a single share > > > partition > > > > > but we can track the time spent to acquire any. > > > > > > > > > > 14. Corrected, thanks. > > > > > > > > > > Regards, > > > > > Apoorv Mittal > > > > > > > > > > > > > > > On Thu, Nov 7, 2024 at 10:16 PM Jun Rao <j...@confluent.io.invalid> > > > > wrote: > > > > > > > > > > > Hi, Apoorv, > > > > > > > > > > > > 11. I was thinking that it would be useful to know if there is > any > > > > > tension > > > > > > when acquiring the share partition lock. This is not quite > captured > > > by > > > > > > FetchLockTimeMs. Perhaps we could add another metric that > measures > > > the > > > > > > amount of time that a request needs to wait before acquiring a > > lock. > > > > > > > > > > > > 14. Should we change share-partition={group-topic-partition} now > > that > > > > we > > > > > > have separate tags for group, topic and partition? > > > > > > > > > > > > Thanks, > > > > > > > > > > > > Jun > > > > > > > > > > > > On Thu, Nov 7, 2024 at 12:51 PM Apoorv Mittal < > > > > apoorvmitta...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Hi Jun and Chia, > > > > > > > Thanks for the feedback. > > > > > > > > > > > > > > 10. Done. > > > > > > > > > > > > > > 11. I have moved FetchLockTimeMs to Histogram. I didn't > > understand > > > > > > > which ratio shall FetchLockRatio Metric should hold i.e. > against > > > what > > > > > > value > > > > > > > we should compare the partitions locked. Though I have added > > Metric > > > > > > > RequestTopicPartitionsFetchRatio which provides a ratio of > > > partition > > > > > > locked > > > > > > > vs requested. Shall we have something else as well? > > > > > > > > > > > > > > 12. Yeah, true. Done. > > > > > > > > > > > > > > AS0: Make sense, I have moved to group, topic, partition tags > as > > > like > > > > > > > Kafka.cluster Partition metrics. > > > > > > > > > > > > > > AS1: Successful AcknowledgementRequests are also important and > > they > > > > are > > > > > > > already defined in KIP-932 broker metrics. We shall change the > > > metric > > > > > > name > > > > > > > for successful AcknowledgementRequests in KIP-932 itself while > > > > > > implementing > > > > > > > these metrics. > > > > > > > > > > > > > > Regards, > > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > > > > On Thu, Nov 7, 2024 at 7:36 PM Chia-Ping Tsai < > > chia7...@apache.org > > > > > > > > > > wrote: > > > > > > > > > > > > > > > hi Apoorv > > > > > > > > > > > > > > > > sorry for late response: > > > > > > > > > > > > > > > > AS0: How can users distinguish between the group and topic in > > the > > > > > > format > > > > > > > > share-partition={group-topic-partition} since `-` is a valid > > > > > character > > > > > > in > > > > > > > > both group and topic? > > > > > > > > > > > > > > > > AS1: Why do we only have metrics for failed > > > > AcknowledgementRequests? > > > > > > > > Aren't successful AcknowledgementRequests also important for > > > > metrics? > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Chia-Ping > > > > > > > > > > > > > > > > On 2024/11/06 13:58:08 Apoorv Mittal wrote: > > > > > > > > > Hi All, > > > > > > > > > Please let me know if there is any other feedback. Else > will > > > > start > > > > > a > > > > > > > > voting > > > > > > > > > thread tomorrow. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 12:21 PM Apoorv Mittal < > > > > > > > apoorvmitta...@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Thanks for the feedback Andrew. > > > > > > > > > > > > > > > > > > > > AS9: Yes, you are correct it's topic name and not topic > id. > > > > > > > > > > > > > > > > > > > > AS10: Thanks, I have updated the description. > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 10:56 AM Andrew Schofield < > > > > > > > > > > andrew_schofield_j...@outlook.com> wrote: > > > > > > > > > > > > > > > > > > > >> Hi Apoorv, > > > > > > > > > >> A couple of additional comments. > > > > > > > > > >> > > > > > > > > > >> AS9: The tag share-partition : {group-topic-partition} > > > confirm > > > > > the > > > > > > > > > >> format. I suppose it's > > > > > > > > > >> one tag `group-id:topic-name:partition`, and not topic > ID. > > > > Topic > > > > > > ID > > > > > > > is > > > > > > > > > >> clearly more > > > > > > > > > >> authoritative, but I don't think it's usable in metrics. > > > > > > > > > >> > > > > > > > > > >> AS10: The description of the AcquisitionLockTimeoutMs > > would > > > be > > > > > > > better > > > > > > > > as > > > > > > > > > >> "Tracks the number of acquisition locks for records > which > > > are > > > > > not > > > > > > > > > >> acknowledged within the timeout." > > > > > > > > > >> > > > > > > > > > >> Thanks, > > > > > > > > > >> Andrew > > > > > > > > > >> > > > > > > > > > >> ________________________________________ > > > > > > > > > >> From: Apoorv Mittal <apoorvmitta...@gmail.com> > > > > > > > > > >> Sent: 05 November 2024 08:59 > > > > > > > > > >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > > > > > >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for > > > > > > cooperative > > > > > > > > > >> consumption > > > > > > > > > >> > > > > > > > > > >> Hi All, > > > > > > > > > >> Please let me know if there is any other feedback or > > > > > suggestions. > > > > > > > > > >> > > > > > > > > > >> Regards, > > > > > > > > > >> Apoorv Mittal > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> On Fri, Nov 1, 2024 at 5:30 PM Apoorv Mittal < > > > > > > > > apoorvmitta...@gmail.com> > > > > > > > > > >> wrote: > > > > > > > > > >> > > > > > > > > > >> > Thanks Jun and Andrew for reviewing and feedback. > > > > > > > > > >> > > > > > > > > > > >> > J1: I was thinking of having metrics as per the APIs > in > > > > > > > > > >> > SharePartitionManager but I think you are right. This > is > > > > > > something > > > > > > > > not > > > > > > > > > >> > beneficial to track as the release should only happen > > > when a > > > > > > > client > > > > > > > > > >> sends a > > > > > > > > > >> > leave group request. I have removed this metric. > > > > > > > > > >> > > > > > > > > > > >> > J2: Moved to Histogram. > > > > > > > > > >> > > > > > > > > > > >> > J3: Moved to Meter. I have removed > > > > > > > > RequestTopicPartitionsFetchEmptyCount > > > > > > > > > >> > from KIP (as mentioned by Andrew - AS8) as > > > > > > > > > >> RequestTopicPartitionsFetchRatio > > > > > > > > > >> > moved to Histogram. Also I have marked > > > InFlightMessageCount > > > > as > > > > > > > > Meter, so > > > > > > > > > >> > shall capture the rate better. > > > > > > > > > >> > > > > > > > > > > >> > AS0: Done, specified specifically in the Motivation > > > section. > > > > > > > > > >> > > > > > > > > > > >> > AS1: Hmmm, I thought about that earlier and was > thinking > > > > about > > > > > > > > keeping > > > > > > > > > >> the > > > > > > > > > >> > abstracted share fetch APIs metrics in > ShareGroupMetrics > > > > > itself. > > > > > > > > > >> However I > > > > > > > > > >> > think the Share Fetch and Share Acknowledge can be > part > > of > > > > > > > > > >> BrokerTopicMetrics > > > > > > > > > >> > as these operations happen by Share Consumer. Do you > > think > > > > we > > > > > > > should > > > > > > > > > >> also > > > > > > > > > >> > move the RequestTopicPartitionsFetchRatio metric? I > > think > > > we > > > > > > > > shouldn't > > > > > > > > > >> and > > > > > > > > > >> > it should be in ShareGroupMetrics itself, please let > me > > > know > > > > > > what > > > > > > > > you > > > > > > > > > >> think. > > > > > > > > > >> > > > > > > > > > > >> > AS2: Yes you are right, it can support all topic rate > as > > > > > well. I > > > > > > > > have > > > > > > > > > >> > added the suggestion. > > > > > > > > > >> > > > > > > > > > > >> > AS3: I have removed this metric as part of J1 comment. > > > > > > > > > >> > > > > > > > > > > >> > AS4,AS6: We already have a share-acknowledgement Meter > > > > metric > > > > > > > > defined in > > > > > > > > > >> > KIP-932. I am of the opinion to move the KIP-932 > > > > > > > > share-acknowledgement > > > > > > > > > >> > metric to TotalShareAcknowledgementRequestsPerSec. > > Hence I > > > > > have > > > > > > > > added > > > > > > > > > >> > corresponding FailedShareAcknowledgementRequestsPerSec > > in > > > > this > > > > > > > KIP. > > > > > > > > > >> > > > > > > > > > > >> > AS5: It counts the failure of a share acknowledge > > request. > > > > > > Similar > > > > > > > > to > > > > > > > > > >> > multi topic partition fetch, acknowledge can also be > > multi > > > > > topic > > > > > > > > > >> partition > > > > > > > > > >> > but a single acknowledge request might either fail > > > > completely > > > > > or > > > > > > > at > > > > > > > > > >> topic > > > > > > > > > >> > partition level. Hence this metric tracks the failure > as > > > > > similar > > > > > > > to > > > > > > > > > >> > FailedFetchRequestsPerSec or > > > FailedShareFetchRequestsPerSec. > > > > > > > > > >> > > > > > > > > > > >> > AS7: I have added some examples and another metric > > > > > > > FetchLockTimeMs, > > > > > > > > > >> which > > > > > > > > > >> > should help in optimizing behaviour. > > > > > > > > > >> > > > > > > > > > > >> > AS8: Yes, we might be able to work with low > percentiles > > to > > > > > > > > investigate > > > > > > > > > >> > further hence I have removed > > > > > > > RequestTopicPartitionsFetchEmptyCount. > > > > > > > > > >> > However Histogram will give us percentiles but not the > > raw > > > > > > > bucketed > > > > > > > > > >> > value, in case we think we need such data explicitly > > then > > > > > might > > > > > > > > consider > > > > > > > > > >> > adding it but for now it's removed. > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > Regards, > > > > > > > > > >> > Apoorv Mittal > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > On Thu, Oct 31, 2024 at 1:03 PM Andrew Schofield < > > > > > > > > > >> > andrew_schofield_j...@outlook.com> wrote: > > > > > > > > > >> > > > > > > > > > > >> >> Hi Apoorv, > > > > > > > > > >> >> Thanks for the KIP. I have some comments. > > > > > > > > > >> >> > > > > > > > > > >> >> AS0: These are broker metrics and not client > metrics. I > > > > don't > > > > > > > > think the > > > > > > > > > >> >> KIP states that and it probably should. > > > > > > > > > >> >> > > > > > > > > > >> >> AS1: TotalShareFetchRequestsPerSec is inspired by the > > > > > existing > > > > > > > > metric > > > > > > > > > >> >> TotalFetchRequestsPerSec. That one is in the > > > > > BrokerTopicMetrics > > > > > > > > group > > > > > > > > > >> >> so I would expect the new metric to be in the same > > group. > > > > > This > > > > > > > > comment > > > > > > > > > >> >> applies to all of the ShareGroupMetrics broker > metrics > > > > scoped > > > > > > to > > > > > > > a > > > > > > > > > >> topic. > > > > > > > > > >> >> > > > > > > > > > >> >> AS2: Similarly to TotalFetchRequestPerSec, does the > new > > > > > metric > > > > > > > > support > > > > > > > > > >> >> omitting the topic tag in order to obtain the > all-topic > > > > > rate? I > > > > > > > > > >> believe so > > > > > > > > > >> >> because it follows the existing metrics. I guess the > > tag > > > is > > > > > > > > actually > > > > > > > > > >> >> `topic=([-.\w]+)` to copy the existing notation. > > > > > > > > > >> >> > > > > > > > > > >> >> AS3: What's a share release? Does it count releases > > which > > > > are > > > > > > > made > > > > > > > > > >> >> by ShareFetch/Acknowledge requests as well as > implicit > > > ones > > > > > as > > > > > > a > > > > > > > > > >> >> result of share session expiration or closure? > > > > > > > > > >> >> > > > > > > > > > >> >> AS4: I see there is > > > > FailedShareAcknowledgementRequestsPerSec. > > > > > > > > > >> >> If this corresponds specifically to the > > ShareAcknowledge > > > > API, > > > > > > > then > > > > > > > > > >> >> the name should be > > FailedShareAcknowledgeRequestsPerSec. > > > > > > > > > >> >> > > > > > > > > > >> >> AS5: What does > FailedShareAcknowledgementRequestsPerSec > > > > > > > > > >> >> actually count? This is a per-topic metric and a > > > > > > ShareAcknowledge > > > > > > > > > >> >> can mention several topics, but actually it can also > > > > operate > > > > > on > > > > > > > > many > > > > > > > > > >> >> acquired records specifying any of the different > > > > > > acknowledgement > > > > > > > > > >> >> types that KIP-932 defines. > > > > > > > > > >> >> > > > > > > > > > >> >> AS6: Why no TotalShareAcknowledgementRequestsPerSec? > > > > > > > > > >> >> > > > > > > > > > >> >> AS7: I think I like the concept behind > > > > > > > > > >> RequestTopicPartitionsFetchRatio, > > > > > > > > > >> >> but it would really help to include some examples of > > how > > > it > > > > > > > works. > > > > > > > > You > > > > > > > > > >> >> mention 4 potential causes for low ratios, so it > would > > be > > > > > nice > > > > > > to > > > > > > > > see > > > > > > > > > >> how > > > > > > > > > >> >> each of them would be evident from the metrics and > > > > > > > configurations. > > > > > > > > > >> >> This is going to be valuable information for > operating > > > this > > > > > in > > > > > > > > > >> production. > > > > > > > > > >> >> > > > > > > > > > >> >> AS8: As Jun mentioned, maybe a histogram would be > > > > appropriate > > > > > > > > > >> >> for RequestTopicPartitionsFetchRatio. Then I wonder > > > whether > > > > > > it's > > > > > > > > > >> >> worth having a separate > > > > RequestTopicPartitionsFetchEmptyCount > > > > > > > since > > > > > > > > > >> >> those are requests for which the fetch ratio was 0. > > > > > > > > > >> >> > > > > > > > > > >> >> Thanks, > > > > > > > > > >> >> Andrew > > > > > > > > > >> >> ________________________________________ > > > > > > > > > >> >> From: Jun Rao <j...@confluent.io.INVALID> > > > > > > > > > >> >> Sent: 30 October 2024 23:07 > > > > > > > > > >> >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > > > > > >> >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics > for > > > > > > > cooperative > > > > > > > > > >> >> consumption > > > > > > > > > >> >> > > > > > > > > > >> >> Hi, Apoorv, > > > > > > > > > >> >> > > > > > > > > > >> >> Thanks for the KIP. A few comments. > > > > > > > > > >> >> > > > > > > > > > >> >> 1. TotalShareReleaseRequestsPerSec: This is a bit > weird > > > > since > > > > > > > > there is > > > > > > > > > >> >> no ShareReleaseRequest. > > > > > > > > > >> >> > > > > > > > > > >> >> 2. RequestTopicPartitionsFetchRatio and > > > > > > > InFlightBatchMessageCount: > > > > > > > > It > > > > > > > > > >> >> seems > > > > > > > > > >> >> Histogram is more appropriate for them. > > > > > > > > > >> >> > > > > > > > > > >> >> 3. RequestTopicPartitionsFetchEmptyCount and > > > > > > > > > >> AcquisitionLockTimeoutCount: > > > > > > > > > >> >> The problem with using gauge is that if the value > > changes > > > > > > > quickly, > > > > > > > > we > > > > > > > > > >> may > > > > > > > > > >> >> not be able to capture the occurrence of empty fetch > or > > > > lock > > > > > > > > timeout. > > > > > > > > > >> >> Meter > > > > > > > > > >> >> captures those rare events better. > > > > > > > > > >> >> > > > > > > > > > >> >> Jun > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> On Tue, Oct 29, 2024 at 8:48 AM Apoorv Mittal < > > > > > > > > > >> apoorvmitta...@gmail.com> > > > > > > > > > >> >> wrote: > > > > > > > > > >> >> > > > > > > > > > >> >> > Hi Everyone, > > > > > > > > > >> >> > I would like to start a discussion on KIP-1103: > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1103%3A+Additional+metrics+for+cooperative+consumption > > > > > > > > > >> >> > > > > > > > > > > >> >> > This KIP extends KIP-932 to provide additional > > metrics > > > > for > > > > > > > > > >> >> > Queues/Cooperative consumption. > > > > > > > > > >> >> > > > > > > > > > > >> >> > Regards, > > > > > > > > > >> >> > Apoorv Mittal > > > > > > > > > >> >> > > > > > > > > > > >> >> > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >