Sure Jun, My thought process was regarding if there exists a subset of fetch requests which takes considerably higher fetch time than others, then Meter will not reflect that but Histogram can. However I agree on the aggregation part and moved the metric to Meter.
Regards, Apoorv Mittal On Wed, Nov 13, 2024 at 7:23 PM Jun Rao <j...@confluent.io.invalid> wrote: > Hi, Apoorv, > > There is some difference between the two. For > RequestTopicPartitionsFetchRatio, we measure a ratio per request and it > makes sense to aggregate them as a histogram at the group level. > For FetchLockRatio, if a request locks a partition for a certain amount of > time, its impact is directly reflected in FetchLockRatio. So, in some > sense, FetchLockRatio is already the result of aggregation across different > requests and there is no need for further aggregation. > > Thanks, > > Jun > > On Wed, Nov 13, 2024 at 10:41 AM Apoorv Mittal <apoorvmitta...@gmail.com> > wrote: > > > HI Jun, > > I think it would be worth to see the distribution as well for > > FetchLockRatio as we have the distribution captured for > > RequestTopicPartitionsFetchRatio. The latter has been already marked as > > Histogram, per your earlier suggestion. > > > > Regards, > > Apoorv Mittal > > > > > > On Wed, Nov 13, 2024 at 5:53 PM Jun Rao <j...@confluent.io.invalid> > wrote: > > > > > Hi, Apoorv, > > > > > > Thanks for the updated KIP. Just one minor comment. > > > > > > Why is FetchLockRatio a Histogram? Currently, all ratios are measured > as > > > Meter. > > > > > > Jun > > > > > > On Thu, Nov 7, 2024 at 4:05 PM Apoorv Mittal <apoorvmitta...@gmail.com > > > > > wrote: > > > > > > > Hi Jun, > > > > > > > > 11. Got it, agree. I was thinking either to have FetchLockRatio or > > > > FetchLockIdleRatio, though both of them should convey the same > > > > information. I have added FetchLockRatio as per your suggestion. > > > > Also I have added TopicPartitionsAcquireTimeMs which can track the > time > > > > spent by any share fetch request to acquire any share partition. It's > > > hard > > > > to track the amount of time spent on waiting for a single share > > partition > > > > but we can track the time spent to acquire any. > > > > > > > > 14. Corrected, thanks. > > > > > > > > Regards, > > > > Apoorv Mittal > > > > > > > > > > > > On Thu, Nov 7, 2024 at 10:16 PM Jun Rao <j...@confluent.io.invalid> > > > wrote: > > > > > > > > > Hi, Apoorv, > > > > > > > > > > 11. I was thinking that it would be useful to know if there is any > > > > tension > > > > > when acquiring the share partition lock. This is not quite captured > > by > > > > > FetchLockTimeMs. Perhaps we could add another metric that measures > > the > > > > > amount of time that a request needs to wait before acquiring a > lock. > > > > > > > > > > 14. Should we change share-partition={group-topic-partition} now > that > > > we > > > > > have separate tags for group, topic and partition? > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Thu, Nov 7, 2024 at 12:51 PM Apoorv Mittal < > > > apoorvmitta...@gmail.com> > > > > > wrote: > > > > > > > > > > > Hi Jun and Chia, > > > > > > Thanks for the feedback. > > > > > > > > > > > > 10. Done. > > > > > > > > > > > > 11. I have moved FetchLockTimeMs to Histogram. I didn't > understand > > > > > > which ratio shall FetchLockRatio Metric should hold i.e. against > > what > > > > > value > > > > > > we should compare the partitions locked. Though I have added > Metric > > > > > > RequestTopicPartitionsFetchRatio which provides a ratio of > > partition > > > > > locked > > > > > > vs requested. Shall we have something else as well? > > > > > > > > > > > > 12. Yeah, true. Done. > > > > > > > > > > > > AS0: Make sense, I have moved to group, topic, partition tags as > > like > > > > > > Kafka.cluster Partition metrics. > > > > > > > > > > > > AS1: Successful AcknowledgementRequests are also important and > they > > > are > > > > > > already defined in KIP-932 broker metrics. We shall change the > > metric > > > > > name > > > > > > for successful AcknowledgementRequests in KIP-932 itself while > > > > > implementing > > > > > > these metrics. > > > > > > > > > > > > Regards, > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > On Thu, Nov 7, 2024 at 7:36 PM Chia-Ping Tsai < > chia7...@apache.org > > > > > > > > wrote: > > > > > > > > > > > > > hi Apoorv > > > > > > > > > > > > > > sorry for late response: > > > > > > > > > > > > > > AS0: How can users distinguish between the group and topic in > the > > > > > format > > > > > > > share-partition={group-topic-partition} since `-` is a valid > > > > character > > > > > in > > > > > > > both group and topic? > > > > > > > > > > > > > > AS1: Why do we only have metrics for failed > > > AcknowledgementRequests? > > > > > > > Aren't successful AcknowledgementRequests also important for > > > metrics? > > > > > > > > > > > > > > Thanks, > > > > > > > Chia-Ping > > > > > > > > > > > > > > On 2024/11/06 13:58:08 Apoorv Mittal wrote: > > > > > > > > Hi All, > > > > > > > > Please let me know if there is any other feedback. Else will > > > start > > > > a > > > > > > > voting > > > > > > > > thread tomorrow. > > > > > > > > > > > > > > > > Regards, > > > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 12:21 PM Apoorv Mittal < > > > > > > apoorvmitta...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Thanks for the feedback Andrew. > > > > > > > > > > > > > > > > > > AS9: Yes, you are correct it's topic name and not topic id. > > > > > > > > > > > > > > > > > > AS10: Thanks, I have updated the description. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 10:56 AM Andrew Schofield < > > > > > > > > > andrew_schofield_j...@outlook.com> wrote: > > > > > > > > > > > > > > > > > >> Hi Apoorv, > > > > > > > > >> A couple of additional comments. > > > > > > > > >> > > > > > > > > >> AS9: The tag share-partition : {group-topic-partition} > > confirm > > > > the > > > > > > > > >> format. I suppose it's > > > > > > > > >> one tag `group-id:topic-name:partition`, and not topic ID. > > > Topic > > > > > ID > > > > > > is > > > > > > > > >> clearly more > > > > > > > > >> authoritative, but I don't think it's usable in metrics. > > > > > > > > >> > > > > > > > > >> AS10: The description of the AcquisitionLockTimeoutMs > would > > be > > > > > > better > > > > > > > as > > > > > > > > >> "Tracks the number of acquisition locks for records which > > are > > > > not > > > > > > > > >> acknowledged within the timeout." > > > > > > > > >> > > > > > > > > >> Thanks, > > > > > > > > >> Andrew > > > > > > > > >> > > > > > > > > >> ________________________________________ > > > > > > > > >> From: Apoorv Mittal <apoorvmitta...@gmail.com> > > > > > > > > >> Sent: 05 November 2024 08:59 > > > > > > > > >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > > > > >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for > > > > > cooperative > > > > > > > > >> consumption > > > > > > > > >> > > > > > > > > >> Hi All, > > > > > > > > >> Please let me know if there is any other feedback or > > > > suggestions. > > > > > > > > >> > > > > > > > > >> Regards, > > > > > > > > >> Apoorv Mittal > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> On Fri, Nov 1, 2024 at 5:30 PM Apoorv Mittal < > > > > > > > apoorvmitta...@gmail.com> > > > > > > > > >> wrote: > > > > > > > > >> > > > > > > > > >> > Thanks Jun and Andrew for reviewing and feedback. > > > > > > > > >> > > > > > > > > > >> > J1: I was thinking of having metrics as per the APIs in > > > > > > > > >> > SharePartitionManager but I think you are right. This is > > > > > something > > > > > > > not > > > > > > > > >> > beneficial to track as the release should only happen > > when a > > > > > > client > > > > > > > > >> sends a > > > > > > > > >> > leave group request. I have removed this metric. > > > > > > > > >> > > > > > > > > > >> > J2: Moved to Histogram. > > > > > > > > >> > > > > > > > > > >> > J3: Moved to Meter. I have removed > > > > > > > RequestTopicPartitionsFetchEmptyCount > > > > > > > > >> > from KIP (as mentioned by Andrew - AS8) as > > > > > > > > >> RequestTopicPartitionsFetchRatio > > > > > > > > >> > moved to Histogram. Also I have marked > > InFlightMessageCount > > > as > > > > > > > Meter, so > > > > > > > > >> > shall capture the rate better. > > > > > > > > >> > > > > > > > > > >> > AS0: Done, specified specifically in the Motivation > > section. > > > > > > > > >> > > > > > > > > > >> > AS1: Hmmm, I thought about that earlier and was thinking > > > about > > > > > > > keeping > > > > > > > > >> the > > > > > > > > >> > abstracted share fetch APIs metrics in ShareGroupMetrics > > > > itself. > > > > > > > > >> However I > > > > > > > > >> > think the Share Fetch and Share Acknowledge can be part > of > > > > > > > > >> BrokerTopicMetrics > > > > > > > > >> > as these operations happen by Share Consumer. Do you > think > > > we > > > > > > should > > > > > > > > >> also > > > > > > > > >> > move the RequestTopicPartitionsFetchRatio metric? I > think > > we > > > > > > > shouldn't > > > > > > > > >> and > > > > > > > > >> > it should be in ShareGroupMetrics itself, please let me > > know > > > > > what > > > > > > > you > > > > > > > > >> think. > > > > > > > > >> > > > > > > > > > >> > AS2: Yes you are right, it can support all topic rate as > > > > well. I > > > > > > > have > > > > > > > > >> > added the suggestion. > > > > > > > > >> > > > > > > > > > >> > AS3: I have removed this metric as part of J1 comment. > > > > > > > > >> > > > > > > > > > >> > AS4,AS6: We already have a share-acknowledgement Meter > > > metric > > > > > > > defined in > > > > > > > > >> > KIP-932. I am of the opinion to move the KIP-932 > > > > > > > share-acknowledgement > > > > > > > > >> > metric to TotalShareAcknowledgementRequestsPerSec. > Hence I > > > > have > > > > > > > added > > > > > > > > >> > corresponding FailedShareAcknowledgementRequestsPerSec > in > > > this > > > > > > KIP. > > > > > > > > >> > > > > > > > > > >> > AS5: It counts the failure of a share acknowledge > request. > > > > > Similar > > > > > > > to > > > > > > > > >> > multi topic partition fetch, acknowledge can also be > multi > > > > topic > > > > > > > > >> partition > > > > > > > > >> > but a single acknowledge request might either fail > > > completely > > > > or > > > > > > at > > > > > > > > >> topic > > > > > > > > >> > partition level. Hence this metric tracks the failure as > > > > similar > > > > > > to > > > > > > > > >> > FailedFetchRequestsPerSec or > > FailedShareFetchRequestsPerSec. > > > > > > > > >> > > > > > > > > > >> > AS7: I have added some examples and another metric > > > > > > FetchLockTimeMs, > > > > > > > > >> which > > > > > > > > >> > should help in optimizing behaviour. > > > > > > > > >> > > > > > > > > > >> > AS8: Yes, we might be able to work with low percentiles > to > > > > > > > investigate > > > > > > > > >> > further hence I have removed > > > > > > RequestTopicPartitionsFetchEmptyCount. > > > > > > > > >> > However Histogram will give us percentiles but not the > raw > > > > > > bucketed > > > > > > > > >> > value, in case we think we need such data explicitly > then > > > > might > > > > > > > consider > > > > > > > > >> > adding it but for now it's removed. > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > Regards, > > > > > > > > >> > Apoorv Mittal > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > On Thu, Oct 31, 2024 at 1:03 PM Andrew Schofield < > > > > > > > > >> > andrew_schofield_j...@outlook.com> wrote: > > > > > > > > >> > > > > > > > > > >> >> Hi Apoorv, > > > > > > > > >> >> Thanks for the KIP. I have some comments. > > > > > > > > >> >> > > > > > > > > >> >> AS0: These are broker metrics and not client metrics. I > > > don't > > > > > > > think the > > > > > > > > >> >> KIP states that and it probably should. > > > > > > > > >> >> > > > > > > > > >> >> AS1: TotalShareFetchRequestsPerSec is inspired by the > > > > existing > > > > > > > metric > > > > > > > > >> >> TotalFetchRequestsPerSec. That one is in the > > > > BrokerTopicMetrics > > > > > > > group > > > > > > > > >> >> so I would expect the new metric to be in the same > group. > > > > This > > > > > > > comment > > > > > > > > >> >> applies to all of the ShareGroupMetrics broker metrics > > > scoped > > > > > to > > > > > > a > > > > > > > > >> topic. > > > > > > > > >> >> > > > > > > > > >> >> AS2: Similarly to TotalFetchRequestPerSec, does the new > > > > metric > > > > > > > support > > > > > > > > >> >> omitting the topic tag in order to obtain the all-topic > > > > rate? I > > > > > > > > >> believe so > > > > > > > > >> >> because it follows the existing metrics. I guess the > tag > > is > > > > > > > actually > > > > > > > > >> >> `topic=([-.\w]+)` to copy the existing notation. > > > > > > > > >> >> > > > > > > > > >> >> AS3: What's a share release? Does it count releases > which > > > are > > > > > > made > > > > > > > > >> >> by ShareFetch/Acknowledge requests as well as implicit > > ones > > > > as > > > > > a > > > > > > > > >> >> result of share session expiration or closure? > > > > > > > > >> >> > > > > > > > > >> >> AS4: I see there is > > > FailedShareAcknowledgementRequestsPerSec. > > > > > > > > >> >> If this corresponds specifically to the > ShareAcknowledge > > > API, > > > > > > then > > > > > > > > >> >> the name should be > FailedShareAcknowledgeRequestsPerSec. > > > > > > > > >> >> > > > > > > > > >> >> AS5: What does FailedShareAcknowledgementRequestsPerSec > > > > > > > > >> >> actually count? This is a per-topic metric and a > > > > > ShareAcknowledge > > > > > > > > >> >> can mention several topics, but actually it can also > > > operate > > > > on > > > > > > > many > > > > > > > > >> >> acquired records specifying any of the different > > > > > acknowledgement > > > > > > > > >> >> types that KIP-932 defines. > > > > > > > > >> >> > > > > > > > > >> >> AS6: Why no TotalShareAcknowledgementRequestsPerSec? > > > > > > > > >> >> > > > > > > > > >> >> AS7: I think I like the concept behind > > > > > > > > >> RequestTopicPartitionsFetchRatio, > > > > > > > > >> >> but it would really help to include some examples of > how > > it > > > > > > works. > > > > > > > You > > > > > > > > >> >> mention 4 potential causes for low ratios, so it would > be > > > > nice > > > > > to > > > > > > > see > > > > > > > > >> how > > > > > > > > >> >> each of them would be evident from the metrics and > > > > > > configurations. > > > > > > > > >> >> This is going to be valuable information for operating > > this > > > > in > > > > > > > > >> production. > > > > > > > > >> >> > > > > > > > > >> >> AS8: As Jun mentioned, maybe a histogram would be > > > appropriate > > > > > > > > >> >> for RequestTopicPartitionsFetchRatio. Then I wonder > > whether > > > > > it's > > > > > > > > >> >> worth having a separate > > > RequestTopicPartitionsFetchEmptyCount > > > > > > since > > > > > > > > >> >> those are requests for which the fetch ratio was 0. > > > > > > > > >> >> > > > > > > > > >> >> Thanks, > > > > > > > > >> >> Andrew > > > > > > > > >> >> ________________________________________ > > > > > > > > >> >> From: Jun Rao <j...@confluent.io.INVALID> > > > > > > > > >> >> Sent: 30 October 2024 23:07 > > > > > > > > >> >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > > > > >> >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for > > > > > > cooperative > > > > > > > > >> >> consumption > > > > > > > > >> >> > > > > > > > > >> >> Hi, Apoorv, > > > > > > > > >> >> > > > > > > > > >> >> Thanks for the KIP. A few comments. > > > > > > > > >> >> > > > > > > > > >> >> 1. TotalShareReleaseRequestsPerSec: This is a bit weird > > > since > > > > > > > there is > > > > > > > > >> >> no ShareReleaseRequest. > > > > > > > > >> >> > > > > > > > > >> >> 2. RequestTopicPartitionsFetchRatio and > > > > > > InFlightBatchMessageCount: > > > > > > > It > > > > > > > > >> >> seems > > > > > > > > >> >> Histogram is more appropriate for them. > > > > > > > > >> >> > > > > > > > > >> >> 3. RequestTopicPartitionsFetchEmptyCount and > > > > > > > > >> AcquisitionLockTimeoutCount: > > > > > > > > >> >> The problem with using gauge is that if the value > changes > > > > > > quickly, > > > > > > > we > > > > > > > > >> may > > > > > > > > >> >> not be able to capture the occurrence of empty fetch or > > > lock > > > > > > > timeout. > > > > > > > > >> >> Meter > > > > > > > > >> >> captures those rare events better. > > > > > > > > >> >> > > > > > > > > >> >> Jun > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > > >> >> On Tue, Oct 29, 2024 at 8:48 AM Apoorv Mittal < > > > > > > > > >> apoorvmitta...@gmail.com> > > > > > > > > >> >> wrote: > > > > > > > > >> >> > > > > > > > > >> >> > Hi Everyone, > > > > > > > > >> >> > I would like to start a discussion on KIP-1103: > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1103%3A+Additional+metrics+for+cooperative+consumption > > > > > > > > >> >> > > > > > > > > > >> >> > This KIP extends KIP-932 to provide additional > metrics > > > for > > > > > > > > >> >> > Queues/Cooperative consumption. > > > > > > > > >> >> > > > > > > > > > >> >> > Regards, > > > > > > > > >> >> > Apoorv Mittal > > > > > > > > >> >> > > > > > > > > > >> >> > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >