Hi, Apoorv, There is some difference between the two. For RequestTopicPartitionsFetchRatio, we measure a ratio per request and it makes sense to aggregate them as a histogram at the group level. For FetchLockRatio, if a request locks a partition for a certain amount of time, its impact is directly reflected in FetchLockRatio. So, in some sense, FetchLockRatio is already the result of aggregation across different requests and there is no need for further aggregation.
Thanks, Jun On Wed, Nov 13, 2024 at 10:41 AM Apoorv Mittal <apoorvmitta...@gmail.com> wrote: > HI Jun, > I think it would be worth to see the distribution as well for > FetchLockRatio as we have the distribution captured for > RequestTopicPartitionsFetchRatio. The latter has been already marked as > Histogram, per your earlier suggestion. > > Regards, > Apoorv Mittal > > > On Wed, Nov 13, 2024 at 5:53 PM Jun Rao <j...@confluent.io.invalid> wrote: > > > Hi, Apoorv, > > > > Thanks for the updated KIP. Just one minor comment. > > > > Why is FetchLockRatio a Histogram? Currently, all ratios are measured as > > Meter. > > > > Jun > > > > On Thu, Nov 7, 2024 at 4:05 PM Apoorv Mittal <apoorvmitta...@gmail.com> > > wrote: > > > > > Hi Jun, > > > > > > 11. Got it, agree. I was thinking either to have FetchLockRatio or > > > FetchLockIdleRatio, though both of them should convey the same > > > information. I have added FetchLockRatio as per your suggestion. > > > Also I have added TopicPartitionsAcquireTimeMs which can track the time > > > spent by any share fetch request to acquire any share partition. It's > > hard > > > to track the amount of time spent on waiting for a single share > partition > > > but we can track the time spent to acquire any. > > > > > > 14. Corrected, thanks. > > > > > > Regards, > > > Apoorv Mittal > > > > > > > > > On Thu, Nov 7, 2024 at 10:16 PM Jun Rao <j...@confluent.io.invalid> > > wrote: > > > > > > > Hi, Apoorv, > > > > > > > > 11. I was thinking that it would be useful to know if there is any > > > tension > > > > when acquiring the share partition lock. This is not quite captured > by > > > > FetchLockTimeMs. Perhaps we could add another metric that measures > the > > > > amount of time that a request needs to wait before acquiring a lock. > > > > > > > > 14. Should we change share-partition={group-topic-partition} now that > > we > > > > have separate tags for group, topic and partition? > > > > > > > > Thanks, > > > > > > > > Jun > > > > > > > > On Thu, Nov 7, 2024 at 12:51 PM Apoorv Mittal < > > apoorvmitta...@gmail.com> > > > > wrote: > > > > > > > > > Hi Jun and Chia, > > > > > Thanks for the feedback. > > > > > > > > > > 10. Done. > > > > > > > > > > 11. I have moved FetchLockTimeMs to Histogram. I didn't understand > > > > > which ratio shall FetchLockRatio Metric should hold i.e. against > what > > > > value > > > > > we should compare the partitions locked. Though I have added Metric > > > > > RequestTopicPartitionsFetchRatio which provides a ratio of > partition > > > > locked > > > > > vs requested. Shall we have something else as well? > > > > > > > > > > 12. Yeah, true. Done. > > > > > > > > > > AS0: Make sense, I have moved to group, topic, partition tags as > like > > > > > Kafka.cluster Partition metrics. > > > > > > > > > > AS1: Successful AcknowledgementRequests are also important and they > > are > > > > > already defined in KIP-932 broker metrics. We shall change the > metric > > > > name > > > > > for successful AcknowledgementRequests in KIP-932 itself while > > > > implementing > > > > > these metrics. > > > > > > > > > > Regards, > > > > > Apoorv Mittal > > > > > > > > > > > > > > > On Thu, Nov 7, 2024 at 7:36 PM Chia-Ping Tsai <chia7...@apache.org > > > > > > wrote: > > > > > > > > > > > hi Apoorv > > > > > > > > > > > > sorry for late response: > > > > > > > > > > > > AS0: How can users distinguish between the group and topic in the > > > > format > > > > > > share-partition={group-topic-partition} since `-` is a valid > > > character > > > > in > > > > > > both group and topic? > > > > > > > > > > > > AS1: Why do we only have metrics for failed > > AcknowledgementRequests? > > > > > > Aren't successful AcknowledgementRequests also important for > > metrics? > > > > > > > > > > > > Thanks, > > > > > > Chia-Ping > > > > > > > > > > > > On 2024/11/06 13:58:08 Apoorv Mittal wrote: > > > > > > > Hi All, > > > > > > > Please let me know if there is any other feedback. Else will > > start > > > a > > > > > > voting > > > > > > > thread tomorrow. > > > > > > > > > > > > > > Regards, > > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 12:21 PM Apoorv Mittal < > > > > > apoorvmitta...@gmail.com> > > > > > > > wrote: > > > > > > > > > > > > > > > Thanks for the feedback Andrew. > > > > > > > > > > > > > > > > AS9: Yes, you are correct it's topic name and not topic id. > > > > > > > > > > > > > > > > AS10: Thanks, I have updated the description. > > > > > > > > > > > > > > > > Regards, > > > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 10:56 AM Andrew Schofield < > > > > > > > > andrew_schofield_j...@outlook.com> wrote: > > > > > > > > > > > > > > > >> Hi Apoorv, > > > > > > > >> A couple of additional comments. > > > > > > > >> > > > > > > > >> AS9: The tag share-partition : {group-topic-partition} > confirm > > > the > > > > > > > >> format. I suppose it's > > > > > > > >> one tag `group-id:topic-name:partition`, and not topic ID. > > Topic > > > > ID > > > > > is > > > > > > > >> clearly more > > > > > > > >> authoritative, but I don't think it's usable in metrics. > > > > > > > >> > > > > > > > >> AS10: The description of the AcquisitionLockTimeoutMs would > be > > > > > better > > > > > > as > > > > > > > >> "Tracks the number of acquisition locks for records which > are > > > not > > > > > > > >> acknowledged within the timeout." > > > > > > > >> > > > > > > > >> Thanks, > > > > > > > >> Andrew > > > > > > > >> > > > > > > > >> ________________________________________ > > > > > > > >> From: Apoorv Mittal <apoorvmitta...@gmail.com> > > > > > > > >> Sent: 05 November 2024 08:59 > > > > > > > >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > > > >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for > > > > cooperative > > > > > > > >> consumption > > > > > > > >> > > > > > > > >> Hi All, > > > > > > > >> Please let me know if there is any other feedback or > > > suggestions. > > > > > > > >> > > > > > > > >> Regards, > > > > > > > >> Apoorv Mittal > > > > > > > >> > > > > > > > >> > > > > > > > >> On Fri, Nov 1, 2024 at 5:30 PM Apoorv Mittal < > > > > > > apoorvmitta...@gmail.com> > > > > > > > >> wrote: > > > > > > > >> > > > > > > > >> > Thanks Jun and Andrew for reviewing and feedback. > > > > > > > >> > > > > > > > > >> > J1: I was thinking of having metrics as per the APIs in > > > > > > > >> > SharePartitionManager but I think you are right. This is > > > > something > > > > > > not > > > > > > > >> > beneficial to track as the release should only happen > when a > > > > > client > > > > > > > >> sends a > > > > > > > >> > leave group request. I have removed this metric. > > > > > > > >> > > > > > > > > >> > J2: Moved to Histogram. > > > > > > > >> > > > > > > > > >> > J3: Moved to Meter. I have removed > > > > > > RequestTopicPartitionsFetchEmptyCount > > > > > > > >> > from KIP (as mentioned by Andrew - AS8) as > > > > > > > >> RequestTopicPartitionsFetchRatio > > > > > > > >> > moved to Histogram. Also I have marked > InFlightMessageCount > > as > > > > > > Meter, so > > > > > > > >> > shall capture the rate better. > > > > > > > >> > > > > > > > > >> > AS0: Done, specified specifically in the Motivation > section. > > > > > > > >> > > > > > > > > >> > AS1: Hmmm, I thought about that earlier and was thinking > > about > > > > > > keeping > > > > > > > >> the > > > > > > > >> > abstracted share fetch APIs metrics in ShareGroupMetrics > > > itself. > > > > > > > >> However I > > > > > > > >> > think the Share Fetch and Share Acknowledge can be part of > > > > > > > >> BrokerTopicMetrics > > > > > > > >> > as these operations happen by Share Consumer. Do you think > > we > > > > > should > > > > > > > >> also > > > > > > > >> > move the RequestTopicPartitionsFetchRatio metric? I think > we > > > > > > shouldn't > > > > > > > >> and > > > > > > > >> > it should be in ShareGroupMetrics itself, please let me > know > > > > what > > > > > > you > > > > > > > >> think. > > > > > > > >> > > > > > > > > >> > AS2: Yes you are right, it can support all topic rate as > > > well. I > > > > > > have > > > > > > > >> > added the suggestion. > > > > > > > >> > > > > > > > > >> > AS3: I have removed this metric as part of J1 comment. > > > > > > > >> > > > > > > > > >> > AS4,AS6: We already have a share-acknowledgement Meter > > metric > > > > > > defined in > > > > > > > >> > KIP-932. I am of the opinion to move the KIP-932 > > > > > > share-acknowledgement > > > > > > > >> > metric to TotalShareAcknowledgementRequestsPerSec. Hence I > > > have > > > > > > added > > > > > > > >> > corresponding FailedShareAcknowledgementRequestsPerSec in > > this > > > > > KIP. > > > > > > > >> > > > > > > > > >> > AS5: It counts the failure of a share acknowledge request. > > > > Similar > > > > > > to > > > > > > > >> > multi topic partition fetch, acknowledge can also be multi > > > topic > > > > > > > >> partition > > > > > > > >> > but a single acknowledge request might either fail > > completely > > > or > > > > > at > > > > > > > >> topic > > > > > > > >> > partition level. Hence this metric tracks the failure as > > > similar > > > > > to > > > > > > > >> > FailedFetchRequestsPerSec or > FailedShareFetchRequestsPerSec. > > > > > > > >> > > > > > > > > >> > AS7: I have added some examples and another metric > > > > > FetchLockTimeMs, > > > > > > > >> which > > > > > > > >> > should help in optimizing behaviour. > > > > > > > >> > > > > > > > > >> > AS8: Yes, we might be able to work with low percentiles to > > > > > > investigate > > > > > > > >> > further hence I have removed > > > > > RequestTopicPartitionsFetchEmptyCount. > > > > > > > >> > However Histogram will give us percentiles but not the raw > > > > > bucketed > > > > > > > >> > value, in case we think we need such data explicitly then > > > might > > > > > > consider > > > > > > > >> > adding it but for now it's removed. > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > Regards, > > > > > > > >> > Apoorv Mittal > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > On Thu, Oct 31, 2024 at 1:03 PM Andrew Schofield < > > > > > > > >> > andrew_schofield_j...@outlook.com> wrote: > > > > > > > >> > > > > > > > > >> >> Hi Apoorv, > > > > > > > >> >> Thanks for the KIP. I have some comments. > > > > > > > >> >> > > > > > > > >> >> AS0: These are broker metrics and not client metrics. I > > don't > > > > > > think the > > > > > > > >> >> KIP states that and it probably should. > > > > > > > >> >> > > > > > > > >> >> AS1: TotalShareFetchRequestsPerSec is inspired by the > > > existing > > > > > > metric > > > > > > > >> >> TotalFetchRequestsPerSec. That one is in the > > > BrokerTopicMetrics > > > > > > group > > > > > > > >> >> so I would expect the new metric to be in the same group. > > > This > > > > > > comment > > > > > > > >> >> applies to all of the ShareGroupMetrics broker metrics > > scoped > > > > to > > > > > a > > > > > > > >> topic. > > > > > > > >> >> > > > > > > > >> >> AS2: Similarly to TotalFetchRequestPerSec, does the new > > > metric > > > > > > support > > > > > > > >> >> omitting the topic tag in order to obtain the all-topic > > > rate? I > > > > > > > >> believe so > > > > > > > >> >> because it follows the existing metrics. I guess the tag > is > > > > > > actually > > > > > > > >> >> `topic=([-.\w]+)` to copy the existing notation. > > > > > > > >> >> > > > > > > > >> >> AS3: What's a share release? Does it count releases which > > are > > > > > made > > > > > > > >> >> by ShareFetch/Acknowledge requests as well as implicit > ones > > > as > > > > a > > > > > > > >> >> result of share session expiration or closure? > > > > > > > >> >> > > > > > > > >> >> AS4: I see there is > > FailedShareAcknowledgementRequestsPerSec. > > > > > > > >> >> If this corresponds specifically to the ShareAcknowledge > > API, > > > > > then > > > > > > > >> >> the name should be FailedShareAcknowledgeRequestsPerSec. > > > > > > > >> >> > > > > > > > >> >> AS5: What does FailedShareAcknowledgementRequestsPerSec > > > > > > > >> >> actually count? This is a per-topic metric and a > > > > ShareAcknowledge > > > > > > > >> >> can mention several topics, but actually it can also > > operate > > > on > > > > > > many > > > > > > > >> >> acquired records specifying any of the different > > > > acknowledgement > > > > > > > >> >> types that KIP-932 defines. > > > > > > > >> >> > > > > > > > >> >> AS6: Why no TotalShareAcknowledgementRequestsPerSec? > > > > > > > >> >> > > > > > > > >> >> AS7: I think I like the concept behind > > > > > > > >> RequestTopicPartitionsFetchRatio, > > > > > > > >> >> but it would really help to include some examples of how > it > > > > > works. > > > > > > You > > > > > > > >> >> mention 4 potential causes for low ratios, so it would be > > > nice > > > > to > > > > > > see > > > > > > > >> how > > > > > > > >> >> each of them would be evident from the metrics and > > > > > configurations. > > > > > > > >> >> This is going to be valuable information for operating > this > > > in > > > > > > > >> production. > > > > > > > >> >> > > > > > > > >> >> AS8: As Jun mentioned, maybe a histogram would be > > appropriate > > > > > > > >> >> for RequestTopicPartitionsFetchRatio. Then I wonder > whether > > > > it's > > > > > > > >> >> worth having a separate > > RequestTopicPartitionsFetchEmptyCount > > > > > since > > > > > > > >> >> those are requests for which the fetch ratio was 0. > > > > > > > >> >> > > > > > > > >> >> Thanks, > > > > > > > >> >> Andrew > > > > > > > >> >> ________________________________________ > > > > > > > >> >> From: Jun Rao <j...@confluent.io.INVALID> > > > > > > > >> >> Sent: 30 October 2024 23:07 > > > > > > > >> >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > > > >> >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for > > > > > cooperative > > > > > > > >> >> consumption > > > > > > > >> >> > > > > > > > >> >> Hi, Apoorv, > > > > > > > >> >> > > > > > > > >> >> Thanks for the KIP. A few comments. > > > > > > > >> >> > > > > > > > >> >> 1. TotalShareReleaseRequestsPerSec: This is a bit weird > > since > > > > > > there is > > > > > > > >> >> no ShareReleaseRequest. > > > > > > > >> >> > > > > > > > >> >> 2. RequestTopicPartitionsFetchRatio and > > > > > InFlightBatchMessageCount: > > > > > > It > > > > > > > >> >> seems > > > > > > > >> >> Histogram is more appropriate for them. > > > > > > > >> >> > > > > > > > >> >> 3. RequestTopicPartitionsFetchEmptyCount and > > > > > > > >> AcquisitionLockTimeoutCount: > > > > > > > >> >> The problem with using gauge is that if the value changes > > > > > quickly, > > > > > > we > > > > > > > >> may > > > > > > > >> >> not be able to capture the occurrence of empty fetch or > > lock > > > > > > timeout. > > > > > > > >> >> Meter > > > > > > > >> >> captures those rare events better. > > > > > > > >> >> > > > > > > > >> >> Jun > > > > > > > >> >> > > > > > > > >> >> > > > > > > > >> >> On Tue, Oct 29, 2024 at 8:48 AM Apoorv Mittal < > > > > > > > >> apoorvmitta...@gmail.com> > > > > > > > >> >> wrote: > > > > > > > >> >> > > > > > > > >> >> > Hi Everyone, > > > > > > > >> >> > I would like to start a discussion on KIP-1103: > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1103%3A+Additional+metrics+for+cooperative+consumption > > > > > > > >> >> > > > > > > > > >> >> > This KIP extends KIP-932 to provide additional metrics > > for > > > > > > > >> >> > Queues/Cooperative consumption. > > > > > > > >> >> > > > > > > > > >> >> > Regards, > > > > > > > >> >> > Apoorv Mittal > > > > > > > >> >> > > > > > > > > >> >> > > > > > > > >> > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >