Hi, Apoorv, Thanks for the updated KIP. Just one minor comment.
Why is FetchLockRatio a Histogram? Currently, all ratios are measured as Meter. Jun On Thu, Nov 7, 2024 at 4:05 PM Apoorv Mittal <apoorvmitta...@gmail.com> wrote: > Hi Jun, > > 11. Got it, agree. I was thinking either to have FetchLockRatio or > FetchLockIdleRatio, though both of them should convey the same > information. I have added FetchLockRatio as per your suggestion. > Also I have added TopicPartitionsAcquireTimeMs which can track the time > spent by any share fetch request to acquire any share partition. It's hard > to track the amount of time spent on waiting for a single share partition > but we can track the time spent to acquire any. > > 14. Corrected, thanks. > > Regards, > Apoorv Mittal > > > On Thu, Nov 7, 2024 at 10:16 PM Jun Rao <j...@confluent.io.invalid> wrote: > > > Hi, Apoorv, > > > > 11. I was thinking that it would be useful to know if there is any > tension > > when acquiring the share partition lock. This is not quite captured by > > FetchLockTimeMs. Perhaps we could add another metric that measures the > > amount of time that a request needs to wait before acquiring a lock. > > > > 14. Should we change share-partition={group-topic-partition} now that we > > have separate tags for group, topic and partition? > > > > Thanks, > > > > Jun > > > > On Thu, Nov 7, 2024 at 12:51 PM Apoorv Mittal <apoorvmitta...@gmail.com> > > wrote: > > > > > Hi Jun and Chia, > > > Thanks for the feedback. > > > > > > 10. Done. > > > > > > 11. I have moved FetchLockTimeMs to Histogram. I didn't understand > > > which ratio shall FetchLockRatio Metric should hold i.e. against what > > value > > > we should compare the partitions locked. Though I have added Metric > > > RequestTopicPartitionsFetchRatio which provides a ratio of partition > > locked > > > vs requested. Shall we have something else as well? > > > > > > 12. Yeah, true. Done. > > > > > > AS0: Make sense, I have moved to group, topic, partition tags as like > > > Kafka.cluster Partition metrics. > > > > > > AS1: Successful AcknowledgementRequests are also important and they are > > > already defined in KIP-932 broker metrics. We shall change the metric > > name > > > for successful AcknowledgementRequests in KIP-932 itself while > > implementing > > > these metrics. > > > > > > Regards, > > > Apoorv Mittal > > > > > > > > > On Thu, Nov 7, 2024 at 7:36 PM Chia-Ping Tsai <chia7...@apache.org> > > wrote: > > > > > > > hi Apoorv > > > > > > > > sorry for late response: > > > > > > > > AS0: How can users distinguish between the group and topic in the > > format > > > > share-partition={group-topic-partition} since `-` is a valid > character > > in > > > > both group and topic? > > > > > > > > AS1: Why do we only have metrics for failed AcknowledgementRequests? > > > > Aren't successful AcknowledgementRequests also important for metrics? > > > > > > > > Thanks, > > > > Chia-Ping > > > > > > > > On 2024/11/06 13:58:08 Apoorv Mittal wrote: > > > > > Hi All, > > > > > Please let me know if there is any other feedback. Else will start > a > > > > voting > > > > > thread tomorrow. > > > > > > > > > > Regards, > > > > > Apoorv Mittal > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 12:21 PM Apoorv Mittal < > > > apoorvmitta...@gmail.com> > > > > > wrote: > > > > > > > > > > > Thanks for the feedback Andrew. > > > > > > > > > > > > AS9: Yes, you are correct it's topic name and not topic id. > > > > > > > > > > > > AS10: Thanks, I have updated the description. > > > > > > > > > > > > Regards, > > > > > > Apoorv Mittal > > > > > > > > > > > > > > > > > > On Tue, Nov 5, 2024 at 10:56 AM Andrew Schofield < > > > > > > andrew_schofield_j...@outlook.com> wrote: > > > > > > > > > > > >> Hi Apoorv, > > > > > >> A couple of additional comments. > > > > > >> > > > > > >> AS9: The tag share-partition : {group-topic-partition} confirm > the > > > > > >> format. I suppose it's > > > > > >> one tag `group-id:topic-name:partition`, and not topic ID. Topic > > ID > > > is > > > > > >> clearly more > > > > > >> authoritative, but I don't think it's usable in metrics. > > > > > >> > > > > > >> AS10: The description of the AcquisitionLockTimeoutMs would be > > > better > > > > as > > > > > >> "Tracks the number of acquisition locks for records which are > not > > > > > >> acknowledged within the timeout." > > > > > >> > > > > > >> Thanks, > > > > > >> Andrew > > > > > >> > > > > > >> ________________________________________ > > > > > >> From: Apoorv Mittal <apoorvmitta...@gmail.com> > > > > > >> Sent: 05 November 2024 08:59 > > > > > >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for > > cooperative > > > > > >> consumption > > > > > >> > > > > > >> Hi All, > > > > > >> Please let me know if there is any other feedback or > suggestions. > > > > > >> > > > > > >> Regards, > > > > > >> Apoorv Mittal > > > > > >> > > > > > >> > > > > > >> On Fri, Nov 1, 2024 at 5:30 PM Apoorv Mittal < > > > > apoorvmitta...@gmail.com> > > > > > >> wrote: > > > > > >> > > > > > >> > Thanks Jun and Andrew for reviewing and feedback. > > > > > >> > > > > > > >> > J1: I was thinking of having metrics as per the APIs in > > > > > >> > SharePartitionManager but I think you are right. This is > > something > > > > not > > > > > >> > beneficial to track as the release should only happen when a > > > client > > > > > >> sends a > > > > > >> > leave group request. I have removed this metric. > > > > > >> > > > > > > >> > J2: Moved to Histogram. > > > > > >> > > > > > > >> > J3: Moved to Meter. I have removed > > > > RequestTopicPartitionsFetchEmptyCount > > > > > >> > from KIP (as mentioned by Andrew - AS8) as > > > > > >> RequestTopicPartitionsFetchRatio > > > > > >> > moved to Histogram. Also I have marked InFlightMessageCount as > > > > Meter, so > > > > > >> > shall capture the rate better. > > > > > >> > > > > > > >> > AS0: Done, specified specifically in the Motivation section. > > > > > >> > > > > > > >> > AS1: Hmmm, I thought about that earlier and was thinking about > > > > keeping > > > > > >> the > > > > > >> > abstracted share fetch APIs metrics in ShareGroupMetrics > itself. > > > > > >> However I > > > > > >> > think the Share Fetch and Share Acknowledge can be part of > > > > > >> BrokerTopicMetrics > > > > > >> > as these operations happen by Share Consumer. Do you think we > > > should > > > > > >> also > > > > > >> > move the RequestTopicPartitionsFetchRatio metric? I think we > > > > shouldn't > > > > > >> and > > > > > >> > it should be in ShareGroupMetrics itself, please let me know > > what > > > > you > > > > > >> think. > > > > > >> > > > > > > >> > AS2: Yes you are right, it can support all topic rate as > well. I > > > > have > > > > > >> > added the suggestion. > > > > > >> > > > > > > >> > AS3: I have removed this metric as part of J1 comment. > > > > > >> > > > > > > >> > AS4,AS6: We already have a share-acknowledgement Meter metric > > > > defined in > > > > > >> > KIP-932. I am of the opinion to move the KIP-932 > > > > share-acknowledgement > > > > > >> > metric to TotalShareAcknowledgementRequestsPerSec. Hence I > have > > > > added > > > > > >> > corresponding FailedShareAcknowledgementRequestsPerSec in this > > > KIP. > > > > > >> > > > > > > >> > AS5: It counts the failure of a share acknowledge request. > > Similar > > > > to > > > > > >> > multi topic partition fetch, acknowledge can also be multi > topic > > > > > >> partition > > > > > >> > but a single acknowledge request might either fail completely > or > > > at > > > > > >> topic > > > > > >> > partition level. Hence this metric tracks the failure as > similar > > > to > > > > > >> > FailedFetchRequestsPerSec or FailedShareFetchRequestsPerSec. > > > > > >> > > > > > > >> > AS7: I have added some examples and another metric > > > FetchLockTimeMs, > > > > > >> which > > > > > >> > should help in optimizing behaviour. > > > > > >> > > > > > > >> > AS8: Yes, we might be able to work with low percentiles to > > > > investigate > > > > > >> > further hence I have removed > > > RequestTopicPartitionsFetchEmptyCount. > > > > > >> > However Histogram will give us percentiles but not the raw > > > bucketed > > > > > >> > value, in case we think we need such data explicitly then > might > > > > consider > > > > > >> > adding it but for now it's removed. > > > > > >> > > > > > > >> > > > > > > >> > Regards, > > > > > >> > Apoorv Mittal > > > > > >> > > > > > > >> > > > > > > >> > On Thu, Oct 31, 2024 at 1:03 PM Andrew Schofield < > > > > > >> > andrew_schofield_j...@outlook.com> wrote: > > > > > >> > > > > > > >> >> Hi Apoorv, > > > > > >> >> Thanks for the KIP. I have some comments. > > > > > >> >> > > > > > >> >> AS0: These are broker metrics and not client metrics. I don't > > > > think the > > > > > >> >> KIP states that and it probably should. > > > > > >> >> > > > > > >> >> AS1: TotalShareFetchRequestsPerSec is inspired by the > existing > > > > metric > > > > > >> >> TotalFetchRequestsPerSec. That one is in the > BrokerTopicMetrics > > > > group > > > > > >> >> so I would expect the new metric to be in the same group. > This > > > > comment > > > > > >> >> applies to all of the ShareGroupMetrics broker metrics scoped > > to > > > a > > > > > >> topic. > > > > > >> >> > > > > > >> >> AS2: Similarly to TotalFetchRequestPerSec, does the new > metric > > > > support > > > > > >> >> omitting the topic tag in order to obtain the all-topic > rate? I > > > > > >> believe so > > > > > >> >> because it follows the existing metrics. I guess the tag is > > > > actually > > > > > >> >> `topic=([-.\w]+)` to copy the existing notation. > > > > > >> >> > > > > > >> >> AS3: What's a share release? Does it count releases which are > > > made > > > > > >> >> by ShareFetch/Acknowledge requests as well as implicit ones > as > > a > > > > > >> >> result of share session expiration or closure? > > > > > >> >> > > > > > >> >> AS4: I see there is FailedShareAcknowledgementRequestsPerSec. > > > > > >> >> If this corresponds specifically to the ShareAcknowledge API, > > > then > > > > > >> >> the name should be FailedShareAcknowledgeRequestsPerSec. > > > > > >> >> > > > > > >> >> AS5: What does FailedShareAcknowledgementRequestsPerSec > > > > > >> >> actually count? This is a per-topic metric and a > > ShareAcknowledge > > > > > >> >> can mention several topics, but actually it can also operate > on > > > > many > > > > > >> >> acquired records specifying any of the different > > acknowledgement > > > > > >> >> types that KIP-932 defines. > > > > > >> >> > > > > > >> >> AS6: Why no TotalShareAcknowledgementRequestsPerSec? > > > > > >> >> > > > > > >> >> AS7: I think I like the concept behind > > > > > >> RequestTopicPartitionsFetchRatio, > > > > > >> >> but it would really help to include some examples of how it > > > works. > > > > You > > > > > >> >> mention 4 potential causes for low ratios, so it would be > nice > > to > > > > see > > > > > >> how > > > > > >> >> each of them would be evident from the metrics and > > > configurations. > > > > > >> >> This is going to be valuable information for operating this > in > > > > > >> production. > > > > > >> >> > > > > > >> >> AS8: As Jun mentioned, maybe a histogram would be appropriate > > > > > >> >> for RequestTopicPartitionsFetchRatio. Then I wonder whether > > it's > > > > > >> >> worth having a separate RequestTopicPartitionsFetchEmptyCount > > > since > > > > > >> >> those are requests for which the fetch ratio was 0. > > > > > >> >> > > > > > >> >> Thanks, > > > > > >> >> Andrew > > > > > >> >> ________________________________________ > > > > > >> >> From: Jun Rao <j...@confluent.io.INVALID> > > > > > >> >> Sent: 30 October 2024 23:07 > > > > > >> >> To: dev@kafka.apache.org <dev@kafka.apache.org> > > > > > >> >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for > > > cooperative > > > > > >> >> consumption > > > > > >> >> > > > > > >> >> Hi, Apoorv, > > > > > >> >> > > > > > >> >> Thanks for the KIP. A few comments. > > > > > >> >> > > > > > >> >> 1. TotalShareReleaseRequestsPerSec: This is a bit weird since > > > > there is > > > > > >> >> no ShareReleaseRequest. > > > > > >> >> > > > > > >> >> 2. RequestTopicPartitionsFetchRatio and > > > InFlightBatchMessageCount: > > > > It > > > > > >> >> seems > > > > > >> >> Histogram is more appropriate for them. > > > > > >> >> > > > > > >> >> 3. RequestTopicPartitionsFetchEmptyCount and > > > > > >> AcquisitionLockTimeoutCount: > > > > > >> >> The problem with using gauge is that if the value changes > > > quickly, > > > > we > > > > > >> may > > > > > >> >> not be able to capture the occurrence of empty fetch or lock > > > > timeout. > > > > > >> >> Meter > > > > > >> >> captures those rare events better. > > > > > >> >> > > > > > >> >> Jun > > > > > >> >> > > > > > >> >> > > > > > >> >> On Tue, Oct 29, 2024 at 8:48 AM Apoorv Mittal < > > > > > >> apoorvmitta...@gmail.com> > > > > > >> >> wrote: > > > > > >> >> > > > > > >> >> > Hi Everyone, > > > > > >> >> > I would like to start a discussion on KIP-1103: > > > > > >> >> > > > > > > >> >> > > > > > > >> >> > > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1103%3A+Additional+metrics+for+cooperative+consumption > > > > > >> >> > > > > > > >> >> > This KIP extends KIP-932 to provide additional metrics for > > > > > >> >> > Queues/Cooperative consumption. > > > > > >> >> > > > > > > >> >> > Regards, > > > > > >> >> > Apoorv Mittal > > > > > >> >> > > > > > > >> >> > > > > > >> > > > > > > >> > > > > > > > > > > > > > > > > > > > > >