Thanks for the update, Apoorv.

The KIP looks to me.

Jun

On Wed, Nov 13, 2024 at 11:36 AM Apoorv Mittal <apoorvmitta...@gmail.com>
wrote:

> Sure Jun,
> My thought process was regarding if there exists a subset of fetch requests
> which takes considerably higher fetch time than others, then Meter will not
> reflect that but Histogram can. However I agree on the aggregation part and
> moved the metric to Meter.
>
> Regards,
> Apoorv Mittal
>
>
> On Wed, Nov 13, 2024 at 7:23 PM Jun Rao <j...@confluent.io.invalid> wrote:
>
> > Hi, Apoorv,
> >
> > There is some difference between the two. For
> > RequestTopicPartitionsFetchRatio, we measure a ratio per request and it
> > makes sense to aggregate them as a histogram at the group level.
> > For FetchLockRatio, if a request locks a partition for a certain amount
> of
> > time, its impact is directly reflected in FetchLockRatio. So, in some
> > sense, FetchLockRatio is already the result of aggregation across
> different
> > requests and there is no need for further aggregation.
> >
> > Thanks,
> >
> > Jun
> >
> > On Wed, Nov 13, 2024 at 10:41 AM Apoorv Mittal <apoorvmitta...@gmail.com
> >
> > wrote:
> >
> > > HI Jun,
> > > I think it would be worth to see the distribution as well for
> > > FetchLockRatio as we have the distribution captured for
> > > RequestTopicPartitionsFetchRatio. The latter has been already marked as
> > > Histogram, per your earlier suggestion.
> > >
> > > Regards,
> > > Apoorv Mittal
> > >
> > >
> > > On Wed, Nov 13, 2024 at 5:53 PM Jun Rao <j...@confluent.io.invalid>
> > wrote:
> > >
> > > > Hi, Apoorv,
> > > >
> > > > Thanks for the updated KIP. Just one minor comment.
> > > >
> > > > Why is FetchLockRatio a Histogram? Currently, all ratios are measured
> > as
> > > > Meter.
> > > >
> > > > Jun
> > > >
> > > > On Thu, Nov 7, 2024 at 4:05 PM Apoorv Mittal <
> apoorvmitta...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > 11. Got it, agree. I was thinking either to have FetchLockRatio or
> > > > > FetchLockIdleRatio, though both of them should convey the same
> > > > > information. I have added FetchLockRatio as per your suggestion.
> > > > > Also I have added TopicPartitionsAcquireTimeMs which can track the
> > time
> > > > > spent by any share fetch request to acquire any share partition.
> It's
> > > > hard
> > > > > to track the amount of time spent on waiting for a single share
> > > partition
> > > > > but we can track the time spent to acquire any.
> > > > >
> > > > > 14. Corrected, thanks.
> > > > >
> > > > > Regards,
> > > > > Apoorv Mittal
> > > > >
> > > > >
> > > > > On Thu, Nov 7, 2024 at 10:16 PM Jun Rao <j...@confluent.io.invalid>
> > > > wrote:
> > > > >
> > > > > > Hi, Apoorv,
> > > > > >
> > > > > > 11. I was thinking that it would be useful to know if there is
> any
> > > > > tension
> > > > > > when acquiring the share partition lock. This is not quite
> captured
> > > by
> > > > > > FetchLockTimeMs. Perhaps we could add another metric that
> measures
> > > the
> > > > > > amount of time that a request needs to wait before acquiring a
> > lock.
> > > > > >
> > > > > > 14. Should we change share-partition={group-topic-partition} now
> > that
> > > > we
> > > > > > have separate tags for group, topic and partition?
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > > On Thu, Nov 7, 2024 at 12:51 PM Apoorv Mittal <
> > > > apoorvmitta...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Jun and Chia,
> > > > > > > Thanks for the feedback.
> > > > > > >
> > > > > > > 10. Done.
> > > > > > >
> > > > > > > 11. I have moved FetchLockTimeMs to Histogram. I didn't
> > understand
> > > > > > > which ratio shall FetchLockRatio Metric should hold i.e.
> against
> > > what
> > > > > > value
> > > > > > > we should compare the partitions locked. Though I have added
> > Metric
> > > > > > > RequestTopicPartitionsFetchRatio which provides a ratio of
> > > partition
> > > > > > locked
> > > > > > > vs requested. Shall we have something else as well?
> > > > > > >
> > > > > > > 12. Yeah, true. Done.
> > > > > > >
> > > > > > > AS0: Make sense, I have moved to group, topic, partition tags
> as
> > > like
> > > > > > > Kafka.cluster Partition metrics.
> > > > > > >
> > > > > > > AS1: Successful AcknowledgementRequests are also important and
> > they
> > > > are
> > > > > > > already defined in KIP-932 broker metrics. We shall change the
> > > metric
> > > > > > name
> > > > > > > for successful AcknowledgementRequests in KIP-932 itself while
> > > > > > implementing
> > > > > > > these metrics.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Apoorv Mittal
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Nov 7, 2024 at 7:36 PM Chia-Ping Tsai <
> > chia7...@apache.org
> > > >
> > > > > > wrote:
> > > > > > >
> > > > > > > > hi Apoorv
> > > > > > > >
> > > > > > > > sorry for late response:
> > > > > > > >
> > > > > > > > AS0: How can users distinguish between the group and topic in
> > the
> > > > > > format
> > > > > > > > share-partition={group-topic-partition} since `-` is a valid
> > > > > character
> > > > > > in
> > > > > > > > both group and topic?
> > > > > > > >
> > > > > > > > AS1: Why do we only have metrics for failed
> > > > AcknowledgementRequests?
> > > > > > > > Aren't successful AcknowledgementRequests also important for
> > > > metrics?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Chia-Ping
> > > > > > > >
> > > > > > > > On 2024/11/06 13:58:08 Apoorv Mittal wrote:
> > > > > > > > > Hi All,
> > > > > > > > > Please let me know if there is any other feedback. Else
> will
> > > > start
> > > > > a
> > > > > > > > voting
> > > > > > > > > thread tomorrow.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Apoorv Mittal
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Tue, Nov 5, 2024 at 12:21 PM Apoorv Mittal <
> > > > > > > apoorvmitta...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback Andrew.
> > > > > > > > > >
> > > > > > > > > > AS9: Yes, you are correct it's topic name and not topic
> id.
> > > > > > > > > >
> > > > > > > > > > AS10: Thanks, I have updated the description.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Apoorv Mittal
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Tue, Nov 5, 2024 at 10:56 AM Andrew Schofield <
> > > > > > > > > > andrew_schofield_j...@outlook.com> wrote:
> > > > > > > > > >
> > > > > > > > > >> Hi Apoorv,
> > > > > > > > > >> A couple of additional comments.
> > > > > > > > > >>
> > > > > > > > > >> AS9: The tag share-partition : {group-topic-partition}
> > > confirm
> > > > > the
> > > > > > > > > >> format. I suppose it's
> > > > > > > > > >> one tag `group-id:topic-name:partition`, and not topic
> ID.
> > > > Topic
> > > > > > ID
> > > > > > > is
> > > > > > > > > >> clearly more
> > > > > > > > > >> authoritative, but I don't think it's usable in metrics.
> > > > > > > > > >>
> > > > > > > > > >> AS10: The description of the AcquisitionLockTimeoutMs
> > would
> > > be
> > > > > > > better
> > > > > > > > as
> > > > > > > > > >> "Tracks the number of acquisition locks for records
> which
> > > are
> > > > > not
> > > > > > > > > >> acknowledged within the timeout."
> > > > > > > > > >>
> > > > > > > > > >> Thanks,
> > > > > > > > > >> Andrew
> > > > > > > > > >>
> > > > > > > > > >> ________________________________________
> > > > > > > > > >> From: Apoorv Mittal <apoorvmitta...@gmail.com>
> > > > > > > > > >> Sent: 05 November 2024 08:59
> > > > > > > > > >> To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > > > > >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics for
> > > > > > cooperative
> > > > > > > > > >> consumption
> > > > > > > > > >>
> > > > > > > > > >> Hi All,
> > > > > > > > > >> Please let me know if there is any other feedback or
> > > > > suggestions.
> > > > > > > > > >>
> > > > > > > > > >> Regards,
> > > > > > > > > >> Apoorv Mittal
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> On Fri, Nov 1, 2024 at 5:30 PM Apoorv Mittal <
> > > > > > > > apoorvmitta...@gmail.com>
> > > > > > > > > >> wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Thanks Jun and Andrew for reviewing and feedback.
> > > > > > > > > >> >
> > > > > > > > > >> > J1: I was thinking of having metrics as per the APIs
> in
> > > > > > > > > >> > SharePartitionManager but I think you are right. This
> is
> > > > > > something
> > > > > > > > not
> > > > > > > > > >> > beneficial to track as the release should only happen
> > > when a
> > > > > > > client
> > > > > > > > > >> sends a
> > > > > > > > > >> > leave group request. I have removed this metric.
> > > > > > > > > >> >
> > > > > > > > > >> > J2: Moved to Histogram.
> > > > > > > > > >> >
> > > > > > > > > >> > J3: Moved to Meter. I have removed
> > > > > > > > RequestTopicPartitionsFetchEmptyCount
> > > > > > > > > >> > from KIP (as mentioned by Andrew - AS8) as
> > > > > > > > > >> RequestTopicPartitionsFetchRatio
> > > > > > > > > >> > moved to Histogram. Also I have marked
> > > InFlightMessageCount
> > > > as
> > > > > > > > Meter, so
> > > > > > > > > >> > shall capture the rate better.
> > > > > > > > > >> >
> > > > > > > > > >> > AS0: Done, specified specifically in the Motivation
> > > section.
> > > > > > > > > >> >
> > > > > > > > > >> > AS1: Hmmm, I thought about that earlier and was
> thinking
> > > > about
> > > > > > > > keeping
> > > > > > > > > >> the
> > > > > > > > > >> > abstracted share fetch APIs metrics in
> ShareGroupMetrics
> > > > > itself.
> > > > > > > > > >> However I
> > > > > > > > > >> > think the Share Fetch and Share Acknowledge can be
> part
> > of
> > > > > > > > > >> BrokerTopicMetrics
> > > > > > > > > >> > as these operations happen by Share Consumer. Do you
> > think
> > > > we
> > > > > > > should
> > > > > > > > > >> also
> > > > > > > > > >> > move the RequestTopicPartitionsFetchRatio metric? I
> > think
> > > we
> > > > > > > > shouldn't
> > > > > > > > > >> and
> > > > > > > > > >> > it should be in ShareGroupMetrics itself, please let
> me
> > > know
> > > > > > what
> > > > > > > > you
> > > > > > > > > >> think.
> > > > > > > > > >> >
> > > > > > > > > >> > AS2: Yes you are right, it can support all topic rate
> as
> > > > > well. I
> > > > > > > > have
> > > > > > > > > >> > added the suggestion.
> > > > > > > > > >> >
> > > > > > > > > >> > AS3: I have removed this metric as part of J1 comment.
> > > > > > > > > >> >
> > > > > > > > > >> > AS4,AS6: We already have a share-acknowledgement Meter
> > > > metric
> > > > > > > > defined in
> > > > > > > > > >> > KIP-932. I am of the opinion to move the KIP-932
> > > > > > > > share-acknowledgement
> > > > > > > > > >> > metric to TotalShareAcknowledgementRequestsPerSec.
> > Hence I
> > > > > have
> > > > > > > > added
> > > > > > > > > >> > corresponding FailedShareAcknowledgementRequestsPerSec
> > in
> > > > this
> > > > > > > KIP.
> > > > > > > > > >> >
> > > > > > > > > >> > AS5: It counts the failure of a share acknowledge
> > request.
> > > > > > Similar
> > > > > > > > to
> > > > > > > > > >> > multi topic partition fetch, acknowledge can also be
> > multi
> > > > > topic
> > > > > > > > > >> partition
> > > > > > > > > >> > but a single acknowledge request might either fail
> > > > completely
> > > > > or
> > > > > > > at
> > > > > > > > > >> topic
> > > > > > > > > >> > partition level. Hence this metric tracks the failure
> as
> > > > > similar
> > > > > > > to
> > > > > > > > > >> > FailedFetchRequestsPerSec or
> > > FailedShareFetchRequestsPerSec.
> > > > > > > > > >> >
> > > > > > > > > >> > AS7: I have added some examples and another metric
> > > > > > > FetchLockTimeMs,
> > > > > > > > > >> which
> > > > > > > > > >> > should help in optimizing behaviour.
> > > > > > > > > >> >
> > > > > > > > > >> > AS8: Yes, we might be able to work with low
> percentiles
> > to
> > > > > > > > investigate
> > > > > > > > > >> > further hence I have removed
> > > > > > > RequestTopicPartitionsFetchEmptyCount.
> > > > > > > > > >> > However Histogram will give us percentiles but not the
> > raw
> > > > > > > bucketed
> > > > > > > > > >> > value, in case we think we need such data explicitly
> > then
> > > > > might
> > > > > > > > consider
> > > > > > > > > >> > adding it but for now it's removed.
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > Regards,
> > > > > > > > > >> > Apoorv Mittal
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > On Thu, Oct 31, 2024 at 1:03 PM Andrew Schofield <
> > > > > > > > > >> > andrew_schofield_j...@outlook.com> wrote:
> > > > > > > > > >> >
> > > > > > > > > >> >> Hi Apoorv,
> > > > > > > > > >> >> Thanks for the KIP. I have some comments.
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS0: These are broker metrics and not client
> metrics. I
> > > > don't
> > > > > > > > think the
> > > > > > > > > >> >> KIP states that and it probably should.
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS1: TotalShareFetchRequestsPerSec is inspired by the
> > > > > existing
> > > > > > > > metric
> > > > > > > > > >> >> TotalFetchRequestsPerSec. That one is in the
> > > > > BrokerTopicMetrics
> > > > > > > > group
> > > > > > > > > >> >> so I would expect the new metric to be in the same
> > group.
> > > > > This
> > > > > > > > comment
> > > > > > > > > >> >> applies to all of the ShareGroupMetrics broker
> metrics
> > > > scoped
> > > > > > to
> > > > > > > a
> > > > > > > > > >> topic.
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS2: Similarly to TotalFetchRequestPerSec, does the
> new
> > > > > metric
> > > > > > > > support
> > > > > > > > > >> >> omitting the topic tag in order to obtain the
> all-topic
> > > > > rate? I
> > > > > > > > > >> believe so
> > > > > > > > > >> >> because it follows the existing metrics. I guess the
> > tag
> > > is
> > > > > > > > actually
> > > > > > > > > >> >> `topic=([-.\w]+)` to copy the existing notation.
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS3: What's a share release? Does it count releases
> > which
> > > > are
> > > > > > > made
> > > > > > > > > >> >> by ShareFetch/Acknowledge requests as well as
> implicit
> > > ones
> > > > > as
> > > > > > a
> > > > > > > > > >> >> result of share session expiration or closure?
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS4: I see there is
> > > > FailedShareAcknowledgementRequestsPerSec.
> > > > > > > > > >> >> If this corresponds specifically to the
> > ShareAcknowledge
> > > > API,
> > > > > > > then
> > > > > > > > > >> >> the name should be
> > FailedShareAcknowledgeRequestsPerSec.
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS5: What does
> FailedShareAcknowledgementRequestsPerSec
> > > > > > > > > >> >> actually count? This is a per-topic metric and a
> > > > > > ShareAcknowledge
> > > > > > > > > >> >> can mention several topics, but actually it can also
> > > > operate
> > > > > on
> > > > > > > > many
> > > > > > > > > >> >> acquired records specifying any of the different
> > > > > > acknowledgement
> > > > > > > > > >> >> types that KIP-932 defines.
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS6: Why no TotalShareAcknowledgementRequestsPerSec?
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS7: I think I like the concept behind
> > > > > > > > > >> RequestTopicPartitionsFetchRatio,
> > > > > > > > > >> >> but it would really help to include some examples of
> > how
> > > it
> > > > > > > works.
> > > > > > > > You
> > > > > > > > > >> >> mention 4 potential causes for low ratios, so it
> would
> > be
> > > > > nice
> > > > > > to
> > > > > > > > see
> > > > > > > > > >> how
> > > > > > > > > >> >> each of them would be evident from the metrics and
> > > > > > > configurations.
> > > > > > > > > >> >> This is going to be valuable information for
> operating
> > > this
> > > > > in
> > > > > > > > > >> production.
> > > > > > > > > >> >>
> > > > > > > > > >> >> AS8: As Jun mentioned, maybe a histogram would be
> > > > appropriate
> > > > > > > > > >> >> for RequestTopicPartitionsFetchRatio. Then I wonder
> > > whether
> > > > > > it's
> > > > > > > > > >> >> worth having a separate
> > > > RequestTopicPartitionsFetchEmptyCount
> > > > > > > since
> > > > > > > > > >> >> those are requests for which the fetch ratio was 0.
> > > > > > > > > >> >>
> > > > > > > > > >> >> Thanks,
> > > > > > > > > >> >> Andrew
> > > > > > > > > >> >> ________________________________________
> > > > > > > > > >> >> From: Jun Rao <j...@confluent.io.INVALID>
> > > > > > > > > >> >> Sent: 30 October 2024 23:07
> > > > > > > > > >> >> To: dev@kafka.apache.org <dev@kafka.apache.org>
> > > > > > > > > >> >> Subject: Re: [DISCUSS] KIP-1103: Additional metrics
> for
> > > > > > > cooperative
> > > > > > > > > >> >> consumption
> > > > > > > > > >> >>
> > > > > > > > > >> >> Hi, Apoorv,
> > > > > > > > > >> >>
> > > > > > > > > >> >> Thanks for the KIP. A few comments.
> > > > > > > > > >> >>
> > > > > > > > > >> >> 1. TotalShareReleaseRequestsPerSec: This is a bit
> weird
> > > > since
> > > > > > > > there is
> > > > > > > > > >> >> no ShareReleaseRequest.
> > > > > > > > > >> >>
> > > > > > > > > >> >> 2. RequestTopicPartitionsFetchRatio and
> > > > > > > InFlightBatchMessageCount:
> > > > > > > > It
> > > > > > > > > >> >> seems
> > > > > > > > > >> >> Histogram is more appropriate for them.
> > > > > > > > > >> >>
> > > > > > > > > >> >> 3. RequestTopicPartitionsFetchEmptyCount and
> > > > > > > > > >> AcquisitionLockTimeoutCount:
> > > > > > > > > >> >> The problem with using gauge is that if the value
> > changes
> > > > > > > quickly,
> > > > > > > > we
> > > > > > > > > >> may
> > > > > > > > > >> >> not be able to capture the occurrence of empty fetch
> or
> > > > lock
> > > > > > > > timeout.
> > > > > > > > > >> >> Meter
> > > > > > > > > >> >> captures those rare events better.
> > > > > > > > > >> >>
> > > > > > > > > >> >> Jun
> > > > > > > > > >> >>
> > > > > > > > > >> >>
> > > > > > > > > >> >> On Tue, Oct 29, 2024 at 8:48 AM Apoorv Mittal <
> > > > > > > > > >> apoorvmitta...@gmail.com>
> > > > > > > > > >> >> wrote:
> > > > > > > > > >> >>
> > > > > > > > > >> >> > Hi Everyone,
> > > > > > > > > >> >> > I would like to start a discussion on KIP-1103:
> > > > > > > > > >> >> >
> > > > > > > > > >> >> >
> > > > > > > > > >> >>
> > > > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1103%3A+Additional+metrics+for+cooperative+consumption
> > > > > > > > > >> >> >
> > > > > > > > > >> >> > This KIP extends KIP-932 to provide additional
> > metrics
> > > > for
> > > > > > > > > >> >> > Queues/Cooperative consumption.
> > > > > > > > > >> >> >
> > > > > > > > > >> >> > Regards,
> > > > > > > > > >> >> > Apoorv Mittal
> > > > > > > > > >> >> >
> > > > > > > > > >> >>
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to