Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Qichao Chu Wed, 01 Nov 2023 06:07:34 -0700

Hi Divij,

Thank you for the review and the great suggestions, again. I have updated
the corresponding content, can you take another look?
Regarding the KIP-544 style regex, I have added it to the new property too.
It's expanded to include multiple sections for better future extension.


Best,
Qichao Chu
Software Engineer | Data - Kafka
[image: Uber] <https://uber.com/>


On Mon, Oct 30, 2023 at 6:26 PM Divij Vaidya <divijvaidy...@gmail.com>
wrote:

> Hey *Qichao*
>
> Thank you for the update on the KIP. I like the idea of incremental
> delivery and adding which metrics support this verbosity as a later KIP.
> But I also want to ensure that we wouldn't have to change the current
> config when adding that in future. Hence, we need some discussion on it in
> the scope of the KIP.
>
> About the dynamic configuration:
> Do we need to add the "default" mode? I am asking because it may inhibit us
> from adding the allowList option in future. Instead if we could rephrase
> the config as: "metric.verbosity.high" which takes values as a regEx
> (default will be empty), then we wouldn't have to worry about
> future-proofness of this KIP. Notably this is an existing pattern used by
> KIP-544.
> Alternatively, if you choose to stick to the current configuration pattern,
> please provide information on how this config will look like when we add
> allow listing in future.
>
> About the perf test:
> Motivation - The motivation of perf test is to provide users with a hint on
> what perf penalty they can expect and whether default has degraded perf
> (due to additional "empty" labels).
> Dimensions of the test could be - scrape interval, utilization of broker
> (no traffic vs. heavy traffic), number of partitions (small/200 to
> large/2k).
> Things to collect during perf test - number of mbeans registered with JMX,
> CPU, heap utilization
> Expected results - As long as we can prove that there is no additional
> usage (significant) of CPU or heap after this change for the "default
> mode", we should be good. For the "high" mode, we should document the
> expected increase for users but it is not a blocker to implement this KIP.
>
>
> *Kirk*, I have tried to clarify the expectation on performance, does that
> address your question earlier? Also, I am happy with having a Kafka level
> dynamic config that we can use to filter our metric/dimensionality since we
> have a precedence at KIP-544. Hence, my suggestion to push this filtering
> to metric library can be ignored.
>
> --
> Divij Vaidya
>
>
>
> On Sat, Oct 28, 2023 at 11:37 AM Qichao Chu <qic...@uber.com.invalid>
> wrote:
>
> > Hello Everyone,
> >
> > Can I ask for some feedback regarding KIP-977
> > <
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
> > >
> > ?
> >
> > Best,
> > Qichao Chu
> > Software Engineer | Data - Kafka
> > [image: Uber] <https://uber.com/>
> >
> >
> > On Mon, Oct 16, 2023 at 7:34 PM Qichao Chu <qic...@uber.com> wrote:
> >
> > > Hi Divij and Kirk,
> > >
> > > Thank you both for providing the valuable feedback and sorry for the
> > > delay. I have just updated the KIP to address the comments.
> > >
> > >    1. Instead of using a topic-level control, global verbosity control
> > >    makes more sense if we want to extend it in the future. It would be
> > very
> > >    difficult if we want to apply the topic allowlist everywhere
> > >    2. Also, the topic allowlist was not dynamic which makes everything
> > >    quite complex, especially for the topic lifecycle management. By
> > using the
> > >    dynamic global config, debugging could be easier, and management of
> > the
> > >    config is also made easier.
> > >    3. More details are included in the test section.
> > >
> > > One thing that still misses is the performance numbers. I will get it
> > > ready with our internal clusters and share out soon.
> > >
> > > Many thanks for the review!
> > > Qichao
> > >
> > > On Tue, Sep 12, 2023 at 8:31 AM Kirk True <k...@kirktrue.pro> wrote:
> > >
> > >> Oh, and does metrics.partition.level.reporting.topics allow for regex?
> > >>
> > >> > On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro> wrote:
> > >> >
> > >> > Hi Qichao,
> > >> >
> > >> > Thanks for the KIP!
> > >> >
> > >> > Divij—questions/comments inline...
> > >> >
> > >> >> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <divijvaidy...@gmail.com
> >
> > >> wrote:
> > >> >>
> > >> >> Thank you for the proposal Qichao.
> > >> >>
> > >> >> I agree with the motivation here and understand the tradeoff here
> > >> >> between observability vs. increased metric dimensions (metric
> fan-out
> > >> >> as you say in the KIP).
> > >> >>
> > >> >> High level comments:
> > >> >>
> > >> >> 1. I would urge you to consider the extensibility of the proposal
> for
> > >> >> other types of metrics. Tomorrow, if we want to selectively add
> > >> >> "partition" dimension to another metric, would we have to modify
> the
> > >> >> code where each metric is emitted? Alternatively, could we abstract
> > >> >> out this config in a "Kafka Metrics" library. The code provides all
> > >> >> information about this library and this library can choose which
> > >> >> dimensions it wants to add to the final metrics that are emitted
> > based
> > >> >> on declarative configuration.
> > >> >
> > >> > I’d agree with this if it doesn’t place a burden on the callers. Are
> > >> there any potential call sites that don’t have the partition
> information
> > >> readily available?
> > >> >
> > >> >> 2. Can we offload the handling of this dimension filtering to the
> > >> >> metric framework? Have you explored whether prometheus or other
> > >> >> libraries provide the ability to dynamically change dimensions
> > >> >> associated with metrics?
> > >> >
> > >> > I’m not familiar with the downstream metrics providers’
> capabilities.
> > >> This is a greatest common denominator scenario, right? We’d have to be
> > >> reasonable sure that the heavily used providers *all* support such
> > dynamic
> > >> filtering.
> > >> >
> > >> > Also—and correct me as needed as I’m not familiar with the area—if
> we
> > >> relegate partition filtering to a lower layer, we’d still need to
> store
> > the
> > >> metric data in memory until it’s flushed, yes? If so, is that overhead
> > of
> > >> any concern?
> > >> >
> > >> >> Implementation level comments:
> > >> >>
> > >> >> 1. In the test plan section, please mention what kind of integ
> and/or
> > >> >> unit tests will be added and what they will assert. As an example,
> > you
> > >> >> can add a section, "functionality tests", which would assert that
> new
> > >> >> metric config is being respected and another section, "performance
> > >> >> tests", which could be a system test and assert that no regression
> > >> >> caused wrt resources occupied by metrics from one version to
> another.
> > >> >> 2. Please mention why or why not are we considering dynamically
> > >> >> setting the configuration (i.e. without a broker restart)? I would
> > >> >> imagine that the ability to dynamically configure for a specific
> > topic
> > >> >> will be very useful especially to debug production situations that
> > you
> > >> >> mention in the motivation.
> > >> >
> > >> > +1
> > >> >
> > >> >> 3. You mention that we want to start with metrics closely related
> to
> > >> >> producer & consumers first, which is fair. Could you please add a
> > >> >> statement on the work required to extend this to other metrics in
> > >> >> future?
> > >> >
> > >> > +1
> > >> >
> > >> >> 4. In the compatibility section, you mention that this change is
> > >> >> backward compatible. I don't fully understand that. During a
> version
> > >> >> upgrade, we will start with an empty list of topics to maintain
> > >> >> backward compatibility. I assume after the upgrade, we will update
> > the
> > >> >> new config with topic names that we desire to monitor. But updating
> > >> >> the config will require a broker restart (a rolling restart since
> > >> >> config is read-only). We will be in a situation where some brokers
> > are
> > >> >> sending metrics with a new "partition" dimension and some brokers
> are
> > >> >> sending metrics with no partition dimension. Is that acceptable to
> > JMX
> > >> >> / prometheus collectors? Would it break them? Please clarify how
> > >> >> upgrades will work in the compatibility section.
> > >> >> 5. Could you please quantify (with an experiment) the expected perf
> > >> >> impact of adding the partition dimension? This could be done as
> part
> > >> >> of "test plan" section and would serve as a data point for users to
> > >> >> understand the potential impact if they decide to turn it on.
> > >> >
> > >> > Is there some guidance on the level of precision and detail expected
> > >> when providing the performance numbers in the KIP?
> > >> >
> > >> > This notion of proving out the performance impact is important, I
> > >> agree. Anecdotally, there was another KIP I was following for which
> > >> performance numbers were requested, as is reasonable. But that caused
> > the
> > >> KIP to go a bit sideways as a result because it wasn’t able to get
> > >> consensus on a) the different scenarios to test, and b) the
> quantitative
> > >> goal for each. I’m not really sure the rigo(u)r that’s expected at
> this
> > >> stage in the development of a new feature.
> > >> >
> > >> > Thanks,
> > >> > Kirk
> > >> >
> > >> >>
> > >> >> --
> > >> >> Divij Vaidya
> > >> >>
> > >> >>
> > >> >> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid
> >
> > >> wrote:
> > >> >>>
> > >> >>> Hi All,
> > >> >>>
> > >> >>> Although this has been discussed many times, I would like to
> start a
> > >> new
> > >> >>> discussion regarding the introduction of partition-level
> throughput
> > >> >>> metrics. Please review the KIP and I'm eager to know everyone's
> > >> thoughts:
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
> > >> >>>
> > >> >>> TL;DR: The KIP proposes to add partition-level throughput metrics
> > and
> > >> a new
> > >> >>> configuration to control the fan-out rate.
> > >> >>>
> > >> >>> Thank you all for the review and have a nice weekend!
> > >> >>>
> > >> >>> Best,
> > >> >>> Qichao
> > >>
> > >>
> >
>

Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Reply via email to