Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Qichao Chu Wed, 01 Nov 2023 09:07:33 -0700

Hi Divij,

Thank you for the very quick response and the nice suggestions. I have
updated the KIP with the following thoughts.


1. I checked the Java documentation and it seems the regex engine in utils
<https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/regex/Pattern.html>
is
not 100% compatible with PCRE, though it is very close. I stated
the Java implementation as the requirement since we are most likely to
target a JVM language.
2. Agreed with the filter limitation. For now, let's keep it topic only.
With that in mind, I feel we do have cases where a user wants to list many
topics. Although regex is also possible, an array will make things faster.
This makes me add two options for the topic filter.
3. It seems not many configs are using JSON, this was the intention for me
to use a compound string. However since JSON is used widely in the project,
and given the benefits you mentioned earlier, I tried to make the config a
JSON array. The change is to make it compatible with multi-level settings.

Let me know what you think. Many thanks!

Best,
Qichao Chu
Software Engineer | Data - Kafka
[image: Uber] <https://uber.com/>


On Wed, Nov 1, 2023 at 9:43 PM Divij Vaidya <divijvaidy...@gmail.com> wrote:

> Thank you for making the changes Qichao.
>
> We are now entering in the territory of defining a declarative schema for
> filters. In the new input format, the type is string but we are imposing a
> schema for the string and we should clearly call out the schema. You can
> perhaps choose to adopt a schema such as below:
>
> metricLevel = High | Low (default: Low)
> metricNameRegEx = regEx (default: .*)
> nameOfDimension = string
> dimensionRegEx = regEx
> dimensionFilter = [<nameOfDimension>=<dimensionRegEx>] (default: [])
>
> Final Value schema = "level"=$metricLevel, "name"=$metricNameRegEx,
> $dimensionFilter
>
> Further we need to answer questions such as :
> 1. which regEx format do we support (it should probably be Perl-compatible
> regular expressions (PCRE) because Java's regEx is compatible with it)
> 2. should we restrict the dimensionFilter to at max length 1 and value
> "topic" only for now. Later when we want to expand, we can expand filters
> for other dimensions as well such as partitions.
> 3. if we are coming up with our stringified-schema, why not use json? It
> would save us from building a parsing utility for the schema. (I like it in
> its current format but there is a case to be made for json as well)
> 4. what happens when there are contradictory regEx rules, e.g. a topic
> defined in high as well as low. It is generally solved by defining
> precedence. In our case, we can choose that high has more precedence than
> low.
>
> What do you think?
>
> --
> Divij Vaidya
>
>
>
> On Wed, Nov 1, 2023 at 2:07 PM Qichao Chu <qic...@uber.com.invalid> wrote:
>
> > Hi Divij,
> >
> > Thank you for the review and the great suggestions, again. I have updated
> > the corresponding content, can you take another look?
> > Regarding the KIP-544 style regex, I have added it to the new property
> too.
> > It's expanded to include multiple sections for better future extension.
> >
> > Best,
> > Qichao Chu
> > Software Engineer | Data - Kafka
> > [image: Uber] <https://uber.com/>
> >
> >
> > On Mon, Oct 30, 2023 at 6:26 PM Divij Vaidya <divijvaidy...@gmail.com>
> > wrote:
> >
> > > Hey *Qichao*
> > >
> > > Thank you for the update on the KIP. I like the idea of incremental
> > > delivery and adding which metrics support this verbosity as a later
> KIP.
> > > But I also want to ensure that we wouldn't have to change the current
> > > config when adding that in future. Hence, we need some discussion on it
> > in
> > > the scope of the KIP.
> > >
> > > About the dynamic configuration:
> > > Do we need to add the "default" mode? I am asking because it may
> inhibit
> > us
> > > from adding the allowList option in future. Instead if we could
> rephrase
> > > the config as: "metric.verbosity.high" which takes values as a regEx
> > > (default will be empty), then we wouldn't have to worry about
> > > future-proofness of this KIP. Notably this is an existing pattern used
> by
> > > KIP-544.
> > > Alternatively, if you choose to stick to the current configuration
> > pattern,
> > > please provide information on how this config will look like when we
> add
> > > allow listing in future.
> > >
> > > About the perf test:
> > > Motivation - The motivation of perf test is to provide users with a
> hint
> > on
> > > what perf penalty they can expect and whether default has degraded perf
> > > (due to additional "empty" labels).
> > > Dimensions of the test could be - scrape interval, utilization of
> broker
> > > (no traffic vs. heavy traffic), number of partitions (small/200 to
> > > large/2k).
> > > Things to collect during perf test - number of mbeans registered with
> > JMX,
> > > CPU, heap utilization
> > > Expected results - As long as we can prove that there is no additional
> > > usage (significant) of CPU or heap after this change for the "default
> > > mode", we should be good. For the "high" mode, we should document the
> > > expected increase for users but it is not a blocker to implement this
> > KIP.
> > >
> > >
> > > *Kirk*, I have tried to clarify the expectation on performance, does
> that
> > > address your question earlier? Also, I am happy with having a Kafka
> level
> > > dynamic config that we can use to filter our metric/dimensionality
> since
> > we
> > > have a precedence at KIP-544. Hence, my suggestion to push this
> filtering
> > > to metric library can be ignored.
> > >
> > > --
> > > Divij Vaidya
> > >
> > >
> > >
> > > On Sat, Oct 28, 2023 at 11:37 AM Qichao Chu <qic...@uber.com.invalid>
> > > wrote:
> > >
> > > > Hello Everyone,
> > > >
> > > > Can I ask for some feedback regarding KIP-977
> > > > <
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
> > > > >
> > > > ?
> > > >
> > > > Best,
> > > > Qichao Chu
> > > > Software Engineer | Data - Kafka
> > > > [image: Uber] <https://uber.com/>
> > > >
> > > >
> > > > On Mon, Oct 16, 2023 at 7:34 PM Qichao Chu <qic...@uber.com> wrote:
> > > >
> > > > > Hi Divij and Kirk,
> > > > >
> > > > > Thank you both for providing the valuable feedback and sorry for
> the
> > > > > delay. I have just updated the KIP to address the comments.
> > > > >
> > > > >    1. Instead of using a topic-level control, global verbosity
> > control
> > > > >    makes more sense if we want to extend it in the future. It would
> > be
> > > > very
> > > > >    difficult if we want to apply the topic allowlist everywhere
> > > > >    2. Also, the topic allowlist was not dynamic which makes
> > everything
> > > > >    quite complex, especially for the topic lifecycle management. By
> > > > using the
> > > > >    dynamic global config, debugging could be easier, and management
> > of
> > > > the
> > > > >    config is also made easier.
> > > > >    3. More details are included in the test section.
> > > > >
> > > > > One thing that still misses is the performance numbers. I will get
> it
> > > > > ready with our internal clusters and share out soon.
> > > > >
> > > > > Many thanks for the review!
> > > > > Qichao
> > > > >
> > > > > On Tue, Sep 12, 2023 at 8:31 AM Kirk True <k...@kirktrue.pro>
> wrote:
> > > > >
> > > > >> Oh, and does metrics.partition.level.reporting.topics allow for
> > regex?
> > > > >>
> > > > >> > On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro>
> wrote:
> > > > >> >
> > > > >> > Hi Qichao,
> > > > >> >
> > > > >> > Thanks for the KIP!
> > > > >> >
> > > > >> > Divij—questions/comments inline...
> > > > >> >
> > > > >> >> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <
> > divijvaidy...@gmail.com
> > > >
> > > > >> wrote:
> > > > >> >>
> > > > >> >> Thank you for the proposal Qichao.
> > > > >> >>
> > > > >> >> I agree with the motivation here and understand the tradeoff
> here
> > > > >> >> between observability vs. increased metric dimensions (metric
> > > fan-out
> > > > >> >> as you say in the KIP).
> > > > >> >>
> > > > >> >> High level comments:
> > > > >> >>
> > > > >> >> 1. I would urge you to consider the extensibility of the
> proposal
> > > for
> > > > >> >> other types of metrics. Tomorrow, if we want to selectively add
> > > > >> >> "partition" dimension to another metric, would we have to
> modify
> > > the
> > > > >> >> code where each metric is emitted? Alternatively, could we
> > abstract
> > > > >> >> out this config in a "Kafka Metrics" library. The code provides
> > all
> > > > >> >> information about this library and this library can choose
> which
> > > > >> >> dimensions it wants to add to the final metrics that are
> emitted
> > > > based
> > > > >> >> on declarative configuration.
> > > > >> >
> > > > >> > I’d agree with this if it doesn’t place a burden on the callers.
> > Are
> > > > >> there any potential call sites that don’t have the partition
> > > information
> > > > >> readily available?
> > > > >> >
> > > > >> >> 2. Can we offload the handling of this dimension filtering to
> the
> > > > >> >> metric framework? Have you explored whether prometheus or other
> > > > >> >> libraries provide the ability to dynamically change dimensions
> > > > >> >> associated with metrics?
> > > > >> >
> > > > >> > I’m not familiar with the downstream metrics providers’
> > > capabilities.
> > > > >> This is a greatest common denominator scenario, right? We’d have
> to
> > be
> > > > >> reasonable sure that the heavily used providers *all* support such
> > > > dynamic
> > > > >> filtering.
> > > > >> >
> > > > >> > Also—and correct me as needed as I’m not familiar with the
> area—if
> > > we
> > > > >> relegate partition filtering to a lower layer, we’d still need to
> > > store
> > > > the
> > > > >> metric data in memory until it’s flushed, yes? If so, is that
> > overhead
> > > > of
> > > > >> any concern?
> > > > >> >
> > > > >> >> Implementation level comments:
> > > > >> >>
> > > > >> >> 1. In the test plan section, please mention what kind of integ
> > > and/or
> > > > >> >> unit tests will be added and what they will assert. As an
> > example,
> > > > you
> > > > >> >> can add a section, "functionality tests", which would assert
> that
> > > new
> > > > >> >> metric config is being respected and another section,
> > "performance
> > > > >> >> tests", which could be a system test and assert that no
> > regression
> > > > >> >> caused wrt resources occupied by metrics from one version to
> > > another.
> > > > >> >> 2. Please mention why or why not are we considering dynamically
> > > > >> >> setting the configuration (i.e. without a broker restart)? I
> > would
> > > > >> >> imagine that the ability to dynamically configure for a
> specific
> > > > topic
> > > > >> >> will be very useful especially to debug production situations
> > that
> > > > you
> > > > >> >> mention in the motivation.
> > > > >> >
> > > > >> > +1
> > > > >> >
> > > > >> >> 3. You mention that we want to start with metrics closely
> related
> > > to
> > > > >> >> producer & consumers first, which is fair. Could you please
> add a
> > > > >> >> statement on the work required to extend this to other metrics
> in
> > > > >> >> future?
> > > > >> >
> > > > >> > +1
> > > > >> >
> > > > >> >> 4. In the compatibility section, you mention that this change
> is
> > > > >> >> backward compatible. I don't fully understand that. During a
> > > version
> > > > >> >> upgrade, we will start with an empty list of topics to maintain
> > > > >> >> backward compatibility. I assume after the upgrade, we will
> > update
> > > > the
> > > > >> >> new config with topic names that we desire to monitor. But
> > updating
> > > > >> >> the config will require a broker restart (a rolling restart
> since
> > > > >> >> config is read-only). We will be in a situation where some
> > brokers
> > > > are
> > > > >> >> sending metrics with a new "partition" dimension and some
> brokers
> > > are
> > > > >> >> sending metrics with no partition dimension. Is that acceptable
> > to
> > > > JMX
> > > > >> >> / prometheus collectors? Would it break them? Please clarify
> how
> > > > >> >> upgrades will work in the compatibility section.
> > > > >> >> 5. Could you please quantify (with an experiment) the expected
> > perf
> > > > >> >> impact of adding the partition dimension? This could be done as
> > > part
> > > > >> >> of "test plan" section and would serve as a data point for
> users
> > to
> > > > >> >> understand the potential impact if they decide to turn it on.
> > > > >> >
> > > > >> > Is there some guidance on the level of precision and detail
> > expected
> > > > >> when providing the performance numbers in the KIP?
> > > > >> >
> > > > >> > This notion of proving out the performance impact is important,
> I
> > > > >> agree. Anecdotally, there was another KIP I was following for
> which
> > > > >> performance numbers were requested, as is reasonable. But that
> > caused
> > > > the
> > > > >> KIP to go a bit sideways as a result because it wasn’t able to get
> > > > >> consensus on a) the different scenarios to test, and b) the
> > > quantitative
> > > > >> goal for each. I’m not really sure the rigo(u)r that’s expected at
> > > this
> > > > >> stage in the development of a new feature.
> > > > >> >
> > > > >> > Thanks,
> > > > >> > Kirk
> > > > >> >
> > > > >> >>
> > > > >> >> --
> > > > >> >> Divij Vaidya
> > > > >> >>
> > > > >> >>
> > > > >> >> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu
> > <qic...@uber.com.invalid
> > > >
> > > > >> wrote:
> > > > >> >>>
> > > > >> >>> Hi All,
> > > > >> >>>
> > > > >> >>> Although this has been discussed many times, I would like to
> > > start a
> > > > >> new
> > > > >> >>> discussion regarding the introduction of partition-level
> > > throughput
> > > > >> >>> metrics. Please review the KIP and I'm eager to know
> everyone's
> > > > >> thoughts:
> > > > >> >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
> > > > >> >>>
> > > > >> >>> TL;DR: The KIP proposes to add partition-level throughput
> > metrics
> > > > and
> > > > >> a new
> > > > >> >>> configuration to control the fan-out rate.
> > > > >> >>>
> > > > >> >>> Thank you all for the review and have a nice weekend!
> > > > >> >>>
> > > > >> >>> Best,
> > > > >> >>> Qichao
> > > > >>
> > > > >>
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Reply via email to