Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Qichao Chu Sat, 28 Oct 2023 02:37:53 -0700

Hello Everyone,

Can I ask for some feedback regarding KIP-977
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics>
?


Best,
Qichao Chu
Software Engineer | Data - Kafka
[image: Uber] <https://uber.com/>


On Mon, Oct 16, 2023 at 7:34 PM Qichao Chu <qic...@uber.com> wrote:

> Hi Divij and Kirk,
>
> Thank you both for providing the valuable feedback and sorry for the
> delay. I have just updated the KIP to address the comments.
>
>    1. Instead of using a topic-level control, global verbosity control
>    makes more sense if we want to extend it in the future. It would be very
>    difficult if we want to apply the topic allowlist everywhere
>    2. Also, the topic allowlist was not dynamic which makes everything
>    quite complex, especially for the topic lifecycle management. By using the
>    dynamic global config, debugging could be easier, and management of the
>    config is also made easier.
>    3. More details are included in the test section.
>
> One thing that still misses is the performance numbers. I will get it
> ready with our internal clusters and share out soon.
>
> Many thanks for the review!
> Qichao
>
> On Tue, Sep 12, 2023 at 8:31 AM Kirk True <k...@kirktrue.pro> wrote:
>
>> Oh, and does metrics.partition.level.reporting.topics allow for regex?
>>
>> > On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro> wrote:
>> >
>> > Hi Qichao,
>> >
>> > Thanks for the KIP!
>> >
>> > Divij—questions/comments inline...
>> >
>> >> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <divijvaidy...@gmail.com>
>> wrote:
>> >>
>> >> Thank you for the proposal Qichao.
>> >>
>> >> I agree with the motivation here and understand the tradeoff here
>> >> between observability vs. increased metric dimensions (metric fan-out
>> >> as you say in the KIP).
>> >>
>> >> High level comments:
>> >>
>> >> 1. I would urge you to consider the extensibility of the proposal for
>> >> other types of metrics. Tomorrow, if we want to selectively add
>> >> "partition" dimension to another metric, would we have to modify the
>> >> code where each metric is emitted? Alternatively, could we abstract
>> >> out this config in a "Kafka Metrics" library. The code provides all
>> >> information about this library and this library can choose which
>> >> dimensions it wants to add to the final metrics that are emitted based
>> >> on declarative configuration.
>> >
>> > I’d agree with this if it doesn’t place a burden on the callers. Are
>> there any potential call sites that don’t have the partition information
>> readily available?
>> >
>> >> 2. Can we offload the handling of this dimension filtering to the
>> >> metric framework? Have you explored whether prometheus or other
>> >> libraries provide the ability to dynamically change dimensions
>> >> associated with metrics?
>> >
>> > I’m not familiar with the downstream metrics providers’ capabilities.
>> This is a greatest common denominator scenario, right? We’d have to be
>> reasonable sure that the heavily used providers *all* support such dynamic
>> filtering.
>> >
>> > Also—and correct me as needed as I’m not familiar with the area—if we
>> relegate partition filtering to a lower layer, we’d still need to store the
>> metric data in memory until it’s flushed, yes? If so, is that overhead of
>> any concern?
>> >
>> >> Implementation level comments:
>> >>
>> >> 1. In the test plan section, please mention what kind of integ and/or
>> >> unit tests will be added and what they will assert. As an example, you
>> >> can add a section, "functionality tests", which would assert that new
>> >> metric config is being respected and another section, "performance
>> >> tests", which could be a system test and assert that no regression
>> >> caused wrt resources occupied by metrics from one version to another.
>> >> 2. Please mention why or why not are we considering dynamically
>> >> setting the configuration (i.e. without a broker restart)? I would
>> >> imagine that the ability to dynamically configure for a specific topic
>> >> will be very useful especially to debug production situations that you
>> >> mention in the motivation.
>> >
>> > +1
>> >
>> >> 3. You mention that we want to start with metrics closely related to
>> >> producer & consumers first, which is fair. Could you please add a
>> >> statement on the work required to extend this to other metrics in
>> >> future?
>> >
>> > +1
>> >
>> >> 4. In the compatibility section, you mention that this change is
>> >> backward compatible. I don't fully understand that. During a version
>> >> upgrade, we will start with an empty list of topics to maintain
>> >> backward compatibility. I assume after the upgrade, we will update the
>> >> new config with topic names that we desire to monitor. But updating
>> >> the config will require a broker restart (a rolling restart since
>> >> config is read-only). We will be in a situation where some brokers are
>> >> sending metrics with a new "partition" dimension and some brokers are
>> >> sending metrics with no partition dimension. Is that acceptable to JMX
>> >> / prometheus collectors? Would it break them? Please clarify how
>> >> upgrades will work in the compatibility section.
>> >> 5. Could you please quantify (with an experiment) the expected perf
>> >> impact of adding the partition dimension? This could be done as part
>> >> of "test plan" section and would serve as a data point for users to
>> >> understand the potential impact if they decide to turn it on.
>> >
>> > Is there some guidance on the level of precision and detail expected
>> when providing the performance numbers in the KIP?
>> >
>> > This notion of proving out the performance impact is important, I
>> agree. Anecdotally, there was another KIP I was following for which
>> performance numbers were requested, as is reasonable. But that caused the
>> KIP to go a bit sideways as a result because it wasn’t able to get
>> consensus on a) the different scenarios to test, and b) the quantitative
>> goal for each. I’m not really sure the rigo(u)r that’s expected at this
>> stage in the development of a new feature.
>> >
>> > Thanks,
>> > Kirk
>> >
>> >>
>> >> --
>> >> Divij Vaidya
>> >>
>> >>
>> >> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid>
>> wrote:
>> >>>
>> >>> Hi All,
>> >>>
>> >>> Although this has been discussed many times, I would like to start a
>> new
>> >>> discussion regarding the introduction of partition-level throughput
>> >>> metrics. Please review the KIP and I'm eager to know everyone's
>> thoughts:
>> >>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
>> >>>
>> >>> TL;DR: The KIP proposes to add partition-level throughput metrics and
>> a new
>> >>> configuration to control the fan-out rate.
>> >>>
>> >>> Thank you all for the review and have a nice weekend!
>> >>>
>> >>> Best,
>> >>> Qichao
>>
>>

Re: [DISCUSS] KIP-977: Partition-Level Throughput Metrics

Reply via email to