Oh, and does metrics.partition.level.reporting.topics allow for regex?
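E.g., would a value like this be accepted, or is it an exact-match list
only? (Hypothetical value, reusing the config name from the KIP.)

   metrics.partition.level.reporting.topics=orders,payments-.*
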
> On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro> wrote:
>
> Hi Qichao,
>
> Thanks for the KIP!
>
> Divij—questions/comments inline...
>
>> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <divijvaidy...@gmail.com> wrote:
>>
>> Thank you for the proposal Qichao.
>>
>> I agree with the motivation and understand the tradeoff here between
>> observability and increased metric dimensions (metric fan-out, as you
>> say in the KIP).
>>
>> High level comments:
>>
>> 1. I would urge you to consider the extensibility of the proposal for
>> other types of metrics. Tomorrow, if we want to selectively add a
>> "partition" dimension to another metric, would we have to modify the
>> code at each site where a metric is emitted? Alternatively, could we
>> abstract this config out into a "Kafka Metrics" library? Call sites
>> would provide all available information to this library, and the
>> library would choose which dimensions to add to the final emitted
>> metrics based on declarative configuration.
>
> I’d agree with this if it doesn’t place a burden on the callers. Are there
> any potential call sites that don’t have the partition information readily
> available?
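>
> For discussion’s sake, the seam I imagine Divij describing might look
> something like this (hypothetical names, just a Java sketch, not a
> proposal for the actual API):
>
>    import java.util.Map;
>
>    // Call sites report every dimension they know about; the library
>    // decides, from declarative config, which dimensions survive onto
>    // the emitted metric (e.g. dropping "partition" unless the topic
>    // is allow-listed).
>    public interface DimensionFilter {
>        Map<String, String> filter(String metricName, Map<String, String> tags);
>    }
>
>    // The call site then stays unconditional, so no per-metric code
>    // changes are needed later (metrics.record() is invented here):
>    metrics.record("BytesInPerSec",
>            Map.of("topic", topic, "partition", String.valueOf(partition)),
>            bytesIn);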
>
>> 2. Can we offload the handling of this dimension filtering to the
>> metric framework? Have you explored whether prometheus or other
>> libraries provide the ability to dynamically change dimensions
>> associated with metrics?
>
> I’m not familiar with the downstream metrics providers’ capabilities. This
> is a lowest-common-denominator scenario, right? We’d have to be reasonably
> sure that the heavily used providers *all* support such dynamic filtering.
>
> Also—and correct me as needed, since I’m not familiar with the area—if we
> relegate partition filtering to a lower layer, we’d still need to store the
> metric data in memory until it’s flushed, yes? If so, is that overhead of
> any concern?
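>
> (The one data point I could dig up: Prometheus can drop a label at scrape
> time with a metric_relabel_configs entry along these lines, an untested
> sketch on my part, but that only helps after the broker has already
> materialized the per-partition series in memory:)
>
>    metric_relabel_configs:
>      - action: labeldrop
>        regex: partition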
>
>> Implementation level comments:
>>
>> 1. In the test plan section, please mention what kind of integration
>> and/or unit tests will be added and what they will assert. As an example,
>> you could add a "functionality tests" section, which would assert that
>> the new metric config is respected, and a "performance tests" section,
>> which could be a system test asserting that there is no regression in the
>> resources occupied by metrics from one version to another.
>> 2. Please mention whether (and why or why not) we are considering
>> dynamically setting the configuration (i.e. without a broker restart). I
>> would imagine that the ability to dynamically enable this for a specific
>> topic would be very useful, especially for debugging the production
>> situations you mention in the motivation.
>
> +1
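>
> If the config were dynamic, flipping it on for a hot topic mid-incident
> could look something like the following (assuming the config name from the
> KIP and that it becomes a dynamic broker config, both of which are TBD):
>
>    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
>      --entity-type brokers --entity-default --alter \
>      --add-config metrics.partition.level.reporting.topics=orders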
>
>> 3. You mention that we want to start with metrics closely related to
>> producer & consumers first, which is fair. Could you please add a
>> statement on the work required to extend this to other metrics in
>> future?
>
> +1
>
>> 4. In the compatibility section, you mention that this change is
>> backward compatible. I don't fully understand that. During a version
>> upgrade, we will start with an empty list of topics to maintain backward
>> compatibility. I assume that after the upgrade, we will update the new
>> config with the topic names we want to monitor. But updating the config
>> will require a broker restart (a rolling restart, since the config is
>> read-only). We will then be in a situation where some brokers are sending
>> metrics with the new "partition" dimension and some brokers are sending
>> metrics with no partition dimension. Is that acceptable to JMX /
>> Prometheus collectors? Would it break them? Please clarify how upgrades
>> will work in the compatibility section.
>> 5. Could you please quantify (with an experiment) the expected perf
>> impact of adding the partition dimension? This could be done as part of
>> the "test plan" section and would serve as a data point for users to
>> understand the potential impact if they decide to turn it on.
>
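> On point 4, to make the mixed-fleet window concrete: mid-rollout, the
> upgraded brokers would register MBeans along the lines of (the partition
> tag naming is my guess; the KIP should pin it down):
>
>    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=orders,partition=0
>
> while the not-yet-restarted brokers would still expose only:
>
>    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=orders
>
> A collector that keys on the exact tag set would see those as different
> series, which seems worth spelling out in the compatibility section.
>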
> Is there some guidance on the level of precision and detail expected when
> providing the performance numbers in the KIP?
>
> This notion of proving out the performance impact is important, I agree.
> Anecdotally, there was another KIP I was following for which performance
> numbers were requested, as is reasonable. But that caused the KIP to go a
> bit sideways, because the discussion couldn’t reach consensus on a) the
> different scenarios to test, and b) the quantitative goal for each. I’m
> not really sure what rigo(u)r is expected at this stage in the development
> of a new feature.
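>
> For a first pass, maybe a before/after run of the stock perf tool, with
> the config off versus on for the target topic, would suffice, e.g. (an
> illustrative invocation, not a prescribed methodology):
>
>    bin/kafka-producer-perf-test.sh --topic orders --num-records 10000000 \
>      --record-size 1024 --throughput -1 \
>      --producer-props bootstrap.servers=localhost:9092
>
> together with broker CPU/heap observed while a collector scrapes the
> enlarged metric set.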
>
> Thanks,
> Kirk
>
>>
>> --
>> Divij Vaidya
>>
>>
>> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid> wrote:
>>>
>>> Hi All,
>>>
>>> Although this has been discussed many times, I would like to start a new
>>> discussion regarding the introduction of partition-level throughput
>>> metrics. Please review the KIP; I'm eager to hear everyone's thoughts:
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
>>>
>>> TL;DR: The KIP proposes to add partition-level throughput metrics and a new
>>> configuration to control the fan-out rate.
>>>
>>> Thank you all for the review and have a nice weekend!
>>>
>>> Best,
>>> Qichao