Oh, and does metrics.partition.level.reporting.topics allow for regex?
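E.g., would a value like this be accepted, or is it an exact-match list
only? (Hypothetical value, reusing the config name from the KIP.)

   metrics.partition.level.reporting.topics=orders,payments-.*
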
> On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro> wrote:
>
> Hi Qichao,
>
> Thanks for the KIP!
>
> Divij—questions/comments inline...
>
>> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <divijvaidy...@gmail.com> wrote:
>>
>> Thank you for the proposal Qichao.
>>
>> I agree with the motivation and understand the tradeoff here between
>> observability and increased metric dimensions (metric fan-out, as you
>> say in the KIP).
>>
>> High level comments:
>>
>> 1. I would urge you to consider the extensibility of the proposal for
>> other types of metrics. Tomorrow, if we want to selectively add a
>> "partition" dimension to another metric, would we have to modify the
>> code at each site where a metric is emitted? Alternatively, could we
>> abstract this config out into a "Kafka Metrics" library? Call sites
>> would provide all available information to this library, and the
>> library would choose which dimensions to add to the final emitted
>> metrics based on declarative configuration.
>
> I’d agree with this if it doesn’t place a burden on the callers. Are there
> any potential call sites that don’t have the partition information readily
> available?
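>
> For discussion’s sake, the seam I imagine Divij describing might look
> something like this (hypothetical names, just a Java sketch, not a
> proposal for the actual API):
>
>    import java.util.Map;
>
>    // Call sites report every dimension they know about; the library
>    // decides, from declarative config, which dimensions survive onto
>    // the emitted metric (e.g. dropping "partition" unless the topic
>    // is allow-listed).
>    public interface DimensionFilter {
>        Map<String, String> filter(String metricName, Map<String, String> tags);
>    }
>
>    // The call site then stays unconditional, so no per-metric code
>    // changes are needed later (metrics.record() is invented here):
>    metrics.record("BytesInPerSec",
>            Map.of("topic", topic, "partition", String.valueOf(partition)),
>            bytesIn);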
>
>> 2. Can we offload the handling of this dimension filtering to the
>> metric framework? Have you explored whether prometheus or other
>> libraries provide the ability to dynamically change dimensions
>> associated with metrics?
>
> I’m not familiar with the downstream metrics providers’ capabilities. This
> is a lowest-common-denominator scenario, right? We’d have to be reasonably
> sure that the heavily used providers *all* support such dynamic filtering.
>
> Also—and correct me as needed, since I’m not familiar with the area—if we
> relegate partition filtering to a lower layer, we’d still need to store the
> metric data in memory until it’s flushed, yes? If so, is that overhead of
> any concern?
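>
> (The one data point I could dig up: Prometheus can drop a label at scrape
> time with a metric_relabel_configs entry along these lines, an untested
> sketch on my part, but that only helps after the broker has already
> materialized the per-partition series in memory:)
>
>    metric_relabel_configs:
>      - action: labeldrop
>        regex: partition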
>
>> Implementation level comments:
>>
>> 1. In the test plan section, please mention what kind of integration
>> and/or unit tests will be added and what they will assert. As an example,
>> you could add a "functionality tests" section, which would assert that
>> the new metric config is respected, and a "performance tests" section,
>> which could be a system test asserting that there is no regression in the
>> resources occupied by metrics from one version to another.
>> 2. Please mention whether (and why or why not) we are considering
>> dynamically setting the configuration (i.e. without a broker restart). I
>> would imagine that the ability to dynamically enable this for a specific
>> topic would be very useful, especially for debugging the production
>> situations you mention in the motivation.
>
> +1
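>
> If the config were dynamic, flipping it on for a hot topic mid-incident
> could look something like the following (assuming the config name from the
> KIP and that it becomes a dynamic broker config, both of which are TBD):
>
>    bin/kafka-configs.sh --bootstrap-server localhost:9092 \
>      --entity-type brokers --entity-default --alter \
>      --add-config metrics.partition.level.reporting.topics=orders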
>
>> 3. You mention that we want to start with metrics closely related to
>> producer & consumers first, which is fair. Could you please add a
>> statement on the work required to extend this to other metrics in
>> future?
>
> +1
>
>> 4. In the compatibility section, you mention that this change is
>> backward compatible. I don't fully understand that. During a version
>> upgrade, we will start with an empty list of topics to maintain backward
>> compatibility. I assume that after the upgrade, we will update the new
>> config with the topic names we want to monitor. But updating the config
>> will require a broker restart (a rolling restart, since the config is
>> read-only). We will then be in a situation where some brokers are sending
>> metrics with the new "partition" dimension and some brokers are sending
>> metrics with no partition dimension. Is that acceptable to JMX /
>> Prometheus collectors? Would it break them? Please clarify how upgrades
>> will work in the compatibility section.
>> 5. Could you please quantify (with an experiment) the expected perf
>> impact of adding the partition dimension? This could be done as part of
>> the "test plan" section and would serve as a data point for users to
>> understand the potential impact if they decide to turn it on.
>
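> On point 4, to make the mixed-fleet window concrete: mid-rollout, the
> upgraded brokers would register MBeans along the lines of (the partition
> tag naming is my guess; the KIP should pin it down):
>
>    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=orders,partition=0
>
> while the not-yet-restarted brokers would still expose only:
>
>    kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=orders
>
> A collector that keys on the exact tag set would see those as different
> series, which seems worth spelling out in the compatibility section.
>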
> Is there some guidance on the level of precision and detail expected when
> providing the performance numbers in the KIP?
>
> This notion of proving out the performance impact is important, I agree.
> Anecdotally, there was another KIP I was following for which performance
> numbers were requested, as is reasonable. But that caused the KIP to go a
> bit sideways, because the discussion couldn’t reach consensus on a) the
> different scenarios to test, and b) the quantitative goal for each. I’m
> not really sure what rigo(u)r is expected at this stage in the development
> of a new feature.
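>
> For a first pass, maybe a before/after run of the stock perf tool, with
> the config off versus on for the target topic, would suffice, e.g. (an
> illustrative invocation, not a prescribed methodology):
>
>    bin/kafka-producer-perf-test.sh --topic orders --num-records 10000000 \
>      --record-size 1024 --throughput -1 \
>      --producer-props bootstrap.servers=localhost:9092
>
> together with broker CPU/heap observed while a collector scrapes the
> enlarged metric set.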
>
> Thanks,
> Kirk
>
>>
>> --
>> Divij Vaidya
>>
>>
>> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid> wrote:
>>>
>>> Hi All,
>>>
>>> Although this has been discussed many times, I would like to start a new
>>> discussion regarding the introduction of partition-level throughput
>>> metrics. Please review the KIP; I'm eager to hear everyone's thoughts:
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
>>>
>>> TL;DR: The KIP proposes to add partition-level throughput metrics and a new
>>> configuration to control the fan-out rate.
>>>
>>> Thank you all for the review and have a nice weekend!
>>>
>>> Best,
>>> Qichao