Thank you for the proposal, Qichao. I agree with the motivation and understand the tradeoff between observability and increased metric dimensions (metric fan-out, as you call it in the KIP).
High level comments:

1. I would urge you to consider the extensibility of the proposal for other types of metrics. Tomorrow, if we want to selectively add the "partition" dimension to another metric, would we have to modify the code at every place that metric is emitted? Alternatively, could we abstract this configuration out into a "Kafka Metrics" library? The emitting code would provide all available information to this library, and the library would choose which dimensions to add to the final emitted metrics based on declarative configuration. (A rough sketch of what I mean is at the very end of this mail.)

2. Can we offload the handling of this dimension filtering to the metric framework? Have you explored whether Prometheus or other libraries provide the ability to dynamically change the dimensions associated with metrics?

Implementation level comments:

1. In the test plan section, please mention what kind of integration and/or unit tests will be added and what they will assert. As an example, you could add a "functionality tests" section, which would assert that the new metric config is being respected, and a "performance tests" section, which could be a system test asserting that there is no regression in the resources occupied by metrics from one version to another.

2. Please mention whether we are considering dynamically setting the configuration (i.e. without a broker restart), and why or why not. I would imagine that the ability to dynamically configure this for a specific topic would be very useful, especially for debugging the production situations you mention in the motivation. (See the AdminClient sketch at the very end of this mail for the kind of workflow I have in mind.)

3. You mention that we want to start with metrics closely related to producers & consumers first, which is fair. Could you please add a statement on the work required to extend this to other metrics in the future?

4. In the compatibility section, you mention that this change is backward compatible. I don't fully understand that. During a version upgrade, we will start with an empty list of topics to maintain backward compatibility. I assume that after the upgrade we will update the new config with the topic names we want to monitor. But updating the config will require a broker restart (a rolling restart, since the config is read-only). During that rolling restart we will be in a situation where some brokers are sending metrics with the new "partition" dimension and some brokers are sending metrics with no partition dimension. Is that acceptable to JMX / Prometheus collectors? Would it break them? Please clarify how upgrades will work in the compatibility section.

5. Could you please quantify (with an experiment) the expected performance impact of adding the partition dimension? This could be done as part of the test plan section and would serve as a data point for users to understand the potential impact if they decide to turn it on.

--
Divij Vaidya


On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid> wrote:
>
> Hi All,
>
> Although this has been discussed many times, I would like to start a new
> discussion regarding the introduction of partition-level throughput
> metrics. Please review the KIP and I'm eager to know everyone's thoughts:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics
>
> TL;DR: The KIP proposes to add partition-level throughput metrics and a new
> configuration to control the fan-out rate.
>
> Thank you all for the review and have a nice weekend!
>
> Best,
> Qichao
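
P.S. To make high-level comment #1 a bit more concrete, here is a very rough sketch of the kind of shared metrics layer I have in mind. Every name below (DimensionFilteringMetrics, record, allowedTags) is made up for illustration and does not exist in Kafka today; the only point is that the call site always supplies every tag it knows about, and a single declaratively configured layer decides which tags survive into the emitted metric.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public final class DimensionFilteringMetrics {

    // Declarative config, e.g. loaded from broker properties:
    // metric name -> set of tag keys that may be emitted for that metric.
    private final Map<String, Set<String>> allowedTags;

    public DimensionFilteringMetrics(Map<String, Set<String>> allowedTags) {
        this.allowedTags = allowedTags;
    }

    // Call sites always pass the full set of tags; the library decides what survives.
    public void record(String metricName, double value, Map<String, String> tags) {
        Set<String> allowed = allowedTags.getOrDefault(metricName, Set.of());
        Map<String, String> emitted = new LinkedHashMap<>();
        for (Map.Entry<String, String> tag : tags.entrySet()) {
            if (allowed.contains(tag.getKey())) {
                emitted.put(tag.getKey(), tag.getValue());
            }
        }
        emit(metricName, value, emitted); // hand off to the real reporter (JMX etc.)
    }

    private void emit(String metricName, double value, Map<String, String> tags) {
        // Placeholder for the actual reporter integration.
        System.out.printf("%s %s = %.2f%n", metricName, tags, value);
    }
}

With this shape, enabling the partition dimension for another metric tomorrow becomes a config change (e.g. "BytesInPerSec" -> {"topic", "partition"}) rather than a change at every emit site.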
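
P.S.2: Regarding implementation comment #2, if the new config were made a dynamic broker config, operators could flip it at runtime with the existing AdminClient incrementalAlterConfigs API instead of doing a rolling restart. In the sketch below only the API calls are real; the config name "partition.level.metrics.topics" is a placeholder I made up, not whatever name the KIP finally chooses.

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public final class EnablePartitionMetricsExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Target broker 0; a cluster-wide default would use an empty entity name instead.
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            // Placeholder config name and value; assumes the KIP exposes a dynamic config.
            AlterConfigOp enable = new AlterConfigOp(
                new ConfigEntry("partition.level.metrics.topics", "orders"),
                AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(Map.of(broker, List.of(enable))).all().get();
        }
    }
}

Whether something like this is feasible, and whether per-topic overrides make sense, is exactly what I would like the KIP to discuss.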