Hello Everyone, Can I ask for some feedback regarding KIP-977 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics> ?
Best, Qichao Chu Software Engineer | Data - Kafka [image: Uber] <https://uber.com/> On Mon, Oct 16, 2023 at 7:34 PM Qichao Chu <qic...@uber.com> wrote: > Hi Divij and Kirk, > > Thank you both for providing the valuable feedback and sorry for the > delay. I have just updated the KIP to address the comments. > > 1. Instead of using a topic-level control, global verbosity control > makes more sense if we want to extend it in the future. It would be very > difficult if we want to apply the topic allowlist everywhere > 2. Also, the topic allowlist was not dynamic which makes everything > quite complex, especially for the topic lifecycle management. By using the > dynamic global config, debugging could be easier, and management of the > config is also made easier. > 3. More details are included in the test section. > > One thing that still misses is the performance numbers. I will get it > ready with our internal clusters and share out soon. > > Many thanks for the review! > Qichao > > On Tue, Sep 12, 2023 at 8:31 AM Kirk True <k...@kirktrue.pro> wrote: > >> Oh, and does metrics.partition.level.reporting.topics allow for regex? >> >> > On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro> wrote: >> > >> > Hi Qichao, >> > >> > Thanks for the KIP! >> > >> > Divij—questions/comments inline... >> > >> >> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <divijvaidy...@gmail.com> >> wrote: >> >> >> >> Thank you for the proposal Qichao. >> >> >> >> I agree with the motivation here and understand the tradeoff here >> >> between observability vs. increased metric dimensions (metric fan-out >> >> as you say in the KIP). >> >> >> >> High level comments: >> >> >> >> 1. I would urge you to consider the extensibility of the proposal for >> >> other types of metrics. Tomorrow, if we want to selectively add >> >> "partition" dimension to another metric, would we have to modify the >> >> code where each metric is emitted? Alternatively, could we abstract >> >> out this config in a "Kafka Metrics" library. The code provides all >> >> information about this library and this library can choose which >> >> dimensions it wants to add to the final metrics that are emitted based >> >> on declarative configuration. >> > >> > I’d agree with this if it doesn’t place a burden on the callers. Are >> there any potential call sites that don’t have the partition information >> readily available? >> > >> >> 2. Can we offload the handling of this dimension filtering to the >> >> metric framework? Have you explored whether prometheus or other >> >> libraries provide the ability to dynamically change dimensions >> >> associated with metrics? >> > >> > I’m not familiar with the downstream metrics providers’ capabilities. >> This is a greatest common denominator scenario, right? We’d have to be >> reasonable sure that the heavily used providers *all* support such dynamic >> filtering. >> > >> > Also—and correct me as needed as I’m not familiar with the area—if we >> relegate partition filtering to a lower layer, we’d still need to store the >> metric data in memory until it’s flushed, yes? If so, is that overhead of >> any concern? >> > >> >> Implementation level comments: >> >> >> >> 1. In the test plan section, please mention what kind of integ and/or >> >> unit tests will be added and what they will assert. As an example, you >> >> can add a section, "functionality tests", which would assert that new >> >> metric config is being respected and another section, "performance >> >> tests", which could be a system test and assert that no regression >> >> caused wrt resources occupied by metrics from one version to another. >> >> 2. Please mention why or why not are we considering dynamically >> >> setting the configuration (i.e. without a broker restart)? I would >> >> imagine that the ability to dynamically configure for a specific topic >> >> will be very useful especially to debug production situations that you >> >> mention in the motivation. >> > >> > +1 >> > >> >> 3. You mention that we want to start with metrics closely related to >> >> producer & consumers first, which is fair. Could you please add a >> >> statement on the work required to extend this to other metrics in >> >> future? >> > >> > +1 >> > >> >> 4. In the compatibility section, you mention that this change is >> >> backward compatible. I don't fully understand that. During a version >> >> upgrade, we will start with an empty list of topics to maintain >> >> backward compatibility. I assume after the upgrade, we will update the >> >> new config with topic names that we desire to monitor. But updating >> >> the config will require a broker restart (a rolling restart since >> >> config is read-only). We will be in a situation where some brokers are >> >> sending metrics with a new "partition" dimension and some brokers are >> >> sending metrics with no partition dimension. Is that acceptable to JMX >> >> / prometheus collectors? Would it break them? Please clarify how >> >> upgrades will work in the compatibility section. >> >> 5. Could you please quantify (with an experiment) the expected perf >> >> impact of adding the partition dimension? This could be done as part >> >> of "test plan" section and would serve as a data point for users to >> >> understand the potential impact if they decide to turn it on. >> > >> > Is there some guidance on the level of precision and detail expected >> when providing the performance numbers in the KIP? >> > >> > This notion of proving out the performance impact is important, I >> agree. Anecdotally, there was another KIP I was following for which >> performance numbers were requested, as is reasonable. But that caused the >> KIP to go a bit sideways as a result because it wasn’t able to get >> consensus on a) the different scenarios to test, and b) the quantitative >> goal for each. I’m not really sure the rigo(u)r that’s expected at this >> stage in the development of a new feature. >> > >> > Thanks, >> > Kirk >> > >> >> >> >> -- >> >> Divij Vaidya >> >> >> >> >> >> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid> >> wrote: >> >>> >> >>> Hi All, >> >>> >> >>> Although this has been discussed many times, I would like to start a >> new >> >>> discussion regarding the introduction of partition-level throughput >> >>> metrics. Please review the KIP and I'm eager to know everyone's >> thoughts: >> >>> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics >> >>> >> >>> TL;DR: The KIP proposes to add partition-level throughput metrics and >> a new >> >>> configuration to control the fan-out rate. >> >>> >> >>> Thank you all for the review and have a nice weekend! >> >>> >> >>> Best, >> >>> Qichao >> >>