Hi Divij, Thank you for the review and the great suggestions, again. I have updated the corresponding content, can you take another look? Regarding the KIP-544 style regex, I have added it to the new property too. It's expanded to include multiple sections for better future extension.
Best, Qichao Chu Software Engineer | Data - Kafka [image: Uber] <https://uber.com/> On Mon, Oct 30, 2023 at 6:26 PM Divij Vaidya <divijvaidy...@gmail.com> wrote: > Hey *Qichao* > > Thank you for the update on the KIP. I like the idea of incremental > delivery and adding which metrics support this verbosity as a later KIP. > But I also want to ensure that we wouldn't have to change the current > config when adding that in future. Hence, we need some discussion on it in > the scope of the KIP. > > About the dynamic configuration: > Do we need to add the "default" mode? I am asking because it may inhibit us > from adding the allowList option in future. Instead if we could rephrase > the config as: "metric.verbosity.high" which takes values as a regEx > (default will be empty), then we wouldn't have to worry about > future-proofness of this KIP. Notably this is an existing pattern used by > KIP-544. > Alternatively, if you choose to stick to the current configuration pattern, > please provide information on how this config will look like when we add > allow listing in future. > > About the perf test: > Motivation - The motivation of perf test is to provide users with a hint on > what perf penalty they can expect and whether default has degraded perf > (due to additional "empty" labels). > Dimensions of the test could be - scrape interval, utilization of broker > (no traffic vs. heavy traffic), number of partitions (small/200 to > large/2k). > Things to collect during perf test - number of mbeans registered with JMX, > CPU, heap utilization > Expected results - As long as we can prove that there is no additional > usage (significant) of CPU or heap after this change for the "default > mode", we should be good. For the "high" mode, we should document the > expected increase for users but it is not a blocker to implement this KIP. > > > *Kirk*, I have tried to clarify the expectation on performance, does that > address your question earlier? Also, I am happy with having a Kafka level > dynamic config that we can use to filter our metric/dimensionality since we > have a precedence at KIP-544. Hence, my suggestion to push this filtering > to metric library can be ignored. > > -- > Divij Vaidya > > > > On Sat, Oct 28, 2023 at 11:37 AM Qichao Chu <qic...@uber.com.invalid> > wrote: > > > Hello Everyone, > > > > Can I ask for some feedback regarding KIP-977 > > < > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics > > > > > ? > > > > Best, > > Qichao Chu > > Software Engineer | Data - Kafka > > [image: Uber] <https://uber.com/> > > > > > > On Mon, Oct 16, 2023 at 7:34 PM Qichao Chu <qic...@uber.com> wrote: > > > > > Hi Divij and Kirk, > > > > > > Thank you both for providing the valuable feedback and sorry for the > > > delay. I have just updated the KIP to address the comments. > > > > > > 1. Instead of using a topic-level control, global verbosity control > > > makes more sense if we want to extend it in the future. It would be > > very > > > difficult if we want to apply the topic allowlist everywhere > > > 2. Also, the topic allowlist was not dynamic which makes everything > > > quite complex, especially for the topic lifecycle management. By > > using the > > > dynamic global config, debugging could be easier, and management of > > the > > > config is also made easier. > > > 3. More details are included in the test section. > > > > > > One thing that still misses is the performance numbers. I will get it > > > ready with our internal clusters and share out soon. > > > > > > Many thanks for the review! > > > Qichao > > > > > > On Tue, Sep 12, 2023 at 8:31 AM Kirk True <k...@kirktrue.pro> wrote: > > > > > >> Oh, and does metrics.partition.level.reporting.topics allow for regex? > > >> > > >> > On Sep 12, 2023, at 8:26 AM, Kirk True <k...@kirktrue.pro> wrote: > > >> > > > >> > Hi Qichao, > > >> > > > >> > Thanks for the KIP! > > >> > > > >> > Divij—questions/comments inline... > > >> > > > >> >> On Sep 11, 2023, at 4:32 AM, Divij Vaidya <divijvaidy...@gmail.com > > > > >> wrote: > > >> >> > > >> >> Thank you for the proposal Qichao. > > >> >> > > >> >> I agree with the motivation here and understand the tradeoff here > > >> >> between observability vs. increased metric dimensions (metric > fan-out > > >> >> as you say in the KIP). > > >> >> > > >> >> High level comments: > > >> >> > > >> >> 1. I would urge you to consider the extensibility of the proposal > for > > >> >> other types of metrics. Tomorrow, if we want to selectively add > > >> >> "partition" dimension to another metric, would we have to modify > the > > >> >> code where each metric is emitted? Alternatively, could we abstract > > >> >> out this config in a "Kafka Metrics" library. The code provides all > > >> >> information about this library and this library can choose which > > >> >> dimensions it wants to add to the final metrics that are emitted > > based > > >> >> on declarative configuration. > > >> > > > >> > I’d agree with this if it doesn’t place a burden on the callers. Are > > >> there any potential call sites that don’t have the partition > information > > >> readily available? > > >> > > > >> >> 2. Can we offload the handling of this dimension filtering to the > > >> >> metric framework? Have you explored whether prometheus or other > > >> >> libraries provide the ability to dynamically change dimensions > > >> >> associated with metrics? > > >> > > > >> > I’m not familiar with the downstream metrics providers’ > capabilities. > > >> This is a greatest common denominator scenario, right? We’d have to be > > >> reasonable sure that the heavily used providers *all* support such > > dynamic > > >> filtering. > > >> > > > >> > Also—and correct me as needed as I’m not familiar with the area—if > we > > >> relegate partition filtering to a lower layer, we’d still need to > store > > the > > >> metric data in memory until it’s flushed, yes? If so, is that overhead > > of > > >> any concern? > > >> > > > >> >> Implementation level comments: > > >> >> > > >> >> 1. In the test plan section, please mention what kind of integ > and/or > > >> >> unit tests will be added and what they will assert. As an example, > > you > > >> >> can add a section, "functionality tests", which would assert that > new > > >> >> metric config is being respected and another section, "performance > > >> >> tests", which could be a system test and assert that no regression > > >> >> caused wrt resources occupied by metrics from one version to > another. > > >> >> 2. Please mention why or why not are we considering dynamically > > >> >> setting the configuration (i.e. without a broker restart)? I would > > >> >> imagine that the ability to dynamically configure for a specific > > topic > > >> >> will be very useful especially to debug production situations that > > you > > >> >> mention in the motivation. > > >> > > > >> > +1 > > >> > > > >> >> 3. You mention that we want to start with metrics closely related > to > > >> >> producer & consumers first, which is fair. Could you please add a > > >> >> statement on the work required to extend this to other metrics in > > >> >> future? > > >> > > > >> > +1 > > >> > > > >> >> 4. In the compatibility section, you mention that this change is > > >> >> backward compatible. I don't fully understand that. During a > version > > >> >> upgrade, we will start with an empty list of topics to maintain > > >> >> backward compatibility. I assume after the upgrade, we will update > > the > > >> >> new config with topic names that we desire to monitor. But updating > > >> >> the config will require a broker restart (a rolling restart since > > >> >> config is read-only). We will be in a situation where some brokers > > are > > >> >> sending metrics with a new "partition" dimension and some brokers > are > > >> >> sending metrics with no partition dimension. Is that acceptable to > > JMX > > >> >> / prometheus collectors? Would it break them? Please clarify how > > >> >> upgrades will work in the compatibility section. > > >> >> 5. Could you please quantify (with an experiment) the expected perf > > >> >> impact of adding the partition dimension? This could be done as > part > > >> >> of "test plan" section and would serve as a data point for users to > > >> >> understand the potential impact if they decide to turn it on. > > >> > > > >> > Is there some guidance on the level of precision and detail expected > > >> when providing the performance numbers in the KIP? > > >> > > > >> > This notion of proving out the performance impact is important, I > > >> agree. Anecdotally, there was another KIP I was following for which > > >> performance numbers were requested, as is reasonable. But that caused > > the > > >> KIP to go a bit sideways as a result because it wasn’t able to get > > >> consensus on a) the different scenarios to test, and b) the > quantitative > > >> goal for each. I’m not really sure the rigo(u)r that’s expected at > this > > >> stage in the development of a new feature. > > >> > > > >> > Thanks, > > >> > Kirk > > >> > > > >> >> > > >> >> -- > > >> >> Divij Vaidya > > >> >> > > >> >> > > >> >> On Sat, Sep 9, 2023 at 8:18 PM Qichao Chu <qic...@uber.com.invalid > > > > >> wrote: > > >> >>> > > >> >>> Hi All, > > >> >>> > > >> >>> Although this has been discussed many times, I would like to > start a > > >> new > > >> >>> discussion regarding the introduction of partition-level > throughput > > >> >>> metrics. Please review the KIP and I'm eager to know everyone's > > >> thoughts: > > >> >>> > > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-977%3A+Partition-Level+Throughput+Metrics > > >> >>> > > >> >>> TL;DR: The KIP proposes to add partition-level throughput metrics > > and > > >> a new > > >> >>> configuration to control the fan-out rate. > > >> >>> > > >> >>> Thank you all for the review and have a nice weekend! > > >> >>> > > >> >>> Best, > > >> >>> Qichao > > >> > > >> > > >