After further discussion with Anna, we decided the following:
- add the average metric, and note in the KIP that the average value may
look low at times when there are many empty partitions that have a 0ms load
time
- set the window to 30sec, because there is no significant difference if
we set the window time to 3hrs, so I will keep the default value.
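
To illustrate both points, here is a rough Python sketch of a windowed
max/avg. It is a simplified single-window model written for this thread, not
Kafka's actual metrics code; the class name `WindowedStats` and its fields are
made up for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WindowedStats:
    """Toy single-window max/avg (hypothetical; not Kafka's implementation)."""
    window_ms: int
    window_start_ms: int = 0
    values: List[float] = field(default_factory=list)

    def record(self, now_ms: int, value_ms: float) -> None:
        # Once the window has lapsed, start a new one and forget old values.
        if now_ms - self.window_start_ms >= self.window_ms:
            self.window_start_ms = now_ms
            self.values = []
        self.values.append(value_ms)

    def max(self) -> float:
        return max(self.values) if self.values else 0.0

    def avg(self) -> float:
        return sum(self.values) / len(self.values) if self.values else 0.0


stats = WindowedStats(window_ms=30_000)
# One slow load plus nine empty partitions that load in 0 ms.
stats.record(now_ms=1_000, value_ms=5_000.0)
for i in range(9):
    stats.record(now_ms=2_000 + i, value_ms=0.0)
print(stats.max(), stats.avg())  # 5000.0 500.0

# A load recorded after the 30 s window has lapsed starts a new window.
stats.record(now_ms=40_000, value_ms=120.0)
print(stats.max(), stats.avg())  # 120.0 120.0
```

In this toy model the nine 0ms loads pull the average down to a tenth of the
max, and the earlier 5000ms max is forgotten as soon as the next window starts.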

Thanks. Let me know if you have any more concerns.
Anastasia

On Mon, Jul 1, 2019 at 9:13 AM Anastasia Vela <av...@confluent.io> wrote:

> Hey Gwen!
>
> Thanks for reviewing my KIP!
>
> 1. I did consider adding an Avg metric as well. Anna and I decided that a
> max would provide the crucial information. We just need to know if there
> was a long load time, and expose what that duration was so we understand
> there's downtime for such a long time. However, I do agree that it may be
> necessary to compute averages if we want to give the max a reference point.
> I can easily add this if we believe it is necessary.
> 2. The default refers to the metric configuration set when you initialize
> KafkaServer. When I was running tests, the max value was computed over a
> window of 30 seconds, unless I changed the metrics config. So I noted that
> unless we change the config for this specific metric, it will be computed
> over the default window.
> 3. I proposed a 3 hour window because we have (very rarely) seen
> partitions take hours to load. 3 hours was an upper bound for how long a
> load could take. The way max works is that it computes the running max
> until the window has lapsed. Then it starts a new window and forgets the
> max value of the last window. So if a partition takes more than the window
> time to load, there will be one value in that window and the next load will
> be part of a new window. I guess it just depends on how we want it to be
> displayed on the graph. If it's ok for this behavior to happen, the window
> can be shrunk. Regarding the rate metric, I was actually thinking about
> doing this, but I was told that loads don't happen very often. But it is
> true that if the reload happens very often then that may be a problem.
>
> Thanks,
> Anastasia
>
> On Fri, Jun 28, 2019 at 4:27 PM Gwen Shapira <g...@confluent.io> wrote:
>
>> Hey,
>>
>> Thank you for proposing this! Sounds really useful - we have
>> definitely seen some difficult-to-explain pauses in consumer activity
>> and this metric will let us correlate those.
>>
>> Few questions:
>> 1. Did you consider adding both Max and Avg metrics? Many of our
>> metrics have both (batch-size and message-size for example) and it
>> helps put the max value in context.
>> 2. You wrote: "Lengthening or shortening the 3 hour time window is up
>> for discussion (default is 30sec)."  and I'm not sure what default you
>> are referring to?
>> 3. Can you also give some background on why you are proposing 3h? I'm
>> guessing it is because loading the state from the topic happens rarely
>> enough that in 3h it will probably only happen once or not at all?
>> Perhaps we need a rate metric to see how often it actually happens (if
>> we have to reload offsets very often it is a different problem).
>>
>> Gwen
>>
>> On Tue, Jun 25, 2019 at 4:43 PM Anastasia Vela <av...@confluent.io>
>> wrote:
>> >
>> > Hi all,
>> >
>> > I'd like to discuss KIP-484:
>> >
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-484%3A+Expose+metrics+for+group+and+transaction+metadata+loading+duration
>> >
>> > Let me know what you think!
>> >
>> > Thanks,
>> > Anastasia
>>
>>
>>
>> --
>> Gwen Shapira
>> Product Manager | Confluent
>> 650.450.2760 | @gwenshap
>>
>
