Re: Review Request 34170: Patch for KAFKA-2191

Jay Kreps Wed, 13 May 2015 17:39:57 -0700


> On May 13, 2015, 5:14 p.m., Jay Kreps wrote:
> > clients/src/main/java/org/apache/kafka/common/metrics/stats/Rate.java, line 
> > 62
> > <https://reviews.apache.org/r/34170/diff/1/?file=958215#file958215line62>
> >
> >     Is this actually right? I agree you'll get discontinuities as the 
> > measured time shrinks to zero but why is giving back 0 the right answer? In 
> > the case you guys were testing 0 was safe, but imagine a case where the 
> > monitoring was checking that the value didn't fall below some threshold.
> 
> Dong Lin wrote:
>     The question is, when Rate.measure() is called right after 
> Rate.record(n), what should be the return value? I think there are two 
> possibilities: 0 and n/config.timeWindowMs(). I didn't find any use case 
> where these two values make a difference. 
>     
>     Which value do you think is the best?
>     
>     Thank you.
> 
> Aditya Auradkar wrote:
>     I think returning 0 is reasonable if no time has elapsed (technically).
>     As an alternate solution, what if we assumed that "elapsed" is always 1 
> (at least). For example:
>     
>     double elapsed = convert(now - stat.oldest(now).lastWindowMs) + 1
>     
>     In case of seconds, this basically means that you assume the current 
> second is always complete. This is only a problem (for a couple of seconds) 
> when all previous samples have zero activity or when the server is just 
> starting up.

Actually there is a fundamental issue with the computation that patching around 
the 0 case doesn't fix. That is the instability of the estimate. This is the 
"taking a poll with sample size one" problem.

Even if you patch the 0 case you still get a bad answer 1 ms later. That is, 
let's say you get a single 50k request and your quota is 1MB/sec. Currently at 
0ms we estimate infinity which is in fact the measured rate but obviously not a 
good estimate. But even 1 ms later the estimate is bad. 50k*1000ms = ~50MB/sec.

This is somewhat rare because it only happens when there is just one sample.

They key observation is that if a sample is missing, nothing happened in that 
time period. But the calculation should still use that time period.

So the right way to compute it, I think, is actually
  ellapsed = (num_samples-1)*window_size + (now - current_sample.begin)

For safety I think we should also require the number of samples to be >= 2 and 
default it to 3.

- Jay

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/34170/#review83629
-----------------------------------------------------------

On May 13, 2015, 10:32 p.m., Dong Lin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/34170/
> -----------------------------------------------------------
> 
> (Updated May 13, 2015, 10:32 p.m.)
> 
> 
> Review request for kafka.
> 
> 
> Bugs: KAFKA-2191
>     https://issues.apache.org/jira/browse/KAFKA-2191
> 
> 
> Repository: kafka
> 
> 
> Description
> -------
> 
> KAFKA-2191; Measured rate should not be infinite
> 
> 
> Diffs
> -----
> 
>   clients/src/main/java/org/apache/kafka/common/metrics/stats/Rate.java 
> 98429da34418f7f1deba1b5e44e2e6025212edb3 
>   
> clients/src/main/java/org/apache/kafka/common/metrics/stats/SampledStat.java 
> b341b7daaa10204906d78b812fb05fd27bc69373 
> 
> Diff: https://reviews.apache.org/r/34170/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Dong Lin
> 
>

Re: Review Request 34170: Patch for KAFKA-2191

Reply via email to