Not sure if you want yet another opinion added to the pile -- but since I had a similar problem on another project recently, I thought I'd weigh in. (On that project we were originally using Coda's library, but then switched to rolling our own metrics implementation because we needed to do a few things differently.)
1. Problems we encountered with Coda's library: it uses an exponentially weighted moving average (EWMA) for rates (e.g. messages/sec), and exponentially biased reservoir sampling for histograms (percentiles, averages). Those methods of calculation work well for events with a consistently high volume, but they give strange and misleading results for events that are bursty or rare (e.g. error rates). We found that a fixed-size window gives more predictable, easier-to-interpret results.

2. In defence of Coda's library, I think its histogram implementation is a good trade-off of memory for accuracy; I'm not totally convinced that your proposal (counts of events in a fixed set of buckets) would be much better. We would have to do some math to work out the expected accuracy in each case. The reservoir sampling can be configured to use a smaller sample if the default of 1028 samples is too expensive. Reservoir sampling also has the advantage that you don't need to hard-code a bucket distribution.

3. Quotas are an interesting use case. However, I'm not wild about using a QuotaViolationException for control flow -- I think an explicit conditional would be nicer than having to catch an exception. One question in that context: if a quota is exceeded, do you still want to count the event towards the metric, or do you want to stop counting it until the quota is replenished? The answer may depend on the particular metric.

4. If you decide to go with Coda's library, I would advocate isolating the dependency in a separate module and using it via a facade -- somewhat like using SLF4J instead of Log4j directly. It's fine for Coda's library to be the default metrics implementation, but it should be easy to swap it out for something different in case someone has a version conflict or differing requirements. The facade should be at a low level (individual events), not at the reporter level (which deals with pre-aggregated values and is already pluggable).

5. If it's useful, I can probably contribute my simple (but imho effective) metrics library for embedding into Kafka. It uses reservoir sampling for percentiles, like Coda's library, but uses a fixed-size window instead of an exponential bias, which avoids weird behaviour on bursty metrics. (A rough sketch of the fixed-window idea appears at the end of this message.)

In summary, I would advocate one of the following approaches:

- the Coda Hale library via a facade (allowing it to be swapped for something else), or
- our own metrics implementation, provided that we have confidence in its implementation of percentiles.

Martin

On 22 Feb 2014, at 01:06, Jay Kreps <jay.kr...@gmail.com> wrote:
> Hey guys, > > Just picking up this thread again. I do want to drive a conclusion as I > will run out of work to do on the producer soon and will need to add > metrics of some sort. We can vote on it, but I'm not sure if we actually > got everything discussed. > > Joel, I wasn't fully sure how to interpret your comment. I think you are > saying you are cool with the new metrics package as long as it really is > better. Do you have any comment on whether you think the benefits I > outlined are worth it? I agree with you that we could hold off on a second > repo until someone else would actually want to use our code. > > Jun, I'm not averse to doing a sampling-based histogram and doing some > comparison between the two approaches if you think this approach is > otherwise better. > > Sriram, originally I thought you preferred just sticking to Coda Hale, but > after your follow-up email I wasn't really sure... 
> > Joe/Clark, yes this code allows pluggable reporting so you could have a > metrics reporter that just wraps each metric in a Coda Hale Gauge if that > is useful. Though obviously if enough people were doing that I would think > it would be worth just using the Coda Hale package directly... > > -Jay > > > > > On Thu, Feb 13, 2014 at 3:34 PM, Clark Breyman <cl...@breyman.com> wrote: > >> Not requiring the client to link Coda/Yammer metrics sounds like a >> compelling reason to pivot to new interfaces. If that's the agreed >> direction, I'm hoping that we'd get the choice of backend to provide (e.g. >> facade on Yammer metrics for those with an investment in that) rather than >> force the new backend. Having a metrics factory seems better for this than >> directly instantiating the singleton registry. >> >> >> On Thu, Feb 13, 2014 at 2:39 PM, Joe Stein <joe.st...@stealth.ly> wrote: >> >>> Can we leave metrics and have multiple supported KafkaMetricsGroup >>> implementing a yammer based implementation? >>> >>> ProducerRequestStats with your configured analytics group? >>> >>> On Thu, Feb 13, 2014 at 11:37 AM, Jay Kreps <jay.kr...@gmail.com> wrote: >>> >>>> I think we discussed the scala/java stuff more fully previously. >>>> Essentially the client is embedded everywhere. Scala is very >> incompatible >>>> with itself so this makes it very hard to use for people using anything >>>> else in scala. Also Scala stack traces are very confusing. Basically we >>>> thought plain java code would be a lot easier for people to use. Even >> if >>>> Scala is more fun to write, that isn't really what we are optimizing >> for. >>>> >>>> -Jay >>>> >>>> >>>> On Thu, Feb 13, 2014 at 8:09 AM, S Ahmed <sahmed1...@gmail.com> wrote: >>>> >>>>> Jay, pretty impressive how you just write a 'quick version' like that >>> :) >>>>> Not to get off-topic but why didn't you write this in scala? >>>>> >>>>> >>>>> >>>>> On Wed, Feb 12, 2014 at 6:54 PM, Joel Koshy <jjkosh...@gmail.com> >>> wrote: >>>>> >>>>>> I have not had a chance to review the new metrics code and its >>>>>> features carefully (apart from your write-up), but here are my >>> general >>>>>> thoughts: >>>>>> >>>>>> Implementing a metrics package correctly is difficult; more so for >>>>>> people like me, because I'm not a statistician. However, if this >> new >>>>>> package: {(i) functions correctly (and we need to define and prove >>>>>> correctness), (ii) is easy to use, (iii) serves all our current and >>>>>> anticipated monitoring needs, (iv) is not overly complex that it >>>>>> becomes a burden to maintain and we are better of with an available >>>>>> library;} then I think it makes sense to embed it and use it within >>>>>> the Kafka code. The main wins are: (i) predictability (no changing >>>>>> APIs and intimate knowledge of the code) and (ii) control with >>> respect >>>>>> to both functionality (e.g., there are hard-coded decay constants >> in >>>>>> metrics-core 2.x) and correctness (i.e., if we find a bug in the >>>>>> metrics package we have to submit a pull request and wait for it to >>>>>> become mainstream). I'm not sure it would help very much to pull >> it >>>>>> into a separate repo because that could potentially annul these >>>>>> benefits. >>>>>> >>>>>> Joel >>>>>> >>>>>> On Wed, Feb 12, 2014 at 02:50:43PM -0800, Jay Kreps wrote: >>>>>>> Sriram, >>>>>>> >>>>>>> Makes sense. I am cool moving this stuff into its own repo if >>> people >>>>>> think >>>>>>> that is better. 
I'm not sure it would get much contribution but >>> when >>>> I >>>>>>> started messing with this I did have a lot of grand ideas of >> making >>>>>> adding >>>>>>> metrics to a sensor dynamic so you could add more stuff in >>>>> real-time(via >>>>>>> jmx, say) and/or externalize all your metrics and config to a >>>> separate >>>>>> file >>>>>>> like log4j with only the points of instrumentation hard-coded. >>>>>>> >>>>>>> -Jay >>>>>>> >>>>>>> >>>>>>> On Wed, Feb 12, 2014 at 2:07 PM, Sriram Subramanian < >>>>>>> srsubraman...@linkedin.com> wrote: >>>>>>> >>>>>>>> I am actually neutral to this change. I found the replies were >>> more >>>>>>>> towards the implementation and features so far. I would like >> the >>>>>> community >>>>>>>> to think about the questions below before making a decision. My >>>>>> opinion on >>>>>>>> this is that it has potential to be its own project and it >> would >>>>>> attract >>>>>>>> developers who are specifically interested in contributing to >>>>> metrics. >>>>>> I >>>>>>>> am skeptical that the Kafka contributors would focus on >> improving >>>>> this >>>>>>>> library (apart from bug fixes) instead of >> developing/contributing >>>> to >>>>>> other >>>>>>>> core pieces. It would be useful to continue and keep it >> decoupled >>>>> from >>>>>>>> rest of Kafka (if it resides in the Kafka code base.) so that >> we >>>> can >>>>>> move >>>>>>>> it out anytime to its own project. >>>>>>>> >>>>>>>> >>>>>>>> On 2/12/14 1:21 PM, "Jay Kreps" <jay.kr...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Hey Sriram, >>>>>>>>> >>>>>>>>> Not sure if these are actually meant as questions or more >> veiled >>>>>> comments. >>>>>>>>> In an case I tried to give my 2 cents inline. >>>>>>>>> >>>>>>>>> On Tue, Feb 11, 2014 at 11:12 PM, Sriram Subramanian < >>>>>>>>> srsubraman...@linkedin.com> wrote: >>>>>>>>> >>>>>>>>>> I think answering the questions below would help to make a >>>> better >>>>>>>>>> decision. I am all for writing better code and having >> superior >>>>>>>>>> functionalities but it is worth thinking about stuff outside >>>> just >>>>>> code >>>>>>>>>> in >>>>>>>>>> this case - >>>>>>>>>> >>>>>>>>>> 1. Does metric form a core piece of kafka? Does it help >> kafka >>>>>> greatly in >>>>>>>>>> providing better core functionalities? I would always like a >>>>>> project to >>>>>>>>>> do >>>>>>>>>> one thing really well. Metrics is a non trivial amount of >>> code. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Metrics are obviously important, and obviously improving our >>>> metrics >>>>>>>>> system >>>>>>>>> would be good. That said this may or may not be better, and >> even >>>> if >>>>>> it is >>>>>>>>> better that betterness might not outweigh other >> considerations. >>>> That >>>>>> is >>>>>>>>> what we are discussing. >>>>>>>>> >>>>>>>>> >>>>>>>>>> 2. Does it make sense to be part of Kafka or its own >> project? >>> If >>>>>> this >>>>>>>>>> metrics library has the potential to be better than >>>> metrics-core, >>>>> I >>>>>>>>>> would >>>>>>>>>> be interested in other projects take advantage of it. >>>>>>>>>> >>>>>>>>> >>>>>>>>> It could be either. >>>>>>>>> >>>>>>>>> 3. Can Kafka maintain this library as new members join and old >>>>> members >>>>>>>>>> leave? Would this be a piece of code that no one (in Kafka) >> in >>>> the >>>>>>>>>> future >>>>>>>>>> spends time improving if the original author left? >>>>>>>>>> >>>>>>>>> >>>>>>>>> I am not going anywhere in the near term, but if I did, yes, >>> this >>>>>> would be >>>>>>>>> like any other code we have. 
As with yammer metrics or any >> other >>>>> code >>>>>> at >>>>>>>>> that point we would either use it as is or someone would >> improve >>>> it. >>>>>>>>> >>>>>>>>> >>>>>>>>>> 4. Does it affect the schedule of producer rewrite? This >> needs >>>> its >>>>>> own >>>>>>>>>> stabilization and modification to existing metric dashboards >>> if >>>>> the >>>>>>>>>> format >>>>>>>>>> is changed. Many times such cost are not factored in and a >>>> project >>>>>> loses >>>>>>>>>> time before realizing the extra time required to make a >>> library >>>> as >>>>>> this >>>>>>>>>> operational. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Probably not. The metrics are going to change regardless of >>>> whether >>>>>> we use >>>>>>>>> the same library or not. If we think this is better I don't >> mind >>>>>> putting >>>>>>>>> in >>>>>>>>> a little extra effort to get there. >>>>>>>>> >>>>>>>>> Irrespective I think this is probably not the right thing to >>>>> optimize >>>>>> for. >>>>>>>>> >>>>>>>>> >>>>>>>>>> I am sure we can do better when we write code to a specific >>> use >>>>>> case (in >>>>>>>>>> this case, kafka) rather than building a generic library >> that >>>>> suits >>>>>> all >>>>>>>>>> (metrics-core) but I would like us to have answers to the >>>>> questions >>>>>>>>>> above >>>>>>>>>> and be prepared before we proceed to support this with the >>>>> producer >>>>>>>>>> rewrite. >>>>>>>>> >>>>>>>>> >>>>>>>>> Naturally we are all considering exactly these things, that is >>>>>> exactly the >>>>>>>>> reason I started the thread. >>>>>>>>> >>>>>>>>> -Jay >>>>>>>>> >>>>>>>>> >>>>>>>>>> On 2/11/14 6:28 PM, "Jun Rao" <jun...@gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Thanks for the detailed write-up. It's well thought >> through. >>> A >>>>> few >>>>>>>>>>> comments: >>>>>>>>>>> >>>>>>>>>>> 1. I have a couple of concerns on the percentiles. The >> first >>>>> issue >>>>>> is >>>>>>>>>> that >>>>>>>>>>> It requires the user to know the value range. Since the >> range >>>> for >>>>>>>>>> things >>>>>>>>>>> like message size (in millions) is quite different from >> those >>>>> like >>>>>>>>>> request >>>>>>>>>>> time (less than 100), it's going to be hard to pick a good >>>> global >>>>>>>>>> default >>>>>>>>>>> range. Different apps could be dealing with different >> message >>>>>> size. So >>>>>>>>>>> they >>>>>>>>>>> probably will have to customize the range. Another issue is >>>> that >>>>>> it can >>>>>>>>>>> only report values at the bucket boundaries. So, if you >> have >>>> 1000 >>>>>>>>>> buckets >>>>>>>>>>> and a value range of 1 million, you will only see 1000 >>> possible >>>>>> values >>>>>>>>>> as >>>>>>>>>>> the quantile, which is probably too sparse. The >>> implementation >>>> of >>>>>>>>>>> histogram >>>>>>>>>>> in metrics-core keeps a fix size of samples, which avoids >>> both >>>>>> issues. >>>>>>>>>>> >>>>>>>>>>> 2. We need to document the 3-part metrics names better >> since >>>> it's >>>>>> not >>>>>>>>>>> obvious what the convention is. Also, currently the name of >>> the >>>>>> sensor >>>>>>>>>> and >>>>>>>>>>> the metrics defined in it are independent. Would it make >>> sense >>>> to >>>>>> have >>>>>>>>>> the >>>>>>>>>>> sensor name be a prefix of the metric name? >>>>>>>>>>> >>>>>>>>>>> Overall, this approach seems to be cleaner than >> metrics-core >>> by >>>>>>>>>> decoupling >>>>>>>>>>> measuring and reporting. The main benefit of metrics-core >>> seems >>>>> to >>>>>> be >>>>>>>>>> the >>>>>>>>>>> existing reporters. 
Since not that many people voted for >>>>>> metrics-core, >>>>>>>>>> I >>>>>>>>>>> am >>>>>>>>>>> ok with going with the new implementation. My only >>>> recommendation >>>>>> is to >>>>>>>>>>> address the concern on percentiles. >>>>>>>>>>> >>>>>>>>>>> Thanks, >>>>>>>>>>> >>>>>>>>>>> Jun >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Feb 6, 2014 at 12:51 PM, Jay Kreps < >>>> jay.kr...@gmail.com> >>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hey guys, >>>>>>>>>>>> >>>>>>>>>>>> I wanted to kick off a quick discussion of metrics with >>>> respect >>>>>> to >>>>>>>>>> the >>>>>>>>>>>> new >>>>>>>>>>>> producer and consumer (and potentially the server). >>>>>>>>>>>> >>>>>>>>>>>> At a high level I think there are three approaches we >> could >>>>> take: >>>>>>>>>>>> 1. Plain vanilla JMX >>>>>>>>>>>> 2. Use Coda Hale (AKA Yammer) Metrics >>>>>>>>>>>> 3. Do our own metrics (with JMX as one output) >>>>>>>>>>>> >>>>>>>>>>>> 1. Has the advantage that JMX is the most commonly used >>> java >>>>>> thing >>>>>>>>>> and >>>>>>>>>>>> plugs in reasonably to most metrics systems. JMX is >>> included >>>> in >>>>>> the >>>>>>>>>> JDK >>>>>>>>>>>> so >>>>>>>>>>>> it doesn't impose any additional dependencies on clients. >>> It >>>>> has >>>>>> the >>>>>>>>>>>> disadvantage that plain vanilla JMX is a pain to use. We >>>> would >>>>>> need a >>>>>>>>>>>> bunch >>>>>>>>>>>> of helper code for maintaining counters to make this >>>>> reasonable. >>>>>>>>>>>> >>>>>>>>>>>> 2. Coda Hale metrics is pretty good and broadly used. It >>>>>> supports JMX >>>>>>>>>>>> output as well as direct output to many other types of >>>> systems. >>>>>> The >>>>>>>>>>>> primary >>>>>>>>>>>> downside we have had with Coda Hale has to do with the >>>> clients >>>>>> and >>>>>>>>>>>> library >>>>>>>>>>>> incompatibilities. We are currently on an older more >>> popular >>>>>> version. >>>>>>>>>>>> The >>>>>>>>>>>> newer version is a rewrite of the APIs and is >> incompatible. >>>>>>>>>> Originally >>>>>>>>>>>> these were totally incompatible and people had to choose >>> one >>>> or >>>>>> the >>>>>>>>>>>> other. >>>>>>>>>>>> I think that has been improved so now the new version is >> a >>>>>> totally >>>>>>>>>>>> different package. But even in this case you end up with >>> both >>>>>>>>>> versions >>>>>>>>>>>> if >>>>>>>>>>>> you use Kafka and we are on a different version than you >>>> which >>>>> is >>>>>>>>>> going >>>>>>>>>>>> to >>>>>>>>>>>> be pretty inconvenient. >>>>>>>>>>>> >>>>>>>>>>>> 3. Doing our own has the downside of potentially >>> reinventing >>>>> the >>>>>>>>>> wheel, >>>>>>>>>>>> and >>>>>>>>>>>> potentially needing to work out any bugs in our code. The >>>>> upsides >>>>>>>>>> would >>>>>>>>>>>> depend on the how good the reinvention was. As it >> happens I >>>>> did a >>>>>>>>>> quick >>>>>>>>>>>> (~900 loc) version of a metrics library that is under >>>>>>>>>>>> kafka.common.metrics. >>>>>>>>>>>> I think it has some advantages over the Yammer metrics >>>> package >>>>>> for >>>>>>>>>> our >>>>>>>>>>>> usage beyond just not causing incompatibilities. I will >>>>> describe >>>>>> this >>>>>>>>>>>> code >>>>>>>>>>>> so we can discuss the pros and cons. Although I favor >> this >>>>>> approach I >>>>>>>>>>>> have >>>>>>>>>>>> no emotional attachment and wouldn't be too sad if I >> ended >>> up >>>>>>>>>> deleting >>>>>>>>>>>> it. 
>>>>>>>>>>>> Here are javadocs for this code, though I haven't written >>>> much >>>>>>>>>>>> documentation yet since I might end up deleting it: >>>>>>>>>>>> >>>>>>>>>>>> Here is a quick overview of this library. >>>>>>>>>>>> >>>>>>>>>>>> There are three main public interfaces: >>>>>>>>>>>> Metrics - This is a repository of metrics being >> tracked. >>>>>>>>>>>> Metric - A single, named numerical value being measured >>>>> (i.e. a >>>>>>>>>>>> counter). >>>>>>>>>>>> Sensor - This is a thing that records values and >> updates >>>> zero >>>>>> or >>>>>>>>>> more >>>>>>>>>>>> metrics >>>>>>>>>>>> >>>>>>>>>>>> So let's say we want to track three values about message >>>> sizes; >>>>>>>>>>>> specifically say we want to record the average, the >>> maximum, >>>>> the >>>>>>>>>> total >>>>>>>>>>>> rate >>>>>>>>>>>> of bytes being sent, and a count of messages. Then we >> would >>>> do >>>>>>>>>> something >>>>>>>>>>>> like this: >>>>>>>>>>>> >>>>>>>>>>>> // setup code >>>>>>>>>>>> Metrics metrics = new Metrics(); // this is a global >>>>>> "singleton" >>>>>>>>>>>> Sensor sensor = >>>>>> metrics.sensor("kafka.producer.message.sizes"); >>>>>>>>>>>> sensor.add("kafka.producer.message-size.avg", new >>> Avg()); >>>>>>>>>>>> sensor.add("kafka.producer.message-size.max", new >>> Max()); >>>>>>>>>>>> sensor.add("kafka.producer.bytes-sent-per-sec", new >>>> Rate()); >>>>>>>>>>>> sensor.add("kafka.producer.message-count", new >> Count()); >>>>>>>>>>>> >>>>>>>>>>>> // now when we get a message we do this >>>>>>>>>>>> sensor.record(messageSize); >>>>>>>>>>>> >>>>>>>>>>>> The above code creates the global metrics repository, >>>> creates a >>>>>>>>>> single >>>>>>>>>>>> Sensor, and defines 5 named metrics that are updated by >>> that >>>>>> Sensor. >>>>>>>>>>>> >>>>>>>>>>>> Like Yammer Metrics (YM) I allow you to plug in >>> "reporters", >>>>>>>>>> including a >>>>>>>>>>>> JMX reporter. Unlike the Coda Hale JMX reporter the >>> reporter >>>> I >>>>>> have >>>>>>>>>> keys >>>>>>>>>>>> off the metric names not the Sensor names, which I think >> is >>>> an >>>>>>>>>>>> improvement--I just use the convention that the last >>> portion >>>> of >>>>>> the >>>>>>>>>>>> name is >>>>>>>>>>>> the attribute name, the second to last is the mbean name, >>> and >>>>> the >>>>>>>>>> rest >>>>>>>>>>>> is >>>>>>>>>>>> the package. So in the above example there is a producer >>>> mbean >>>>>> that >>>>>>>>>> has >>>>>>>>>>>> a >>>>>>>>>>>> avg and max attribute and a producer mbean that has a >>>>>>>>>> bytes-sent-per-sec >>>>>>>>>>>> and message-count attribute. This is nice because you can >>>>>> logically >>>>>>>>>>>> group >>>>>>>>>>>> the values reported irrespective of where in the program >>> they >>>>> are >>>>>>>>>>>> computed--that is an mbean can logically group attributes >>>>>> computed >>>>>>>>>> off >>>>>>>>>>>> different sensors. This means you can report values by >>>> logical >>>>>>>>>>>> subsystem. >>>>>>>>>>>> >>>>>>>>>>>> I also allow the concept of hierarchical Sensors which I >>>> think >>>>>> is a >>>>>>>>>> good >>>>>>>>>>>> convenience. I have noticed a common pattern in systems >>> where >>>>> you >>>>>>>>>> need >>>>>>>>>>>> to >>>>>>>>>>>> roll up the same values along different dimensions. An >>> simple >>>>>>>>>> example is >>>>>>>>>>>> metrics about qps, data rate, etc on the broker. These we >>>> want >>>>> to >>>>>>>>>>>> capture >>>>>>>>>>>> in aggregate, but also broken down by topic-id. 
You can >> do >>>> this >>>>>>>>>> purely >>>>>>>>>>>> by >>>>>>>>>>>> defining the sensor hierarchy: >>>>>>>>>>>> Sensor allSizes = metrics.sensor("kafka.producer.sizes"); >>>>>>>>>>>> Sensor topicSizes = metrics.sensor("kafka.producer." + >>> topic >>>> + >>>>>>>>>>>> ".sizes", >>>>>>>>>>>> allSizes); >>>>>>>>>>>> Now each actual update will go to the appropriate >>> topicSizes >>>>>> sensor >>>>>>>>>>>> (based >>>>>>>>>>>> on the topic name), but allSizes metrics will get updated >>>> too. >>>>> I >>>>>> also >>>>>>>>>>>> support multiple parents for each sensor as well as >>> multiple >>>>>> layers >>>>>>>>>> of >>>>>>>>>>>> hiearchy, so you can define a more elaborate DAG of >>> sensors. >>>> An >>>>>>>>>> example >>>>>>>>>>>> of >>>>>>>>>>>> how this would be useful is if you wanted to record your >>>>> metrics >>>>>>>>>> broken >>>>>>>>>>>> down by topic AND client id as well as the global >>> aggregate. >>>>>>>>>>>> >>>>>>>>>>>> Each metric can take a configurable Quota value which >>> allows >>>> us >>>>>> to >>>>>>>>>> limit >>>>>>>>>>>> the maximum value of that sensor. This is intended for >> use >>> on >>>>> the >>>>>>>>>>>> server as >>>>>>>>>>>> part of our Quota implementation. The way this works is >>> that >>>>> you >>>>>>>>>> record >>>>>>>>>>>> metrics as usual: >>>>>>>>>>>> mySensor.record(42.0) >>>>>>>>>>>> However if this event occurance causes one of the metrics >>> to >>>>>> exceed >>>>>>>>>> its >>>>>>>>>>>> maximum allowable value (the quota) this call will throw >> a >>>>>>>>>>>> QuotaViolationException. The cool thing about this is >> that >>> it >>>>>> means >>>>>>>>>> we >>>>>>>>>>>> can >>>>>>>>>>>> define quotas on anything we capture metrics for, which I >>>> think >>>>>> is >>>>>>>>>>>> pretty >>>>>>>>>>>> cool. >>>>>>>>>>>> >>>>>>>>>>>> Another question is how to handle windowing of the >> values? >>>>>> Metrics >>>>>>>>>> want >>>>>>>>>>>> to >>>>>>>>>>>> record the "current" value, but the definition of current >>> is >>>>>>>>>> inherently >>>>>>>>>>>> nebulous. A few of the obvious gotchas are that if you >>> define >>>>>>>>>> "current" >>>>>>>>>>>> to >>>>>>>>>>>> be a number of events you can end up measuring an >>> arbitrarily >>>>>> long >>>>>>>>>>>> window >>>>>>>>>>>> of time if the event rate is low (e.g. you think you are >>>>> getting >>>>>> 50 >>>>>>>>>>>> messages/sec because that was the rate yesterday when all >>>>> events >>>>>>>>>>>> topped). >>>>>>>>>>>> >>>>>>>>>>>> Here is how I approach this. All the metrics use the same >>>>>> windowing >>>>>>>>>>>> approach. We define a single window by a length of time >> or >>>>>> number of >>>>>>>>>>>> values >>>>>>>>>>>> (you can use either or both--if both the window ends when >>>>>> *either* >>>>>>>>>> the >>>>>>>>>>>> time >>>>>>>>>>>> bound or event bound is hit). The typical problem with >> hard >>>>>> window >>>>>>>>>>>> boundaries is that at the beginning of the window you >> have >>> no >>>>>> data >>>>>>>>>> and >>>>>>>>>>>> the >>>>>>>>>>>> first few samples are too small to be a valid sample. >>>> (Consider >>>>>> if >>>>>>>>>> you >>>>>>>>>>>> were >>>>>>>>>>>> keeping an avg and the first value in the window happens >> to >>>> be >>>>>> very >>>>>>>>>> very >>>>>>>>>>>> high, if you check the avg at this exact time you will >>>> conclude >>>>>> the >>>>>>>>>> avg >>>>>>>>>>>> is >>>>>>>>>>>> very high but on a sample size of one). 
One simple fix >>> would >>>> be >>>>>> to >>>>>>>>>>>> always >>>>>>>>>>>> report the last complete window, however this is not >>>>> appropriate >>>>>> here >>>>>>>>>>>> because (1) we want to drive quotas off it so it needs to >>> be >>>>>> current, >>>>>>>>>>>> and >>>>>>>>>>>> (2) since this is for monitoring you kind of care more >>> about >>>>> the >>>>>>>>>> current >>>>>>>>>>>> state. The ideal solution here would be to define a >>> backwards >>>>>> looking >>>>>>>>>>>> sliding window from the present, but many statistics are >>>>> actually >>>>>>>>>> very >>>>>>>>>>>> hard >>>>>>>>>>>> to compute in this model without retaining all the values >>>> which >>>>>>>>>> would be >>>>>>>>>>>> hopelessly inefficient. My solution to this is to keep a >>>>>> configurable >>>>>>>>>>>> number of windows (default is two) and combine them for >> the >>>>>> estimate. >>>>>>>>>>>> So in >>>>>>>>>>>> a two sample case depending on when you ask you have >>> between >>>>> one >>>>>> and >>>>>>>>>> two >>>>>>>>>>>> complete samples worth of data to base the answer off of. >>>>>> Provided >>>>>>>>>> the >>>>>>>>>>>> sample window is large enough to get a valid result this >>>>>> satisfies >>>>>>>>>> both >>>>>>>>>>>> of >>>>>>>>>>>> my criteria of incorporating the most recent data and >>> having >>>>>>>>>> reasonable >>>>>>>>>>>> variance at all times. >>>>>>>>>>>> >>>>>>>>>>>> Another approach is to use an exponential weighting >> scheme >>> to >>>>>> combine >>>>>>>>>>>> all >>>>>>>>>>>> history but emphasize the recent past. I have not done >> this >>>> as >>>>> it >>>>>>>>>> has a >>>>>>>>>>>> lot >>>>>>>>>>>> of issues for practical operational metrics. I'd be happy >>> to >>>>>>>>>> elaborate >>>>>>>>>>>> on >>>>>>>>>>>> this if anyone cares... >>>>>>>>>>>> >>>>>>>>>>>> The window size for metrics has a global default which >> can >>> be >>>>>>>>>>>> overridden at >>>>>>>>>>>> either the sensor or individual metric level. >>>>>>>>>>>> >>>>>>>>>>>> In addition to these time series values the user can >>> directly >>>>>> expose >>>>>>>>>>>> some >>>>>>>>>>>> method of their choosing JMX-style by implementing the >>>>> Measurable >>>>>>>>>>>> interface >>>>>>>>>>>> and registering that value. E.g. >>>>>>>>>>>> metrics.addMetric("my.metric", new Measurable() { >>>>>>>>>>>> public double measure(MetricConfg config, long now) { >>>>>>>>>>>> return this.calculateValueToExpose(); >>>>>>>>>>>> } >>>>>>>>>>>> }); >>>>>>>>>>>> This is useful for exposing things like the accumulator >>> free >>>>>> memory. >>>>>>>>>>>> >>>>>>>>>>>> The set of metrics is extensible, new metrics can be >> added >>> by >>>>>> just >>>>>>>>>>>> implementing the appropriate interfaces and registering >>> with >>>> a >>>>>>>>>> sensor. I >>>>>>>>>>>> implement the following metrics: >>>>>>>>>>>> total - the sum of all values from the given sensor >>>>>>>>>>>> count - a windowed count of values from the sensor >>>>>>>>>>>> avg - the sample average within the windows >>>>>>>>>>>> max - the max over the windows >>>>>>>>>>>> min - the min over the windows >>>>>>>>>>>> rate - the rate in the windows (e.g. the total or count >>>>>> divided by >>>>>>>>>> the >>>>>>>>>>>> ellapsed time) >>>>>>>>>>>> percentiles - a collection of percentiles computed over >>> the >>>>>> window >>>>>>>>>>>> >>>>>>>>>>>> My approach to percentiles is a little different from the >>>>> yammer >>>>>>>>>> metrics >>>>>>>>>>>> package. 
My complaint about the yammer metrics approach >> is >>>> that >>>>>> it >>>>>>>>>> uses >>>>>>>>>>>> rather expensive sampling and uses kind of a lot of >> memory >>> to >>>>>> get a >>>>>>>>>>>> reasonable sample. This is problematic for per-topic >>>>>> measurements. >>>>>>>>>>>> >>>>>>>>>>>> Instead I use a fixed range for the histogram (e.g. 0.0 >> to >>>>>> 30000.0) >>>>>>>>>>>> which >>>>>>>>>>>> directly allows you to specify the desired memory use. >> Any >>>>> value >>>>>>>>>> below >>>>>>>>>>>> the >>>>>>>>>>>> minimum is recorded as -Infinity and any value above the >>>>> maximum >>>>>> as >>>>>>>>>>>> +Infinity. I think this is okay as all metrics have an >>>> expected >>>>>> range >>>>>>>>>>>> except for latency which can be arbitrarily large, but >> for >>>> very >>>>>> high >>>>>>>>>>>> latency there is no need to model it exactly (e.g. 30 >>>> seconds + >>>>>>>>>> really >>>>>>>>>>>> is >>>>>>>>>>>> effectively infinite). Within the range values are >> recorded >>>> in >>>>>>>>>> buckets >>>>>>>>>>>> which can be either fixed width or increasing width. The >>>>>> increasing >>>>>>>>>>>> width >>>>>>>>>>>> is analogous to the idea of significant figures, that is >> if >>>>> your >>>>>>>>>> value >>>>>>>>>>>> is >>>>>>>>>>>> in the range 0-10 you might want to be accurate to within >>>> 1ms, >>>>>> but if >>>>>>>>>>>> it is >>>>>>>>>>>> 20000 there is no need to be so accurate. I implemented a >>>>> linear >>>>>>>>>> bucket >>>>>>>>>>>> size where the Nth bucket has width proportional to N. An >>>>>> exponential >>>>>>>>>>>> bucket size would also be sensible and could likely be >>>> derived >>>>>>>>>> directly >>>>>>>>>>>> from the floating point representation of a the value. >>>>>>>>>>>> >>>>>>>>>>>> I'd like to get some feedback on this metrics code and >>> make a >>>>>>>>>> decision >>>>>>>>>>>> on >>>>>>>>>>>> whether we want to use it before I actually go ahead and >>> add >>>>> all >>>>>> the >>>>>>>>>>>> instrumentation in the code (otherwise I'll have to redo >> it >>>> if >>>>> we >>>>>>>>>> switch >>>>>>>>>>>> approaches). So the next topic of discussion will be >> which >>>>> actual >>>>>>>>>>>> metrics >>>>>>>>>>>> to add. >>>>>>>>>>>> >>>>>>>>>>>> -Jay >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> >>
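P.S. To make points 1, 3 and 5 a bit more concrete, here is a rough sketch of what I mean by a fixed-size window. This is illustrative only -- it is not the actual library I mentioned, and all the names (WindowedRate, maxErrorsPerSec, etc.) are made up for the example. The idea is simply to keep the current window plus the last complete one, so a burst of events shows up immediately and then ages out after a bounded time, rather than decaying exponentially.

    // Illustrative sketch only: a rate measured over the current (partial)
    // window plus the last complete window of fixed length, instead of an EWMA.
    public class WindowedRate {
        private final long windowMs;   // length of one window in milliseconds
        private long windowStart;      // start timestamp of the current window
        private double current;        // total recorded in the current window
        private double previous;       // total from the last complete window

        public WindowedRate(long windowMs, long now) {
            this.windowMs = windowMs;
            this.windowStart = now;
        }

        public synchronized void record(double value, long now) {
            roll(now);
            current += value;
        }

        // events per second over the current window plus the previous complete window
        public synchronized double rate(long now) {
            roll(now);
            double elapsedSec = ((now - windowStart) + windowMs) / 1000.0;
            return (current + previous) / elapsedSec;
        }

        // advance the window boundary; old data falls out after two windows
        private void roll(long now) {
            while (now - windowStart >= windowMs) {
                previous = current;
                current = 0.0;
                windowStart += windowMs;
            }
        }
    }

And on the quota question (point 3), I'd rather check the value explicitly than catch an exception thrown by record(), along these lines:

    WindowedRate errorRate = new WindowedRate(30000, System.currentTimeMillis());
    double maxErrorsPerSec = 5.0;  // made-up quota for the example
    // ... on each error:
    long now = System.currentTimeMillis();
    errorRate.record(1.0, now);
    if (errorRate.rate(now) > maxErrorsPerSec) {
        // throttle or reject the request here, instead of catching
        // a QuotaViolationException inside the metrics code
    }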