Jay - I was thinking of a pure stub rather than just wrapping Kafka metrics
in a Coda gauge. I'd like the Timers, Meters, etc. to still be Coda meters -
that way the windows, exponential decays, etc. are comparable to the rest of
the Coda metrics in our applications. At the same time, I don't want to
force Coda timers (or any other timers) on an app that won't make good use
of them.

Thanks again, C


On Sat, Feb 22, 2014 at 9:25 AM, Martin Kleppmann
<mkleppm...@linkedin.com> wrote:

> Not sure if you want yet another opinion added to the pile -- but since I
> had a similar problem on another project recently, I thought I'd weigh in.
> (On that project we were originally using Coda's library, but then switched
> to rolling our own metrics implementation because we needed to do a few
> things differently.)
>
> 1. Problems we encountered with Coda's library: it uses an
> exponentially-weighted moving average (EWMA) for rates (e.g. messages/sec),
> and exponentially biased reservoir sampling for histograms (percentiles,
> averages). Those methods of calculation work well for events with a
> consistently high volume, but they give strange and misleading results for
> events that are bursty or rare (e.g. error rates). We found that a fixed-size
> window gives more predictable, easier-to-interpret results.
>
> 2. In defence of Coda's library, I think its histogram implementation is a
> good trade-off of memory for accuracy; I'm not totally convinced that your
> proposal (counts of events in a fixed set of buckets) would be much better.
> Would have to do some math to work out the expected accuracy in each case.
> The reservoir sampling can be configured to use a smaller sample if the
> default of 1028 samples is too expensive. Reservoir sampling also has the
> advantage that you don't need to hard-code a bucket distribution.
>
> 3. Quotas are an interesting use case. However, I'm not wild about using a
> QuotaViolationException for control flow -- I think an explicit conditional
> would be nicer than having to catch an exception. One question in that
> context: if a quota is exceeded, do you still want to count the event
> towards the metric, or do you want to stop counting it until the quota is
> replenished? The answer may depend on the particular metric.
>
> 4. If you decide to go with Coda's library, I would advocate isolating the
> dependency into a separate module and using it via a facade -- somewhat
> like using SLF4J instead of Log4j directly. It's ok for Coda's library to
> be the default metrics implementation, but it should be easy to swap it out
> for something different in case someone has a version conflict or differing
> requirements. The facade should be at a low level (individual events), not
> at the reporter level (which deals with pre-aggregated values, and is
> already pluggable).
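>
> As a rough sketch of such a low-level facade (all names here are
> hypothetical, not an existing API in either library), the Coda Hale binding
> would live in its own module behind something like:
>
>     // Hypothetical facade: instrumentation code only ever sees these types.
>     public interface EventSensor {
>         void record(double value);  // one raw event; aggregation is backend-specific
>     }
>
>     public interface MetricsFacade {
>         EventSensor sensor(String name);
>     }
>
>     // One possible default binding, delegating to Coda Hale metrics-core 2.x.
>     public class CodaHaleEventSensor implements EventSensor {
>         private final com.yammer.metrics.core.Histogram histogram;
>
>         public CodaHaleEventSensor(com.yammer.metrics.core.Histogram histogram) {
>             this.histogram = histogram;
>         }
>
>         @Override
>         public void record(double value) {
>             histogram.update((long) value);
>         }
>     }
>
> Swapping the backend then just means supplying a different EventSensor
> implementation; the instrumentation points themselves never change.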
>
> 5. If it's useful, I can probably contribute my simple (but imho
> effective) metrics library, for embedding into Kafka. It uses reservoir
> sampling for percentiles, like Coda's library, but uses a fixed-size window
> instead of an exponential bias, which avoids weird behaviour on bursty
> metrics.
>
> In summary, I would advocate one of the following approaches:
> - Coda Hale library via facade (allowing it to be swapped for something
> else), or
> - Own metrics implementation, provided that we have confidence in its
> implementation of percentiles.
>
> Martin
>
>
> On 22 Feb 2014, at 01:06, Jay Kreps <jay.kr...@gmail.com> wrote:
> > Hey guys,
> >
> > Just picking up this thread again. I do want to drive a conclusion as I
> > will run out of work to do on the producer soon and will need to add
> > metrics of some sort. We can vote on it, but I'm not sure if we actually
> > got everything discussed.
> >
> > Joel, I wasn't fully sure how to interpret your comment. I think you are
> > saying you are cool with the new metrics package as long as it really is
> > better. Do you have any comment on whether you think the benefits I
> > outlined are worth it? I agree with you that we could hold off on a
> second
> > repo until someone else would actually want to use our code.
> >
> > Jun, I'm not averse to doing a sampling-based histogram and doing some
> > comparison between the two approaches if you think this approach is
> > otherwise better.
> >
> > Sriram, originally I thought you preferred just sticking to Coda Hale,
> but
> > after your follow-up email I wasn't really sure...
> >
> > Joe/Clark, yes this code allows pluggable reporting so you could have a
> > metrics reporter that just wraps each metric in a Coda Hale Gauge if that
> > is useful. Though obviously if enough people were doing that I would
> think
> > it would be worth just using the Coda Hale package directly...
> >
> > -Jay
> >
> >
> >
> >
> > On Thu, Feb 13, 2014 at 3:34 PM, Clark Breyman <cl...@breyman.com>
> wrote:
> >
> >> Not requiring the client to link Coda/Yammer metrics sounds like a
> >> compelling reason to pivot to new interfaces. If that's the agreed
> >> direction, I'm hoping that we'd get the choice of backend to provide
> (e.g.
> >> facade on Yammer metrics for those with an investment in that) rather
> than
> >> force the new backend.  Having a metrics factory seems better for this
> than
> >> directly instantiating the singleton registry.
> >>
> >>
> >> On Thu, Feb 13, 2014 at 2:39 PM, Joe Stein <joe.st...@stealth.ly>
> wrote:
> >>
> >>> Can we leave metrics and have multiple supported KafkaMetricsGroup
> >>> implementing a yammer based implementation?
> >>>
> >>> ProducerRequestStats with your configured analytics group?
> >>>
> >>> On Thu, Feb 13, 2014 at 11:37 AM, Jay Kreps <jay.kr...@gmail.com>
> wrote:
> >>>
> >>>> I think we discussed the scala/java stuff more fully previously.
> >>>> Essentially the client is embedded everywhere. Scala is very
> >> incompatible
> >>>> with itself so this makes it very hard to use for people using
> anything
> >>>> else in scala. Also Scala stack traces are very confusing. Basically
> we
> >>>> thought plain java code would be a lot easier for people to use. Even
> >> if
> >>>> Scala is more fun to write, that isn't really what we are optimizing
> >> for.
> >>>>
> >>>> -Jay
> >>>>
> >>>>
> >>>> On Thu, Feb 13, 2014 at 8:09 AM, S Ahmed <sahmed1...@gmail.com>
> wrote:
> >>>>
> >>>>> Jay, pretty impressive how you just write a 'quick version' like that
> >>> :)
> >>>>> Not to get off-topic but why didn't you write this in scala?
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Feb 12, 2014 at 6:54 PM, Joel Koshy <jjkosh...@gmail.com>
> >>> wrote:
> >>>>>
> >>>>>> I have not had a chance to review the new metrics code and its
> >>>>>> features carefully (apart from your write-up), but here are my
> >>> general
> >>>>>> thoughts:
> >>>>>>
> >>>>>> Implementing a metrics package correctly is difficult; more so for
> >>>>>> people like me, because I'm not a statistician.  However, if this
> >> new
> >>>>>> package: {(i) functions correctly (and we need to define and prove
> >>>>>> correctness), (ii) is easy to use, (iii) serves all our current and
> >>>>>> anticipated monitoring needs, (iv) is not so complex that it
> >>>>>> becomes a burden to maintain and we are better off with an available
> >>>>>> library;} then I think it makes sense to embed it and use it within
> >>>>>> the Kafka code. The main wins are: (i) predictability (no changing
> >>>>>> APIs and intimate knowledge of the code) and (ii) control with
> >>> respect
> >>>>>> to both functionality (e.g., there are hard-coded decay constants
> >> in
> >>>>>> metrics-core 2.x) and correctness (i.e., if we find a bug in the
> >>>>>> metrics package we have to submit a pull request and wait for it to
> >>>>>> become mainstream).  I'm not sure it would help very much to pull
> >> it
> >>>>>> into a separate repo because that could potentially annul these
> >>>>>> benefits.
> >>>>>>
> >>>>>> Joel
> >>>>>>
> >>>>>> On Wed, Feb 12, 2014 at 02:50:43PM -0800, Jay Kreps wrote:
> >>>>>>> Sriram,
> >>>>>>>
> >>>>>>> Makes sense. I am cool moving this stuff into its own repo if
> >>> people
> >>>>>> think
> >>>>>>> that is better. I'm not sure it would get much contribution but
> >>> when
> >>>> I
> >>>>>>> started messing with this I did have a lot of grand ideas of
> >> making
> >>>>>> adding
> >>>>>>> metrics to a sensor dynamic so you could add more stuff in
> >>>>> real-time (via
> >>>>>>> JMX, say) and/or externalize all your metrics and config to a
> >>>> separate
> >>>>>> file
> >>>>>>> like log4j with only the points of instrumentation hard-coded.
> >>>>>>>
> >>>>>>> -Jay
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Feb 12, 2014 at 2:07 PM, Sriram Subramanian <
> >>>>>>> srsubraman...@linkedin.com> wrote:
> >>>>>>>
> >>>>>>>> I am actually neutral to this change. I found the replies were
> >>> more
> >>>>>>>> towards the implementation and features so far. I would like
> >> the
> >>>>>> community
> >>>>>>>> to think about the questions below before making a decision. My
> >>>>>> opinion on
> >>>>>>>> this is that it has potential to be its own project and it
> >> would
> >>>>>> attract
> >>>>>>>> developers who are specifically interested in contributing to
> >>>>> metrics.
> >>>>>> I
> >>>>>>>> am skeptical that the Kafka contributors would focus on
> >> improving
> >>>>> this
> >>>>>>>> library (apart from bug fixes) instead of
> >> developing/contributing
> >>>> to
> >>>>>> other
> >>>>>>>> core pieces. It would be useful to keep it decoupled from the
> >>>>>>>> rest of Kafka (if it resides in the Kafka code base) so that we
> >>>>>>>> can move it out to its own project at any time.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 2/12/14 1:21 PM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>>> Hey Sriram,
> >>>>>>>>>
> >>>>>>>>> Not sure if these are actually meant as questions or more
> >> veiled
> >>>>>> comments.
> >>>>>>>>> In any case I tried to give my 2 cents inline.
> >>>>>>>>>
> >>>>>>>>> On Tue, Feb 11, 2014 at 11:12 PM, Sriram Subramanian <
> >>>>>>>>> srsubraman...@linkedin.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> I think answering the questions below would help to make a
> >>>> better
> >>>>>>>>>> decision. I am all for writing better code and having
> >> superior
> >>>>>>>>>> functionalities but it is worth thinking about stuff outside
> >>>> just
> >>>>>> code
> >>>>>>>>>> in
> >>>>>>>>>> this case -
> >>>>>>>>>>
> >>>>>>>>>> 1. Do metrics form a core piece of Kafka? Do they help Kafka
> >>>>>>>>>> greatly in providing better core functionality? I would always
> >>>>>>>>>> like a project to do one thing really well. Metrics are a
> >>>>>>>>>> non-trivial amount of code.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Metrics are obviously important, and obviously improving our
> >>>> metrics
> >>>>>>>>> system
> >>>>>>>>> would be good. That said this may or may not be better, and
> >> even
> >>>> if
> >>>>>> it is
> >>>>>>>>> better that betterness might not outweigh other
> >> considerations.
> >>>> That
> >>>>>> is
> >>>>>>>>> what we are discussing.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> 2. Does it make sense to be part of Kafka or its own
> >> project?
> >>> If
> >>>>>> this
> >>>>>>>>>> metrics library has the potential to be better than
> >>>> metrics-core,
> >>>>> I
> >>>>>>>>>> would
> >>>>>>>>>> be interested in seeing other projects take advantage of it.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> It could be either.
> >>>>>>>>>
> >>>>>>>>> 3. Can Kafka maintain this library as new members join and old
> >>>>> members
> >>>>>>>>>> leave? Would this be a piece of code that no one (in Kafka)
> >> in
> >>>> the
> >>>>>>>>>> future
> >>>>>>>>>> spends time improving if the original author left?
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I am not going anywhere in the near term, but if I did, yes,
> >>> this
> >>>>>> would be
> >>>>>>>>> like any other code we have. As with yammer metrics or any
> >> other
> >>>>> code
> >>>>>> at
> >>>>>>>>> that point we would either use it as is or someone would
> >> improve
> >>>> it.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> 4. Does it affect the schedule of producer rewrite? This
> >> needs
> >>>> its
> >>>>>> own
> >>>>>>>>>> stabilization and modification to existing metric dashboards
> >>> if
> >>>>> the
> >>>>>>>>>> format
> >>>>>>>>>> is changed. Many times such costs are not factored in and a
> >>>>>>>>>> project loses time before realizing the extra time required to
> >>>>>>>>>> make a library like this operational.
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Probably not. The metrics are going to change regardless of
> >>>> whether
> >>>>>> we use
> >>>>>>>>> the same library or not. If we think this is better I don't
> >> mind
> >>>>>> putting
> >>>>>>>>> in
> >>>>>>>>> a little extra effort to get there.
> >>>>>>>>>
> >>>>>>>>> Irrespective I think this is probably not the right thing to
> >>>>> optimize
> >>>>>> for.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> I am sure we can do better when we write code to a specific
> >>> use
> >>>>>> case (in
> >>>>>>>>>> this case, kafka) rather than building a generic library
> >> that
> >>>>> suits
> >>>>>> all
> >>>>>>>>>> (metrics-core) but I would like us to have answers to the
> >>>>> questions
> >>>>>>>>>> above
> >>>>>>>>>> and be prepared before we proceed to support this with the
> >>>>> producer
> >>>>>>>>>> rewrite.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Naturally we are all considering exactly these things, that is
> >>>>>> exactly the
> >>>>>>>>> reason I started the thread.
> >>>>>>>>>
> >>>>>>>>> -Jay
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> On 2/11/14 6:28 PM, "Jun Rao" <jun...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Thanks for the detailed write-up. It's well thought
> >> through.
> >>> A
> >>>>> few
> >>>>>>>>>>> comments:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. I have a couple of concerns on the percentiles. The
> >> first
> >>>>> issue
> >>>>>> is
> >>>>>>>>>> that
> >>>>>>>>>>> it requires the user to know the value range. Since the
> >> range
> >>>> for
> >>>>>>>>>> things
> >>>>>>>>>>> like message size (in millions) is quite different from
> >> those
> >>>>> like
> >>>>>>>>>> request
> >>>>>>>>>>> time (less than 100), it's going to be hard to pick a good
> >>>> global
> >>>>>>>>>> default
> >>>>>>>>>>> range. Different apps could be dealing with different
> >> message
> >>>>>> size. So
> >>>>>>>>>>> they
> >>>>>>>>>>> probably will have to customize the range. Another issue is
> >>>> that
> >>>>>> it can
> >>>>>>>>>>> only report values at the bucket boundaries. So, if you
> >> have
> >>>> 1000
> >>>>>>>>>> buckets
> >>>>>>>>>>> and a value range of 1 million, you will only see 1000
> >>> possible
> >>>>>> values
> >>>>>>>>>> as
> >>>>>>>>>>> the quantile, which is probably too sparse. The
> >>> implementation
> >>>> of
> >>>>>>>>>>> histogram
> >>>>>>>>>>> in metrics-core keeps a fixed-size set of samples, which avoids
> >>> both
> >>>>>> issues.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. We need to document the 3-part metrics names better
> >> since
> >>>> it's
> >>>>>> not
> >>>>>>>>>>> obvious what the convention is. Also, currently the name of
> >>> the
> >>>>>> sensor
> >>>>>>>>>> and
> >>>>>>>>>>> the metrics defined in it are independent. Would it make
> >>> sense
> >>>> to
> >>>>>> have
> >>>>>>>>>> the
> >>>>>>>>>>> sensor name be a prefix of the metric name?
> >>>>>>>>>>>
> >>>>>>>>>>> Overall, this approach seems to be cleaner than
> >> metrics-core
> >>> by
> >>>>>>>>>> decoupling
> >>>>>>>>>>> measuring and reporting. The main benefit of metrics-core
> >>> seems
> >>>>> to
> >>>>>> be
> >>>>>>>>>> the
> >>>>>>>>>>> existing reporters. Since not that many people voted for
> >>>>>> metrics-core,
> >>>>>>>>>> I
> >>>>>>>>>>> am
> >>>>>>>>>>> ok with going with the new implementation. My only
> >>>> recommendation
> >>>>>> is to
> >>>>>>>>>>> address the concern on percentiles.
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>>
> >>>>>>>>>>> Jun
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Feb 6, 2014 at 12:51 PM, Jay Kreps <
> >>>> jay.kr...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hey guys,
> >>>>>>>>>>>>
> >>>>>>>>>>>> I wanted to kick off a quick discussion of metrics with
> >>>> respect
> >>>>>> to
> >>>>>>>>>> the
> >>>>>>>>>>>> new
> >>>>>>>>>>>> producer and consumer (and potentially the server).
> >>>>>>>>>>>>
> >>>>>>>>>>>> At a high level I think there are three approaches we
> >> could
> >>>>> take:
> >>>>>>>>>>>> 1. Plain vanilla JMX
> >>>>>>>>>>>> 2. Use Coda Hale (AKA Yammer) Metrics
> >>>>>>>>>>>> 3. Do our own metrics (with JMX as one output)
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. Has the advantage that JMX is the most commonly used
> >>> java
> >>>>>> thing
> >>>>>>>>>> and
> >>>>>>>>>>>> plugs in reasonably to most metrics systems. JMX is
> >>> included
> >>>> in
> >>>>>> the
> >>>>>>>>>> JDK
> >>>>>>>>>>>> so
> >>>>>>>>>>>> it doesn't impose any additional dependencies on clients.
> >>> It
> >>>>> has
> >>>>>> the
> >>>>>>>>>>>> disadvantage that plain vanilla JMX is a pain to use. We
> >>>> would
> >>>>>> need a
> >>>>>>>>>>>> bunch
> >>>>>>>>>>>> of helper code for maintaining counters to make this
> >>>>> reasonable.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2. Coda Hale metrics is pretty good and broadly used. It
> >>>>>> supports JMX
> >>>>>>>>>>>> output as well as direct output to many other types of
> >>>> systems.
> >>>>>> The
> >>>>>>>>>>>> primary
> >>>>>>>>>>>> downside we have had with Coda Hale has to do with the
> >>>> clients
> >>>>>> and
> >>>>>>>>>>>> library
> >>>>>>>>>>>> incompatibilities. We are currently on an older more
> >>> popular
> >>>>>> version.
> >>>>>>>>>>>> The
> >>>>>>>>>>>> newer version is a rewrite of the APIs and is
> >> incompatible.
> >>>>>>>>>> Originally
> >>>>>>>>>>>> these were totally incompatible and people had to choose
> >>> one
> >>>> or
> >>>>>> the
> >>>>>>>>>>>> other.
> >>>>>>>>>>>> I think that has been improved so now the new version is
> >> a
> >>>>>> totally
> >>>>>>>>>>>> different package. But even in this case you end up with
> >>> both
> >>>>>>>>>> versions
> >>>>>>>>>>>> if
> >>>>>>>>>>>> you use Kafka and we are on a different version than you
> >>>> which
> >>>>> is
> >>>>>>>>>> going
> >>>>>>>>>>>> to
> >>>>>>>>>>>> be pretty inconvenient.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3. Doing our own has the downside of potentially
> >>> reinventing
> >>>>> the
> >>>>>>>>>> wheel,
> >>>>>>>>>>>> and
> >>>>>>>>>>>> potentially needing to work out any bugs in our code. The
> >>>>> upsides
> >>>>>>>>>> would
> >>>>>>>>>>>> depend on the how good the reinvention was. As it
> >> happens I
> >>>>> did a
> >>>>>>>>>> quick
> >>>>>>>>>>>> (~900 loc) version of a metrics library that is under
> >>>>>>>>>>>> kafka.common.metrics.
> >>>>>>>>>>>> I think it has some advantages over the Yammer metrics
> >>>> package
> >>>>>> for
> >>>>>>>>>> our
> >>>>>>>>>>>> usage beyond just not causing incompatibilities. I will
> >>>>> describe
> >>>>>> this
> >>>>>>>>>>>> code
> >>>>>>>>>>>> so we can discuss the pros and cons. Although I favor
> >> this
> >>>>>> approach I
> >>>>>>>>>>>> have
> >>>>>>>>>>>> no emotional attachment and wouldn't be too sad if I
> >> ended
> >>> up
> >>>>>>>>>> deleting
> >>>>>>>>>>>> it.
> >>>>>>>>>>>> Here are javadocs for this code, though I haven't written
> >>>> much
> >>>>>>>>>>>> documentation yet since I might end up deleting it:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here is a quick overview of this library.
> >>>>>>>>>>>>
> >>>>>>>>>>>> There are three main public interfaces:
> >>>>>>>>>>>>  Metrics - This is a repository of metrics being
> >> tracked.
> >>>>>>>>>>>>  Metric - A single, named numerical value being measured
> >>>>> (i.e. a
> >>>>>>>>>>>> counter).
> >>>>>>>>>>>>  Sensor - This is a thing that records values and
> >> updates
> >>>> zero
> >>>>>> or
> >>>>>>>>>> more
> >>>>>>>>>>>> metrics
> >>>>>>>>>>>>
> >>>>>>>>>>>> So let's say we want to track four values about message
> >>>> sizes;
> >>>>>>>>>>>> specifically say we want to record the average, the
> >>> maximum,
> >>>>> the
> >>>>>>>>>> total
> >>>>>>>>>>>> rate
> >>>>>>>>>>>> of bytes being sent, and a count of messages. Then we
> >> would
> >>>> do
> >>>>>>>>>> something
> >>>>>>>>>>>> like this:
> >>>>>>>>>>>>
> >>>>>>>>>>>>   // setup code
> >>>>>>>>>>>>   Metrics metrics = new Metrics(); // this is a global
> >>>>>> "singleton"
> >>>>>>>>>>>>   Sensor sensor =
> >>>>>> metrics.sensor("kafka.producer.message.sizes");
> >>>>>>>>>>>>   sensor.add("kafka.producer.message-size.avg", new
> >>> Avg());
> >>>>>>>>>>>>   sensor.add("kafka.producer.message-size.max", new
> >>> Max());
> >>>>>>>>>>>>   sensor.add("kafka.producer.bytes-sent-per-sec", new
> >>>> Rate());
> >>>>>>>>>>>>   sensor.add("kafka.producer.message-count", new
> >> Count());
> >>>>>>>>>>>>
> >>>>>>>>>>>>   // now when we get a message we do this
> >>>>>>>>>>>>   sensor.record(messageSize);
> >>>>>>>>>>>>
> >>>>>>>>>>>> The above code creates the global metrics repository,
> >>>> creates a
> >>>>>>>>>> single
> >>>>>>>>>>>> Sensor, and defines four named metrics that are updated by
> >>> that
> >>>>>> Sensor.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Like Yammer Metrics (YM) I allow you to plug in
> >>> "reporters",
> >>>>>>>>>> including a
> >>>>>>>>>>>> JMX reporter. Unlike the Coda Hale JMX reporter the
> >>> reporter
> >>>> I
> >>>>>> have
> >>>>>>>>>> keys
> >>>>>>>>>>>> off the metric names not the Sensor names, which I think
> >> is
> >>>> an
> >>>>>>>>>>>> improvement--I just use the convention that the last
> >>> portion
> >>>> of
> >>>>>> the
> >>>>>>>>>>>> name is
> >>>>>>>>>>>> the attribute name, the second to last is the mbean name,
> >>> and
> >>>>> the
> >>>>>>>>>> rest
> >>>>>>>>>>>> is
> >>>>>>>>>>>> the package. So in the above example there is a message-size
> >>>>>>>>>>>> mbean that has an avg and a max attribute and a producer mbean
> >>>>>>>>>>>> that has a bytes-sent-per-sec and a message-count attribute.
> >>>>>>>>>>>> This is nice because you can
> >>>>>> logically
> >>>>>>>>>>>> group
> >>>>>>>>>>>> the values reported irrespective of where in the program
> >>> they
> >>>>> are
> >>>>>>>>>>>> computed--that is an mbean can logically group attributes
> >>>>>> computed
> >>>>>>>>>> off
> >>>>>>>>>>>> different sensors. This means you can report values by
> >>>> logical
> >>>>>>>>>>>> subsystem.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I also allow the concept of hierarchical Sensors which I
> >>>> think
> >>>>>> is a
> >>>>>>>>>> good
> >>>>>>>>>>>> convenience. I have noticed a common pattern in systems
> >>> where
> >>>>> you
> >>>>>>>>>> need
> >>>>>>>>>>>> to
> >>>>>>>>>>>> roll up the same values along different dimensions. An
> >>> simple
> >>>>>>>>>> example is
> >>>>>>>>>>>> metrics about qps, data rate, etc on the broker. These we
> >>>> want
> >>>>> to
> >>>>>>>>>>>> capture
> >>>>>>>>>>>> in aggregate, but also broken down by topic-id. You can
> >> do
> >>>> this
> >>>>>>>>>> purely
> >>>>>>>>>>>> by
> >>>>>>>>>>>> defining the sensor hierarchy:
> >>>>>>>>>>>> Sensor allSizes = metrics.sensor("kafka.producer.sizes");
> >>>>>>>>>>>> Sensor topicSizes = metrics.sensor("kafka.producer." +
> >>> topic
> >>>> +
> >>>>>>>>>>>> ".sizes",
> >>>>>>>>>>>> allSizes);
> >>>>>>>>>>>> Now each actual update will go to the appropriate
> >>> topicSizes
> >>>>>> sensor
> >>>>>>>>>>>> (based
> >>>>>>>>>>>> on the topic name), but allSizes metrics will get updated
> >>>> too.
> >>>>> I
> >>>>>> also
> >>>>>>>>>>>> support multiple parents for each sensor as well as
> >>> multiple
> >>>>>> layers
> >>>>>>>>>> of
> >>>>>>>>>>>> hierarchy, so you can define a more elaborate DAG of
> >>> sensors.
> >>>> An
> >>>>>>>>>> example
> >>>>>>>>>>>> of
> >>>>>>>>>>>> how this would be useful is if you wanted to record your
> >>>>> metrics
> >>>>>>>>>> broken
> >>>>>>>>>>>> down by topic AND client id as well as the global
> >>> aggregate.
> >>>>>>>>>>>>
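> >>>>>>>>>>>>
> >>>>>>>>>>>> As a rough illustration of that kind of DAG (the vararg-parents
> >>>>>>>>>>>> overload and the clientId variable are assumptions for the
> >>>>>>>>>>>> example, not a final API):
> >>>>>>>>>>>>
> >>>>>>>>>>>>   Sensor allSizes = metrics.sensor("kafka.producer.sizes");
> >>>>>>>>>>>>   Sensor topicSizes =
> >>>>>>>>>>>>       metrics.sensor("kafka.producer.topics." + topic + ".sizes");
> >>>>>>>>>>>>   Sensor clientSizes =
> >>>>>>>>>>>>       metrics.sensor("kafka.producer.clients." + clientId + ".sizes");
> >>>>>>>>>>>>   // leaf sensor with several parents: one record() updates all the roll-ups
> >>>>>>>>>>>>   Sensor requestSizes =
> >>>>>>>>>>>>       metrics.sensor("kafka.producer.clients." + clientId
> >>>>>>>>>>>>                          + ".topics." + topic + ".sizes",
> >>>>>>>>>>>>                      allSizes, topicSizes, clientSizes);
> >>>>>>>>>>>>
> >>>>>>>>>>>>   // one call per message; the value flows to this sensor and every parent
> >>>>>>>>>>>>   requestSizes.record(messageSize);
> >>>>>>>>>>>>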
> >>>>>>>>>>>> Each metric can take a configurable Quota value which
> >>> allows
> >>>> us
> >>>>>> to
> >>>>>>>>>> limit
> >>>>>>>>>>>> the maximum value of that sensor. This is intended for
> >> use
> >>> on
> >>>>> the
> >>>>>>>>>>>> server as
> >>>>>>>>>>>> part of our Quota implementation. The way this works is
> >>> that
> >>>>> you
> >>>>>>>>>> record
> >>>>>>>>>>>> metrics as usual:
> >>>>>>>>>>>>   mySensor.record(42.0)
> >>>>>>>>>>>> However if this event occurrence causes one of the metrics
> >>> to
> >>>>>> exceed
> >>>>>>>>>> its
> >>>>>>>>>>>> maximum allowable value (the quota) this call will throw
> >> a
> >>>>>>>>>>>> QuotaViolationException. The cool thing about this is
> >> that
> >>> it
> >>>>>> means
> >>>>>>>>>> we
> >>>>>>>>>>>> can
> >>>>>>>>>>>> define quotas on anything we capture metrics for, which I
> >>>> think
> >>>>>> is
> >>>>>>>>>>>> pretty
> >>>>>>>>>>>> cool.
> >>>>>>>>>>>>
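> >>>>>>>>>>>>
> >>>>>>>>>>>> On the server side that might be used roughly like this (how the
> >>>>>>>>>>>> quota gets attached is illustrative; only record() throwing
> >>>>>>>>>>>> QuotaViolationException is the behavior described above, and
> >>>>>>>>>>>> throttleRequest() is just a placeholder):
> >>>>>>>>>>>>
> >>>>>>>>>>>>   Sensor produceRate = metrics.sensor("kafka.server.produce.byte-rate");
> >>>>>>>>>>>>   produceRate.add("kafka.server.produce.bytes-per-sec",
> >>>>>>>>>>>>                   new Rate(),
> >>>>>>>>>>>>                   new MetricConfig().quota(new Quota(10 * 1024 * 1024)));
> >>>>>>>>>>>>
> >>>>>>>>>>>>   try {
> >>>>>>>>>>>>     produceRate.record(requestSizeInBytes);
> >>>>>>>>>>>>   } catch (QuotaViolationException e) {
> >>>>>>>>>>>>     // the windowed rate exceeded its configured maximum; throttle or reject
> >>>>>>>>>>>>     throttleRequest(e);
> >>>>>>>>>>>>   }
> >>>>>>>>>>>>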
> >>>>>>>>>>>> Another question is how to handle windowing of the
> >> values?
> >>>>>> Metrics
> >>>>>>>>>> want
> >>>>>>>>>>>> to
> >>>>>>>>>>>> record the "current" value, but the definition of current
> >>> is
> >>>>>>>>>> inherently
> >>>>>>>>>>>> nebulous. A few of the obvious gotchas are that if you
> >>> define
> >>>>>>>>>> "current"
> >>>>>>>>>>>> to
> >>>>>>>>>>>> be a number of events you can end up measuring an
> >>> arbitrarily
> >>>>>> long
> >>>>>>>>>>>> window
> >>>>>>>>>>>> of time if the event rate is low (e.g. you think you are
> >>>>> getting
> >>>>>> 50
> >>>>>>>>>>>> messages/sec because that was the rate yesterday when all
> >>>>> events
> >>>>>>>>>>>> stopped).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here is how I approach this. All the metrics use the same
> >>>>>> windowing
> >>>>>>>>>>>> approach. We define a single window by a length of time
> >> or
> >>>>>> number of
> >>>>>>>>>>>> values
> >>>>>>>>>>>> (you can use either or both--if both the window ends when
> >>>>>> *either*
> >>>>>>>>>> the
> >>>>>>>>>>>> time
> >>>>>>>>>>>> bound or event bound is hit). The typical problem with
> >> hard
> >>>>>> window
> >>>>>>>>>>>> boundaries is that at the beginning of the window you
> >> have
> >>> no
> >>>>>> data
> >>>>>>>>>> and
> >>>>>>>>>>>> the
> >>>>>>>>>>>> first few samples are too small to be a valid sample.
> >>>>>>>>>>>> (Consider if you were keeping an avg and the first value in
> >>>>>>>>>>>> the window happens to be very, very high; if you check the avg
> >>>>>>>>>>>> at this exact time you will conclude the avg is very high, but
> >>>>>>>>>>>> on a sample size of one). One simple fix
> >>> would
> >>>> be
> >>>>>> to
> >>>>>>>>>>>> always
> >>>>>>>>>>>> report the last complete window, however this is not
> >>>>> appropriate
> >>>>>> here
> >>>>>>>>>>>> because (1) we want to drive quotas off it so it needs to
> >>> be
> >>>>>> current,
> >>>>>>>>>>>> and
> >>>>>>>>>>>> (2) since this is for monitoring you kind of care more
> >>> about
> >>>>> the
> >>>>>>>>>> current
> >>>>>>>>>>>> state. The ideal solution here would be to define a
> >>> backwards
> >>>>>> looking
> >>>>>>>>>>>> sliding window from the present, but many statistics are
> >>>>> actually
> >>>>>>>>>> very
> >>>>>>>>>>>> hard
> >>>>>>>>>>>> to compute in this model without retaining all the values
> >>>> which
> >>>>>>>>>> would be
> >>>>>>>>>>>> hopelessly inefficient. My solution to this is to keep a
> >>>>>> configurable
> >>>>>>>>>>>> number of windows (default is two) and combine them for
> >> the
> >>>>>> estimate.
> >>>>>>>>>>>> So in
> >>>>>>>>>>>> a two sample case depending on when you ask you have
> >>> between
> >>>>> one
> >>>>>> and
> >>>>>>>>>> two
> >>>>>>>>>>>> complete samples worth of data to base the answer off of.
> >>>>>> Provided
> >>>>>>>>>> the
> >>>>>>>>>>>> sample window is large enough to get a valid result this
> >>>>>> satisfies
> >>>>>>>>>> both
> >>>>>>>>>>>> of
> >>>>>>>>>>>> my criteria of incorporating the most recent data and
> >>> having
> >>>>>>>>>> reasonable
> >>>>>>>>>>>> variance at all times.
> >>>>>>>>>>>>
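> >>>>>>>>>>>>
> >>>>>>>>>>>> A minimal sketch of the two-window idea for a rate (this is not
> >>>>>>>>>>>> the library's code, just the combining logic described above):
> >>>>>>>>>>>>
> >>>>>>>>>>>>   public class TwoWindowRate {
> >>>>>>>>>>>>     private final long windowMs;
> >>>>>>>>>>>>     private long windowStart;  // start of the current (partial) window
> >>>>>>>>>>>>     private double current;    // sum recorded in the current window
> >>>>>>>>>>>>     private double previous;   // sum from the last complete window
> >>>>>>>>>>>>
> >>>>>>>>>>>>     public TwoWindowRate(long windowMs, long now) {
> >>>>>>>>>>>>       this.windowMs = windowMs;
> >>>>>>>>>>>>       this.windowStart = now;
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>
> >>>>>>>>>>>>     public void record(double value, long now) {
> >>>>>>>>>>>>       maybeRoll(now);
> >>>>>>>>>>>>       current += value;
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>
> >>>>>>>>>>>>     // combine the complete previous window with the partial current
> >>>>>>>>>>>>     // one, so the answer covers between one and two windows of data
> >>>>>>>>>>>>     public double measure(long now) {
> >>>>>>>>>>>>       maybeRoll(now);
> >>>>>>>>>>>>       double elapsedMs = windowMs + (now - windowStart);
> >>>>>>>>>>>>       return (previous + current) / (elapsedMs / 1000.0); // per second
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>
> >>>>>>>>>>>>     private void maybeRoll(long now) {
> >>>>>>>>>>>>       long elapsed = now - windowStart;
> >>>>>>>>>>>>       if (elapsed >= 2 * windowMs) {        // both windows are stale
> >>>>>>>>>>>>         previous = 0.0;
> >>>>>>>>>>>>         current = 0.0;
> >>>>>>>>>>>>         windowStart = now;
> >>>>>>>>>>>>       } else if (elapsed >= windowMs) {     // roll the window forward
> >>>>>>>>>>>>         previous = current;
> >>>>>>>>>>>>         current = 0.0;
> >>>>>>>>>>>>         windowStart += windowMs;
> >>>>>>>>>>>>       }
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>   }
> >>>>>>>>>>>>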
> >>>>>>>>>>>> Another approach is to use an exponential weighting
> >> scheme
> >>> to
> >>>>>> combine
> >>>>>>>>>>>> all
> >>>>>>>>>>>> history but emphasize the recent past. I have not done
> >> this
> >>>> as
> >>>>> it
> >>>>>>>>>> has a
> >>>>>>>>>>>> lot
> >>>>>>>>>>>> of issues for practical operational metrics. I'd be happy
> >>> to
> >>>>>>>>>> elaborate
> >>>>>>>>>>>> on
> >>>>>>>>>>>> this if anyone cares...
> >>>>>>>>>>>>
> >>>>>>>>>>>> The window size for metrics has a global default which
> >> can
> >>> be
> >>>>>>>>>>>> overridden at
> >>>>>>>>>>>> either the sensor or individual metric level.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In addition to these time series values the user can
> >>> directly
> >>>>>> expose
> >>>>>>>>>>>> some
> >>>>>>>>>>>> method of their choosing JMX-style by implementing the
> >>>>> Measurable
> >>>>>>>>>>>> interface
> >>>>>>>>>>>> and registering that value. E.g.
> >>>>>>>>>>>>  metrics.addMetric("my.metric", new Measurable() {
> >>>>>>>>>>>>    public double measure(MetricConfig config, long now) {
> >>>>>>>>>>>>       return this.calculateValueToExpose();
> >>>>>>>>>>>>    }
> >>>>>>>>>>>>  });
> >>>>>>>>>>>> This is useful for exposing things like the accumulator
> >>> free
> >>>>>> memory.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The set of metrics is extensible, new metrics can be
> >> added
> >>> by
> >>>>>> just
> >>>>>>>>>>>> implementing the appropriate interfaces and registering
> >>> with
> >>>> a
> >>>>>>>>>> sensor. I
> >>>>>>>>>>>> implement the following metrics:
> >>>>>>>>>>>>  total - the sum of all values from the given sensor
> >>>>>>>>>>>>  count - a windowed count of values from the sensor
> >>>>>>>>>>>>  avg - the sample average within the windows
> >>>>>>>>>>>>  max - the max over the windows
> >>>>>>>>>>>>  min - the min over the windows
> >>>>>>>>>>>>  rate - the rate in the windows (e.g. the total or count
> >>>>>> divided by
> >>>>>>>>>> the
> >>>>>>>>>>>> elapsed time)
> >>>>>>>>>>>>  percentiles - a collection of percentiles computed over
> >>> the
> >>>>>> window
> >>>>>>>>>>>>
> >>>>>>>>>>>> My approach to percentiles is a little different from the
> >>>>> yammer
> >>>>>>>>>> metrics
> >>>>>>>>>>>> package. My complaint about the yammer metrics approach
> >> is
> >>>> that
> >>>>>> it
> >>>>>>>>>> uses
> >>>>>>>>>>>> rather expensive sampling and uses kind of a lot of
> >> memory
> >>> to
> >>>>>> get a
> >>>>>>>>>>>> reasonable sample. This is problematic for per-topic
> >>>>>> measurements.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Instead I use a fixed range for the histogram (e.g. 0.0
> >> to
> >>>>>> 30000.0)
> >>>>>>>>>>>> which
> >>>>>>>>>>>> directly allows you to specify the desired memory use.
> >> Any
> >>>>> value
> >>>>>>>>>> below
> >>>>>>>>>>>> the
> >>>>>>>>>>>> minimum is recorded as -Infinity and any value above the
> >>>>> maximum
> >>>>>> as
> >>>>>>>>>>>> +Infinity. I think this is okay as all metrics have an
> >>>> expected
> >>>>>> range
> >>>>>>>>>>>> except for latency which can be arbitrarily large, but
> >> for
> >>>> very
> >>>>>> high
> >>>>>>>>>>>> latency there is no need to model it exactly (e.g. 30
> >>>> seconds +
> >>>>>>>>>> really
> >>>>>>>>>>>> is
> >>>>>>>>>>>> effectively infinite). Within the range values are
> >> recorded
> >>>> in
> >>>>>>>>>> buckets
> >>>>>>>>>>>> which can be either fixed width or increasing width. The
> >>>>>> increasing
> >>>>>>>>>>>> width
> >>>>>>>>>>>> is analogous to the idea of significant figures, that is
> >> if
> >>>>> your
> >>>>>>>>>> value
> >>>>>>>>>>>> is
> >>>>>>>>>>>> in the range 0-10 you might want to be accurate to within
> >>>> 1ms,
> >>>>>> but if
> >>>>>>>>>>>> it is
> >>>>>>>>>>>> 20000 there is no need to be so accurate. I implemented a
> >>>>> linear
> >>>>>>>>>> bucket
> >>>>>>>>>>>> size where the Nth bucket has width proportional to N. An
> >>>>>> exponential
> >>>>>>>>>>>> bucket size would also be sensible and could likely be
> >>>> derived
> >>>>>>>>>> directly
> >>>>>>>>>>>> from the floating point representation of the value.
> >>>>>>>>>>>>
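> >>>>>>>>>>>>
> >>>>>>>>>>>> A purely illustrative sketch of the linear bucket sizing (the
> >>>>>>>>>>>> class and field names are made up for the example):
> >>>>>>>>>>>>
> >>>>>>>>>>>>   // Bucket n (1-based) has width proportional to n, so its upper
> >>>>>>>>>>>>   // boundary is scale * n * (n + 1) / 2, with scale chosen so the
> >>>>>>>>>>>>   // last bucket ends exactly at the configured max.
> >>>>>>>>>>>>   public class LinearHistogram {
> >>>>>>>>>>>>     private final int[] counts;
> >>>>>>>>>>>>     private final double max;
> >>>>>>>>>>>>     private final double scale;
> >>>>>>>>>>>>
> >>>>>>>>>>>>     public LinearHistogram(int numBuckets, double max) {
> >>>>>>>>>>>>       this.counts = new int[numBuckets];
> >>>>>>>>>>>>       this.max = max;
> >>>>>>>>>>>>       this.scale = 2.0 * max / (numBuckets * (numBuckets + 1.0));
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>
> >>>>>>>>>>>>     public void record(double value) {
> >>>>>>>>>>>>       counts[bucketFor(value)]++;
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>
> >>>>>>>>>>>>     private int bucketFor(double value) {
> >>>>>>>>>>>>       if (value <= 0.0)
> >>>>>>>>>>>>         return 0;                  // below range ("-Infinity" above)
> >>>>>>>>>>>>       if (value >= max)
> >>>>>>>>>>>>         return counts.length - 1;  // above range ("+Infinity" above)
> >>>>>>>>>>>>       // invert boundary(n) = scale * n * (n + 1) / 2 to find n
> >>>>>>>>>>>>       double n = (Math.sqrt(1.0 + 8.0 * value / scale) - 1.0) / 2.0;
> >>>>>>>>>>>>       return Math.min(counts.length - 1, (int) Math.ceil(n) - 1);
> >>>>>>>>>>>>     }
> >>>>>>>>>>>>   }
> >>>>>>>>>>>>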
> >>>>>>>>>>>> I'd like to get some feedback on this metrics code and
> >>> make a
> >>>>>>>>>> decision
> >>>>>>>>>>>> on
> >>>>>>>>>>>> whether we want to use it before I actually go ahead and
> >>> add
> >>>>> all
> >>>>>> the
> >>>>>>>>>>>> instrumentation in the code (otherwise I'll have to redo
> >> it
> >>>> if
> >>>>> we
> >>>>>>>>>> switch
> >>>>>>>>>>>> approaches). So the next topic of discussion will be
> >> which
> >>>>> actual
> >>>>>>>>>>>> metrics
> >>>>>>>>>>>> to add.
> >>>>>>>>>>>>
> >>>>>>>>>>>> -Jay
> >>>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>
>
