Re: Metrics in new producer

Jun Rao Mon, 24 Feb 2014 21:40:18 -0800

Clark,

As Martin pointed out, if a stat is stable, the numbers that you get from
the new metrics are going to be close to what you get from Coda metrics. If
a stat is not stable, what the new metrics give you is probably more
intuitive. Given that, would you still want the Coda metrics through a pure
stub?


Thanks,

Jun


On Sat, Feb 22, 2014 at 9:53 AM, Clark Breyman <cl...@breyman.com> wrote:

> Jay - I was thinking of a pure stub rather than just wrapping Kafka metrics
> in a Coda gauge.  I'd like the Timers, Meters etc to still be Coda meters -
> that way the windows, exponential decays, etc are comparable to the rest of
> the Coda metrics in our applications. At the same time, I don't want to
> force Coda timers (or any other timers) on an app that won't make good use
> of them.
>
> Thanks again, C
>
>
> On Sat, Feb 22, 2014 at 9:25 AM, Martin Kleppmann
> <mkleppm...@linkedin.com> wrote:
>
> > Not sure if you want yet another opinion added to the pile -- but since I
> > had a similar problem on another project recently, I thought I'd weigh in.
> > (On that project we were originally using Coda's library, but then switched
> > to rolling our own metrics implementation because we needed to do a few
> > things differently.)
> >
> > 1. Problems we encountered with Coda's library: it uses an
> > exponentially-weighted moving average (EWMA) for rates (e.g. messages/sec),
> > and exponentially biased reservoir sampling for histograms (percentiles,
> > averages). Those methods of calculation work well for events with a
> > consistently high volume, but they give strange and misleading results for
> > events that are bursty or rare (e.g. error rates). We found that a
> > fixed-size window gives more predictable, easier-to-interpret results.
> >
> > 2. In defence of Coda's library, I think its histogram implementation is a
> > good trade-off of memory for accuracy; I'm not totally convinced that your
> > proposal (counts of events in a fixed set of buckets) would be much better.
> > Would have to do some math to work out the expected accuracy in each case.
> > The reservoir sampling can be configured to use a smaller sample if the
> > default of 1028 samples is too expensive. Reservoir sampling also has the
> > advantage that you don't need to hard-code a bucket distribution.
> >
> > 3. Quotas are an interesting use case. However, I'm not wild about using a
> > QuotaViolationException for control flow -- I think an explicit conditional
> > would be nicer than having to catch an exception. One question in that
> > context: if a quota is exceeded, do you still want to count the event
> > towards the metric, or do you want to stop counting it until the quota is
> > replenished? The answer may depend on the particular metric.
> >
> > 4. If you decide to go with Coda's library, I would advocate isolating the
> > dependency into a separate module and using it via a facade -- somewhat
> > like using SLF4J instead of Log4j directly. It's ok for Coda's library to
> > be the default metrics implementation, but it should be easy to swap it out
> > for something different in case someone has a version conflict or differing
> > requirements. The facade should be at a low level (individual events), not
> > at the reporter level (which deals with pre-aggregated values, and is
> > already pluggable).
> >
> > 5. If it's useful, I can probably contribute my simple (but imho
> > effective) metrics library, for embedding into Kafka. It uses reservoir
> > sampling for percentiles, like Coda's library, but uses a fixed-size window
> > instead of an exponential bias, which avoids weird behaviour on bursty
> > metrics.
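
To make point 1 concrete, here is a minimal Java sketch comparing an exponentially weighted rate with a fixed 60-second window on a single burst of events. It is a simplified illustration, not Coda Hale's actual code: the class name `BurstyRateDemo`, the 5-second tick, and the smoothing factor `1 - exp(-tick/window)` are all assumptions chosen for the example.

```java
// Simplified sketch for illustration only; not Coda Hale's implementation.
class BurstyRateDemo {
    static final double TICK_SECONDS = 5.0;
    static final double WINDOW_SECONDS = 60.0;
    // Assumed smoothing factor for a one-minute exponentially weighted average.
    static final double ALPHA = 1.0 - Math.exp(-TICK_SECONDS / WINDOW_SECONDS);

    /** EWMA rate (events/sec) after feeding the per-tick event counts in order. */
    static double ewmaRate(long[] countsPerTick) {
        double rate = 0.0;
        for (long count : countsPerTick) {
            double instantRate = count / TICK_SECONDS;
            rate += ALPHA * (instantRate - rate);
        }
        return rate;
    }

    /** Fixed-window rate: events seen in the most recent 60 seconds, divided by 60. */
    static double windowRate(long[] countsPerTick) {
        int ticksInWindow = (int) (WINDOW_SECONDS / TICK_SECONDS);
        long total = 0;
        int first = Math.max(0, countsPerTick.length - ticksInWindow);
        for (int i = first; i < countsPerTick.length; i++) {
            total += countsPerTick[i];
        }
        return total / WINDOW_SECONDS;
    }

    public static void main(String[] args) {
        // One burst of 600 events in the first tick, then two minutes of silence.
        long[] counts = new long[25];
        counts[0] = 600;
        System.out.println("EWMA rate:   " + ewmaRate(counts));
        System.out.println("window rate: " + windowRate(counts));
    }
}
```

Two minutes after the burst the fixed window reports exactly zero, while the exponentially weighted value is still above zero and keeps decaying, which is the kind of hard-to-interpret tail described for rare events.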
> >
> > In summary, I would advocate one of the following approaches:
> > - Coda Hale library via facade (allowing it to be swapped for something
> > else), or
> > - Own metrics implementation, provided that we have confidence in its
> > implementation of percentiles.
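
A facade at the level of individual events, as suggested in point 4, could look roughly like the following Java sketch. Every name here (`EventRecorder`, `NoOpRecorder`, `CountingRecorder`, `InstrumentedSender`, the metric name) is hypothetical and only illustrates the shape of the idea; a Coda-backed implementation would be one more class behind the same interface.

```java
// Hypothetical event-level facade; all names are illustrative assumptions.
interface EventRecorder {
    /** Called once per observed event, before any aggregation happens. */
    void record(String metricName, double value);
}

/** Backend for applications that do not want any metrics overhead. */
class NoOpRecorder implements EventRecorder {
    @Override
    public void record(String metricName, double value) {
        // intentionally empty
    }
}

/** Trivial in-memory backend that only counts events per metric name. */
class CountingRecorder implements EventRecorder {
    private final java.util.Map<String, Long> counts = new java.util.HashMap<>();

    @Override
    public void record(String metricName, double value) {
        counts.merge(metricName, 1L, Long::sum);
    }

    long count(String metricName) {
        return counts.getOrDefault(metricName, 0L);
    }
}

/** Instrumented code depends only on the facade, never on a concrete backend. */
class InstrumentedSender {
    private final EventRecorder recorder;

    InstrumentedSender(EventRecorder recorder) {
        this.recorder = recorder;
    }

    void send(byte[] payload) {
        recorder.record("producer.message-size", payload.length);
    }
}
```

Because the instrumented class sees raw events, each backend is free to choose its own windows, decays, or sampling, which a reporter-level plug-in cannot do.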
> >
> > Martin
> >
> >
> > On 22 Feb 2014, at 01:06, Jay Kreps <jay.kr...@gmail.com> wrote:
> > > Hey guys,
> > >
> > > Just picking up this thread again. I do want to drive a conclusion as I
> > > will run out of work to do on the producer soon and will need to add
> > > metrics of some sort. We can vote on it, but I'm not sure if we actually
> > > got everything discussed.
> > >
> > > Joel, I wasn't fully sure how to interpret your comment. I think you are
> > > saying you are cool with the new metrics package as long as it really is
> > > better. Do you have any comment on whether you think the benefits I
> > > outlined are worth it? I agree with you that we could hold off on a second
> > > repo until someone else would actually want to use our code.
> > >
> > > Jun, I'm not averse to doing a sampling-based histogram and doing some
> > > comparison between the two approaches if you think this approach is
> > > otherwise better.
> > >
> > > Sriram, originally I thought you preferred just sticking to Coda Hale, but
> > > after your follow-up email I wasn't really sure...
> > >
> > > Joe/Clark, yes this code allows pluggable reporting so you could have a
> > > metrics reporter that just wraps each metric in a Coda Hale Gauge if that
> > > is useful. Though obviously if enough people were doing that I would think
> > > it would be worth just using the Coda Hale package directly...
> > >
> > > -Jay
> > >
> > >
> > >
> > >
> > > On Thu, Feb 13, 2014 at 3:34 PM, Clark Breyman <cl...@breyman.com>
> > wrote:
> > >
> > >> Not requiring the client to link Coda/Yammer metrics sounds like a
> > >> compelling reason to pivot to new interfaces. If that's the agreed
> > >> direction, I'm hoping that we'd get the choice of backend to provide
> > (e.g.
> > >> facade on Yammer metrics for those with an investment in that) rather
> > than
> > >> force the new backend.  Having a metrics factory seems better for this
> > than
> > >> directly instantiating the singleton registry.
> > >>
> > >>
> > >> On Thu, Feb 13, 2014 at 2:39 PM, Joe Stein <joe.st...@stealth.ly>
> > wrote:
> > >>
> > >>> Can we leave metrics and have multiple supported KafkaMetricsGroup
> > >>> implementing a yammer based implementation?
> > >>>
> > >>> ProducerRequestStats with your configured analytics group?
> > >>>
> > >>> On Thu, Feb 13, 2014 at 11:37 AM, Jay Kreps <jay.kr...@gmail.com>
> > wrote:
> > >>>
> > >>>> I think we discussed the scala/java stuff more fully previously.
> > >>>> Essentially the client is embedded everywhere. Scala is very
> > >> incompatible
> > >>>> with itself so this makes it very hard to use for people using
> > anything
> > >>>> else in scala. Also Scala stack traces are very confusing. Basically
> > we
> > >>>> thought plain java code would be a lot easier for people to use.
> Even
> > >> if
> > >>>> Scala is more fun to write, that isn't really what we are optimizing
> > >> for.
> > >>>>
> > >>>> -Jay
> > >>>>
> > >>>>
> > >>>> On Thu, Feb 13, 2014 at 8:09 AM, S Ahmed <sahmed1...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Jay, pretty impressive how you just write a 'quick version' like
> that
> > >>> :)
> > >>>>> Not to get off-topic but why didn't you write this in scala?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Wed, Feb 12, 2014 at 6:54 PM, Joel Koshy <jjkosh...@gmail.com>
> > >>> wrote:
> > >>>>>
> > >>>>>> I have not had a chance to review the new metrics code and its
> > >>>>>> features carefully (apart from your write-up), but here are my
> > >>> general
> > >>>>>> thoughts:
> > >>>>>>
> > >>>>>> Implementing a metrics package correctly is difficult; more so for
> > >>>>>> people like me, because I'm not a statistician.  However, if this new
> > >>>>>> package: {(i) functions correctly (and we need to define and prove
> > >>>>>> correctness), (ii) is easy to use, (iii) serves all our current and
> > >>>>>> anticipated monitoring needs, (iv) is not so complex that it
> > >>>>>> becomes a burden to maintain and we are better off with an available
> > >>>>>> library;} then I think it makes sense to embed it and use it within
> > >>>>>> the Kafka code. The main wins are: (i) predictability (no changing
> > >>>>>> APIs and intimate knowledge of the code) and (ii) control with respect
> > >>>>>> to both functionality (e.g., there are hard-coded decay constants in
> > >>>>>> metrics-core 2.x) and correctness (i.e., if we find a bug in the
> > >>>>>> metrics package we have to submit a pull request and wait for it to
> > >>>>>> become mainstream).  I'm not sure it would help very much to pull it
> > >>>>>> into a separate repo because that could potentially annul these
> > >>>>>> benefits.
> > >>>>>>
> > >>>>>> Joel
> > >>>>>>
> > >>>>>> On Wed, Feb 12, 2014 at 02:50:43PM -0800, Jay Kreps wrote:
> > >>>>>>> Sriram,
> > >>>>>>>
> > >>>>>>> Makes sense. I am cool moving this stuff into its own repo if
> > >>> people
> > >>>>>> think
> > >>>>>>> that is better. I'm not sure it would get much contribution but
> > >>> when
> > >>>> I
> > >>>>>>> started messing with this I did have a lot of grand ideas of
> > >> making
> > >>>>>> adding
> > >>>>>>> metrics to a sensor dynamic so you could add more stuff in
> > >>>>> real-time(via
> > >>>>>>> jmx, say) and/or externalize all your metrics and config to a
> > >>>> separate
> > >>>>>> file
> > >>>>>>> like log4j with only the points of instrumentation hard-coded.
> > >>>>>>>
> > >>>>>>> -Jay
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Wed, Feb 12, 2014 at 2:07 PM, Sriram Subramanian <
> > >>>>>>> srsubraman...@linkedin.com> wrote:
> > >>>>>>>
> > >>>>>>>> I am actually neutral to this change. I found the replies were
> > >>> more
> > >>>>>>>> towards the implementation and features so far. I would like
> > >> the
> > >>>>>> community
> > >>>>>>>> to think about the questions below before making a decision. My
> > >>>>>> opinion on
> > >>>>>>>> this is that it has potential to be its own project and it
> > >> would
> > >>>>>> attract
> > >>>>>>>> developers who are specifically interested in contributing to
> > >>>>> metrics.
> > >>>>>> I
> > >>>>>>>> am skeptical that the Kafka contributors would focus on
> > >> improving
> > >>>>> this
> > >>>>>>>> library (apart from bug fixes) instead of
> > >> developing/contributing
> > >>>> to
> > >>>>>> other
> > >>>>>>>> core pieces. It would be useful to continue and keep it
> > >> decoupled
> > >>>>> from
> > >>>>>>>> rest of Kafka (if it resides in the Kafka code base.) so that
> > >> we
> > >>>> can
> > >>>>>> move
> > >>>>>>>> it out anytime to its own project.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 2/12/14 1:21 PM, "Jay Kreps" <jay.kr...@gmail.com> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hey Sriram,
> > >>>>>>>>>
> > >>>>>>>>> Not sure if these are actually meant as questions or more veiled
> > >>>>>>>>> comments. In any case I tried to give my 2 cents inline.
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Feb 11, 2014 at 11:12 PM, Sriram Subramanian <
> > >>>>>>>>> srsubraman...@linkedin.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I think answering the questions below would help to make a
> > >>>> better
> > >>>>>>>>>> decision. I am all for writing better code and having
> > >> superior
> > >>>>>>>>>> functionalities but it is worth thinking about stuff outside
> > >>>> just
> > >>>>>> code
> > >>>>>>>>>> in
> > >>>>>>>>>> this case -
> > >>>>>>>>>>
> > >>>>>>>>>> 1. Does metric form a core piece of kafka? Does it help
> > >> kafka
> > >>>>>> greatly in
> > >>>>>>>>>> providing better core functionalities? I would always like a
> > >>>>>> project to
> > >>>>>>>>>> do
> > >>>>>>>>>> one thing really well. Metrics is a non trivial amount of
> > >>> code.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Metrics are obviously important, and obviously improving our
> > >>>> metrics
> > >>>>>>>>> system
> > >>>>>>>>> would be good. That said this may or may not be better, and
> > >> even
> > >>>> if
> > >>>>>> it is
> > >>>>>>>>> better that betterness might not outweigh other
> > >> considerations.
> > >>>> That
> > >>>>>> is
> > >>>>>>>>> what we are discussing.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> 2. Does it make sense to be part of Kafka or its own project? If this
> > >>>>>>>>>> metrics library has the potential to be better than metrics-core, I
> > >>>>>>>>>> would be interested in seeing other projects take advantage of it.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> It could be either.
> > >>>>>>>>>
> > >>>>>>>>> 3. Can Kafka maintain this library as new members join and old
> > >>>>> members
> > >>>>>>>>>> leave? Would this be a piece of code that no one (in Kafka)
> > >> in
> > >>>> the
> > >>>>>>>>>> future
> > >>>>>>>>>> spends time improving if the original author left?
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> I am not going anywhere in the near term, but if I did, yes,
> > >>> this
> > >>>>>> would be
> > >>>>>>>>> like any other code we have. As with yammer metrics or any
> > >> other
> > >>>>> code
> > >>>>>> at
> > >>>>>>>>> that point we would either use it as is or someone would
> > >> improve
> > >>>> it.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> 4. Does it affect the schedule of producer rewrite? This
> > >> needs
> > >>>> its
> > >>>>>> own
> > >>>>>>>>>> stabilization and modification to existing metric dashboards
> > >>> if
> > >>>>> the
> > >>>>>>>>>> format
> > >>>>>>>>>> is changed. Many times such cost are not factored in and a
> > >>>> project
> > >>>>>> loses
> > >>>>>>>>>> time before realizing the extra time required to make a
> > >>> library
> > >>>> as
> > >>>>>> this
> > >>>>>>>>>> operational.
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Probably not. The metrics are going to change regardless of
> > >>>> whether
> > >>>>>> we use
> > >>>>>>>>> the same library or not. If we think this is better I don't
> > >> mind
> > >>>>>> putting
> > >>>>>>>>> in
> > >>>>>>>>> a little extra effort to get there.
> > >>>>>>>>>
> > >>>>>>>>> Irrespective I think this is probably not the right thing to
> > >>>>> optimize
> > >>>>>> for.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> I am sure we can do better when we write code to a specific
> > >>> use
> > >>>>>> case (in
> > >>>>>>>>>> this case, kafka) rather than building a generic library
> > >> that
> > >>>>> suits
> > >>>>>> all
> > >>>>>>>>>> (metrics-core) but I would like us to have answers to the
> > >>>>> questions
> > >>>>>>>>>> above
> > >>>>>>>>>> and be prepared before we proceed to support this with the
> > >>>>> producer
> > >>>>>>>>>> rewrite.
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> Naturally we are all considering exactly these things, that is
> > >>>>>> exactly the
> > >>>>>>>>> reason I started the thread.
> > >>>>>>>>>
> > >>>>>>>>> -Jay
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>>> On 2/11/14 6:28 PM, "Jun Rao" <jun...@gmail.com> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Thanks for the detailed write-up. It's well thought
> > >> through.
> > >>> A
> > >>>>> few
> > >>>>>>>>>>> comments:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. I have a couple of concerns on the percentiles. The first issue is
> > >>>>>>>>>>> that it requires the user to know the value range. Since the range for
> > >>>>>>>>>>> things like message size (in millions) is quite different from those
> > >>>>>>>>>>> like request time (less than 100), it's going to be hard to pick a good
> > >>>>>>>>>>> global default range. Different apps could be dealing with different
> > >>>>>>>>>>> message sizes, so they will probably have to customize the range.
> > >>>>>>>>>>> Another issue is that it can only report values at the bucket
> > >>>>>>>>>>> boundaries. So, if you have 1000 buckets and a value range of 1 million,
> > >>>>>>>>>>> you will only see 1000 possible values as the quantile, which is
> > >>>>>>>>>>> probably too sparse. The implementation of histogram in metrics-core
> > >>>>>>>>>>> keeps a fixed number of samples, which avoids both issues.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. We need to document the 3-part metrics names better since it's not
> > >>>>>>>>>>> obvious what the convention is. Also, currently the name of the sensor
> > >>>>>>>>>>> and the metrics defined in it are independent. Would it make sense to
> > >>>>>>>>>>> have the sensor name be a prefix of the metric name?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Overall, this approach seems to be cleaner than metrics-core by
> > >>>>>>>>>>> decoupling measuring and reporting. The main benefit of metrics-core
> > >>>>>>>>>>> seems to be the existing reporters. Since not that many people voted for
> > >>>>>>>>>>> metrics-core, I am ok with going with the new implementation. My only
> > >>>>>>>>>>> recommendation is to address the concern on percentiles.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jun
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Feb 6, 2014 at 12:51 PM, Jay Kreps <
> > >>>> jay.kr...@gmail.com>
> > >>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hey guys,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I wanted to kick off a quick discussion of metrics with respect to the
> > >>>>>>>>>>>> new producer and consumer (and potentially the server).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At a high level I think there are three approaches we
> > >> could
> > >>>>> take:
> > >>>>>>>>>>>> 1. Plain vanilla JMX
> > >>>>>>>>>>>> 2. Use Coda Hale (AKA Yammer) Metrics
> > >>>>>>>>>>>> 3. Do our own metrics (with JMX as one output)
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1. Has the advantage that JMX is the most commonly used java thing and
> > >>>>>>>>>>>> plugs in reasonably to most metrics systems. JMX is included in the JDK
> > >>>>>>>>>>>> so it doesn't impose any additional dependencies on clients. It has the
> > >>>>>>>>>>>> disadvantage that plain vanilla JMX is a pain to use. We would need a
> > >>>>>>>>>>>> bunch of helper code for maintaining counters to make this reasonable.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 2. Coda Hale metrics is pretty good and broadly used. It supports JMX
> > >>>>>>>>>>>> output as well as direct output to many other types of systems. The
> > >>>>>>>>>>>> primary downside we have had with Coda Hale has to do with the clients
> > >>>>>>>>>>>> and library incompatibilities. We are currently on an older more popular
> > >>>>>>>>>>>> version. The newer version is a rewrite of the APIs and is incompatible.
> > >>>>>>>>>>>> Originally these were totally incompatible and people had to choose one
> > >>>>>>>>>>>> or the other. I think that has been improved so now the new version is a
> > >>>>>>>>>>>> totally different package. But even in this case you end up with both
> > >>>>>>>>>>>> versions if you use Kafka and we are on a different version than you
> > >>>>>>>>>>>> which is going to be pretty inconvenient.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 3. Doing our own has the downside of potentially reinventing the wheel,
> > >>>>>>>>>>>> and potentially needing to work out any bugs in our code. The upsides
> > >>>>>>>>>>>> would depend on how good the reinvention was. As it happens I did a
> > >>>>>>>>>>>> quick (~900 loc) version of a metrics library that is under
> > >>>>>>>>>>>> kafka.common.metrics. I think it has some advantages over the Yammer
> > >>>>>>>>>>>> metrics package for our usage beyond just not causing incompatibilities.
> > >>>>>>>>>>>> I will describe this code so we can discuss the pros and cons. Although
> > >>>>>>>>>>>> I favor this approach I have no emotional attachment and wouldn't be too
> > >>>>>>>>>>>> sad if I ended up deleting it.
> > >>>>>>>>>>>> Here are javadocs for this code, though I haven't written
> > >>>> much
> > >>>>>>>>>>>> documentation yet since I might end up deleting it:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Here is a quick overview of this library.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> There are three main public interfaces:
> > >>>>>>>>>>>>  Metrics - This is a repository of metrics being tracked.
> > >>>>>>>>>>>>  Metric - A single, named numerical value being measured (e.g. a
> > >>>>>>>>>>>>           counter).
> > >>>>>>>>>>>>  Sensor - This is a thing that records values and updates zero or more
> > >>>>>>>>>>>>           metrics.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> So let's say we want to track four values about message sizes;
> > >>>>>>>>>>>> specifically say we want to record the average, the maximum, the total
> > >>>>>>>>>>>> rate of bytes being sent, and a count of messages. Then we would do
> > >>>>>>>>>>>> something like this:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>   // setup code
> > >>>>>>>>>>>>   Metrics metrics = new Metrics(); // this is a global "singleton"
> > >>>>>>>>>>>>   Sensor sensor = metrics.sensor("kafka.producer.message.sizes");
> > >>>>>>>>>>>>   sensor.add("kafka.producer.message-size.avg", new Avg());
> > >>>>>>>>>>>>   sensor.add("kafka.producer.message-size.max", new Max());
> > >>>>>>>>>>>>   sensor.add("kafka.producer.bytes-sent-per-sec", new Rate());
> > >>>>>>>>>>>>   sensor.add("kafka.producer.message-count", new Count());
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>   // now when we get a message we do this
> > >>>>>>>>>>>>   sensor.record(messageSize);
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The above code creates the global metrics repository, creates a single
> > >>>>>>>>>>>> Sensor, and defines four named metrics that are updated by that Sensor.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Like Yammer Metrics (YM) I allow you to plug in "reporters", including a
> > >>>>>>>>>>>> JMX reporter. Unlike the Coda Hale JMX reporter the reporter I have keys
> > >>>>>>>>>>>> off the metric names not the Sensor names, which I think is an
> > >>>>>>>>>>>> improvement--I just use the convention that the last portion of the
> > >>>>>>>>>>>> name is the attribute name, the second to last is the mbean name, and
> > >>>>>>>>>>>> the rest is the package. So in the above example there is a message-size
> > >>>>>>>>>>>> mbean that has avg and max attributes and a producer mbean that has
> > >>>>>>>>>>>> bytes-sent-per-sec and message-count attributes. This is nice because
> > >>>>>>>>>>>> you can logically group the values reported irrespective of where in the
> > >>>>>>>>>>>> program they are computed--that is an mbean can logically group
> > >>>>>>>>>>>> attributes computed off different sensors. This means you can report
> > >>>>>>>>>>>> values by logical subsystem.
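
Applying the stated naming convention literally (last portion is the attribute, second to last is the mbean, the rest is the package) can be sketched in Java as follows. `MetricNameParts` is a hypothetical helper written for this illustration; it is not claimed to be how the JMX reporter in kafka.common.metrics actually parses names.

```java
// Hypothetical helper illustrating the naming convention; not the reporter's code.
class MetricNameParts {
    final String packageName;
    final String mbeanName;
    final String attributeName;

    MetricNameParts(String packageName, String mbeanName, String attributeName) {
        this.packageName = packageName;
        this.mbeanName = mbeanName;
        this.attributeName = attributeName;
    }

    /** Splits a dotted metric name into package, mbean and attribute. */
    static MetricNameParts parse(String metricName) {
        int lastDot = metricName.lastIndexOf('.');
        if (lastDot <= 0 || lastDot == metricName.length() - 1) {
            throw new IllegalArgumentException("Need at least <mbean>.<attribute>: " + metricName);
        }
        // Search backwards for the dot that precedes the mbean portion.
        int secondLastDot = metricName.lastIndexOf('.', lastDot - 1);
        String attribute = metricName.substring(lastDot + 1);
        String mbean = metricName.substring(secondLastDot + 1, lastDot);
        String pkg = secondLastDot < 0 ? "" : metricName.substring(0, secondLastDot);
        return new MetricNameParts(pkg, mbean, attribute);
    }
}
```

Under this reading, "kafka.producer.message-size.avg" lands in package "kafka.producer", mbean "message-size", attribute "avg", while "kafka.producer.bytes-sent-per-sec" lands in package "kafka", mbean "producer".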
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I also allow the concept of hierarchical Sensors which I think is a good
> > >>>>>>>>>>>> convenience. I have noticed a common pattern in systems where you need
> > >>>>>>>>>>>> to roll up the same values along different dimensions. A simple example
> > >>>>>>>>>>>> is metrics about qps, data rate, etc on the broker. These we want to
> > >>>>>>>>>>>> capture in aggregate, but also broken down by topic-id. You can do this
> > >>>>>>>>>>>> purely by defining the sensor hierarchy:
> > >>>>>>>>>>>>   Sensor allSizes = metrics.sensor("kafka.producer.sizes");
> > >>>>>>>>>>>>   Sensor topicSizes = metrics.sensor("kafka.producer." + topic + ".sizes", allSizes);
> > >>>>>>>>>>>> Now each actual update will go to the appropriate topicSizes sensor
> > >>>>>>>>>>>> (based on the topic name), but allSizes metrics will get updated too. I
> > >>>>>>>>>>>> also support multiple parents for each sensor as well as multiple layers
> > >>>>>>>>>>>> of hierarchy, so you can define a more elaborate DAG of sensors. An
> > >>>>>>>>>>>> example of how this would be useful is if you wanted to record your
> > >>>>>>>>>>>> metrics broken down by topic AND client id as well as the global
> > >>>>>>>>>>>> aggregate.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Each metric can take a configurable Quota value which allows us to limit
> > >>>>>>>>>>>> the maximum value of that sensor. This is intended for use on the server
> > >>>>>>>>>>>> as part of our Quota implementation. The way this works is that you
> > >>>>>>>>>>>> record metrics as usual:
> > >>>>>>>>>>>>   mySensor.record(42.0);
> > >>>>>>>>>>>> However if this event occurrence causes one of the metrics to exceed its
> > >>>>>>>>>>>> maximum allowable value (the quota) this call will throw a
> > >>>>>>>>>>>> QuotaViolationException. The cool thing about this is that it means we
> > >>>>>>>>>>>> can define quotas on anything we capture metrics for, which I think is
> > >>>>>>>>>>>> pretty cool.
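
As a rough Java sketch of the quota behaviour described above: `QuotaSensor` and the locally defined `QuotaViolationException` are stand-ins written for this example, and the quota is applied to a simple running total rather than a windowed metric. The choice to count the event before checking the bound is also an assumption of the sketch.

```java
// Stand-in exception for illustration; defined locally, not imported from Kafka.
class QuotaViolationException extends RuntimeException {
    QuotaViolationException(String message) {
        super(message);
    }
}

// Hypothetical sensor that enforces an upper bound on its running total.
class QuotaSensor {
    private final double upperBound;
    private double total;

    QuotaSensor(double upperBound) {
        this.upperBound = upperBound;
    }

    /** Records the value first, then throws if the total now exceeds the quota. */
    void record(double value) {
        total += value;
        if (total > upperBound) {
            throw new QuotaViolationException(
                "total " + total + " exceeds quota " + upperBound);
        }
    }

    double total() {
        return total;
    }
}
```

In this version the violating event still counts toward the metric; stopping the count once the quota is hit would be an equally plausible policy and may need to differ per metric.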
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Another question is how to handle windowing of the values. Metrics want
> > >>>>>>>>>>>> to record the "current" value, but the definition of current is
> > >>>>>>>>>>>> inherently nebulous. One of the obvious gotchas is that if you define
> > >>>>>>>>>>>> "current" to be a number of events you can end up measuring an
> > >>>>>>>>>>>> arbitrarily long window of time if the event rate is low (e.g. you think
> > >>>>>>>>>>>> you are getting 50 messages/sec because that was the rate yesterday when
> > >>>>>>>>>>>> all events stopped).
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Here is how I approach this. All the metrics use the same windowing
> > >>>>>>>>>>>> approach. We define a single window by a length of time or number of
> > >>>>>>>>>>>> values (you can use either or both--if both the window ends when
> > >>>>>>>>>>>> *either* the time bound or event bound is hit). The typical problem with
> > >>>>>>>>>>>> hard window boundaries is that at the beginning of the window you have
> > >>>>>>>>>>>> no data and the first few samples are too small to be a valid sample.
> > >>>>>>>>>>>> (Consider if you were keeping an avg and the first value in the window
> > >>>>>>>>>>>> happens to be very very high, if you check the avg at this exact time
> > >>>>>>>>>>>> you will conclude the avg is very high but on a sample size of one).
> > >>>>>>>>>>>> One simple fix would be to always report the last complete window,
> > >>>>>>>>>>>> however this is not appropriate here because (1) we want to drive quotas
> > >>>>>>>>>>>> off it so it needs to be current, and (2) since this is for monitoring
> > >>>>>>>>>>>> you kind of care more about the current state. The ideal solution here
> > >>>>>>>>>>>> would be to define a backwards looking sliding window from the present,
> > >>>>>>>>>>>> but many statistics are actually very hard to compute in this model
> > >>>>>>>>>>>> without retaining all the values which would be hopelessly inefficient.
> > >>>>>>>>>>>> My solution to this is to keep a configurable number of windows (default
> > >>>>>>>>>>>> is two) and combine them for the estimate. So in a two sample case
> > >>>>>>>>>>>> depending on when you ask you have between one and two complete samples
> > >>>>>>>>>>>> worth of data to base the answer off of. Provided the sample window is
> > >>>>>>>>>>>> large enough to get a valid result this satisfies both of my criteria of
> > >>>>>>>>>>>> incorporating the most recent data and having reasonable variance at all
> > >>>>>>>>>>>> times.
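
The "keep a few windows and combine them" idea can be sketched as a windowed average in Java. `WindowedAvg` is a hypothetical class for illustration: it handles only the time bound (not the event-count bound), takes the clock as an explicit `nowMs` argument so the behaviour is deterministic, and is not the actual implementation discussed in this thread.

```java
// Hypothetical multi-window average; only the time bound is modelled.
class WindowedAvg {
    private final long windowMs;
    private final double[] sums;
    private final long[] counts;
    private final long[] startMs;
    private int current = 0;

    WindowedAvg(int numWindows, long windowMs, long nowMs) {
        this.windowMs = windowMs;
        this.sums = new double[numWindows];
        this.counts = new long[numWindows];
        this.startMs = new long[numWindows];
        this.startMs[0] = nowMs;
    }

    /** Adds a value, first rolling to the next window if the current one expired. */
    void record(double value, long nowMs) {
        if (nowMs - startMs[current] >= windowMs) {
            // Reuse the oldest slot as the new current window.
            current = (current + 1) % sums.length;
            sums[current] = 0.0;
            counts[current] = 0;
            startMs[current] = nowMs;
        }
        sums[current] += value;
        counts[current]++;
    }

    /** Average over all retained windows that are still recent enough. */
    double measure(long nowMs) {
        long maxAgeMs = windowMs * sums.length;
        double total = 0.0;
        long count = 0;
        for (int i = 0; i < sums.length; i++) {
            if (counts[i] > 0 && nowMs - startMs[i] < maxAgeMs) {
                total += sums[i];
                count += counts[i];
            }
        }
        return count == 0 ? 0.0 : total / count;
    }
}
```

With two windows, a single very large value at the start of a fresh window is averaged together with the previous window's samples instead of standing alone, which is the variance problem the paragraph above describes.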
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Another approach is to use an exponential weighting scheme to combine
> > >>>>>>>>>>>> all history but emphasize the recent past. I have not done this as it
> > >>>>>>>>>>>> has a lot of issues for practical operational metrics. I'd be happy to
> > >>>>>>>>>>>> elaborate on this if anyone cares...
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The window size for metrics has a global default which can be overridden
> > >>>>>>>>>>>> at either the sensor or individual metric level.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> In addition to these time series values the user can directly expose
> > >>>>>>>>>>>> some method of their choosing JMX-style by implementing the Measurable
> > >>>>>>>>>>>> interface and registering that value. E.g.
> > >>>>>>>>>>>>  metrics.addMetric("my.metric", new Measurable() {
> > >>>>>>>>>>>>    public double measure(MetricConfig config, long now) {
> > >>>>>>>>>>>>       // calls a method on the enclosing class
> > >>>>>>>>>>>>       return calculateValueToExpose();
> > >>>>>>>>>>>>    }
> > >>>>>>>>>>>>  });
> > >>>>>>>>>>>> This is useful for exposing things like the accumulator free memory.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> The set of metrics is extensible, new metrics can be added by just
> > >>>>>>>>>>>> implementing the appropriate interfaces and registering with a sensor. I
> > >>>>>>>>>>>> implement the following metrics:
> > >>>>>>>>>>>>  total - the sum of all values from the given sensor
> > >>>>>>>>>>>>  count - a windowed count of values from the sensor
> > >>>>>>>>>>>>  avg - the sample average within the windows
> > >>>>>>>>>>>>  max - the max over the windows
> > >>>>>>>>>>>>  min - the min over the windows
> > >>>>>>>>>>>>  rate - the rate in the windows (e.g. the total or count divided by the
> > >>>>>>>>>>>>         elapsed time)
> > >>>>>>>>>>>>  percentiles - a collection of percentiles computed over the window
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> My approach to percentiles is a little different from the yammer metrics
> > >>>>>>>>>>>> package. My complaint about the yammer metrics approach is that it uses
> > >>>>>>>>>>>> rather expensive sampling and uses kind of a lot of memory to get a
> > >>>>>>>>>>>> reasonable sample. This is problematic for per-topic measurements.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Instead I use a fixed range for the histogram (e.g. 0.0 to 30000.0)
> > >>>>>>>>>>>> which directly allows you to specify the desired memory use. Any value
> > >>>>>>>>>>>> below the minimum is recorded as -Infinity and any value above the
> > >>>>>>>>>>>> maximum as +Infinity. I think this is okay as all metrics have an
> > >>>>>>>>>>>> expected range except for latency which can be arbitrarily large, but
> > >>>>>>>>>>>> for very high latency there is no need to model it exactly (e.g. 30
> > >>>>>>>>>>>> seconds + really is effectively infinite). Within the range values are
> > >>>>>>>>>>>> recorded in buckets which can be either fixed width or increasing width.
> > >>>>>>>>>>>> The increasing width is analogous to the idea of significant figures,
> > >>>>>>>>>>>> that is if your value is in the range 0-10 you might want to be accurate
> > >>>>>>>>>>>> to within 1ms, but if it is 20000 there is no need to be so accurate. I
> > >>>>>>>>>>>> implemented a linear bucket size where the Nth bucket has width
> > >>>>>>>>>>>> proportional to N. An exponential bucket size would also be sensible and
> > >>>>>>>>>>>> could likely be derived directly from the floating point representation
> > >>>>>>>>>>>> of the value.
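
One way to realise "the Nth bucket has width proportional to N" is sketched below in Java. If bucket i (counting from 0) has width scale * (i + 1), its lower boundary is scale * i * (i + 1) / 2, so the bucket for a value can be found by inverting that quadratic. `LinearBuckets`, the lower bound of 0, and the use of -1 and numBuckets as underflow and overflow markers are assumptions of this sketch, not details confirmed by the thread.

```java
// Hypothetical linear-width bucket scheme over the range [0, max).
class LinearBuckets {
    private final int numBuckets;
    private final double max;
    private final double scale; // width of bucket 0

    LinearBuckets(int numBuckets, double max) {
        this.numBuckets = numBuckets;
        this.max = max;
        // Widths scale*1, scale*2, ..., scale*numBuckets must add up to max.
        this.scale = max / (numBuckets * (numBuckets + 1) / 2.0);
    }

    /** Lower boundary of the given 0-based bucket. */
    double lowerBound(int bucket) {
        return scale * bucket * (bucket + 1) / 2.0;
    }

    /** Returns -1 for underflow, numBuckets for overflow, else the bucket index. */
    int bucketIndex(double value) {
        if (value < 0.0) {
            return -1;
        }
        if (value >= max) {
            return numBuckets;
        }
        // Largest i such that scale * i * (i + 1) / 2 <= value.
        return (int) ((Math.sqrt(8.0 * value / scale + 1.0) - 1.0) / 2.0);
    }
}
```

With 5 buckets over [0, 15) the boundaries fall at 0, 1, 3, 6 and 10, so small values get fine resolution and large values coarse resolution, while memory use stays fixed at the chosen number of buckets.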
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'd like to get some feedback on this metrics code and make a decision
> > >>>>>>>>>>>> on whether we want to use it before I actually go ahead and add all the
> > >>>>>>>>>>>> instrumentation in the code (otherwise I'll have to redo it if we switch
> > >>>>>>>>>>>> approaches). So the next topic of discussion will be which actual
> > >>>>>>>>>>>> metrics to add.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> -Jay
> > >>>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>
