Jay - I was thinking of a pure stub rather than just wrapping Kafka metrics in a Coda gauge. I'd like the Timers, Meters, etc. to still be Coda meters - that way the windows, exponential decays, etc. are comparable to the rest of the Coda metrics in our applications. At the same time, I don't want to force Coda timers (or any other timers) on an app that won't make good use of them.

Thanks again,
C
On Sat, Feb 22, 2014 at 9:25 AM, Martin Kleppmann <mkleppm...@linkedin.com> wrote:

Not sure if you want yet another opinion added to the pile -- but since I had a similar problem on another project recently, I thought I'd weigh in. (On that project we were originally using Coda's library, but then switched to rolling our own metrics implementation because we needed to do a few things differently.)

1. Problems we encountered with Coda's library: it uses an exponentially weighted moving average (EWMA) for rates (e.g. messages/sec), and exponentially biased reservoir sampling for histograms (percentiles, averages). Those methods of calculation work well for events with a consistently high volume, but they give strange and misleading results for events that are bursty or rare (e.g. error rates). We found that a fixed-size window gives more predictable, easier-to-interpret results.

2. In defence of Coda's library, I think its histogram implementation is a good trade-off of memory for accuracy; I'm not totally convinced that your proposal (counts of events in a fixed set of buckets) would be much better. We would have to do some math to work out the expected accuracy in each case. The reservoir sampling can be configured to use a smaller sample if the default of 1028 samples is too expensive. Reservoir sampling also has the advantage that you don't need to hard-code a bucket distribution.

3. Quotas are an interesting use case. However, I'm not wild about using a QuotaViolationException for control flow -- I think an explicit conditional would be nicer than having to catch an exception. One question in that context: if a quota is exceeded, do you still want to count the event towards the metric, or do you want to stop counting it until the quota is replenished? The answer may depend on the particular metric.

4. If you decide to go with Coda's library, I would advocate isolating the dependency into a separate module and using it via a facade -- somewhat like using SLF4J instead of Log4j directly. It's ok for Coda's library to be the default metrics implementation, but it should be easy to swap it out for something different in case someone has a version conflict or differing requirements. The facade should be at a low level (individual events), not at the reporter level (which deals with pre-aggregated values, and is already pluggable).

5. If it's useful, I can probably contribute my simple (but imho effective) metrics library for embedding into Kafka. It uses reservoir sampling for percentiles, like Coda's library, but uses a fixed-size window instead of an exponential bias, which avoids weird behaviour on bursty metrics.

In summary, I would advocate one of the following approaches:
- Coda Hale library via facade (allowing it to be swapped for something else), or
- Own metrics implementation, provided that we have confidence in its implementation of percentiles.

Martin
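For illustration, a minimal sketch of the SLF4J-style facade Martin describes -- the interface and class names here are hypothetical, and the Coda Hale binding is just one possible backend (using the metrics-core 2.x API):

    // Hypothetical low-level metrics facade: application code records
    // individual events; the aggregating backend is a pluggable binding.
    public interface MetricRecorder {
        void count(String name);                 // an event occurred
        void record(String name, double value);  // a measured value (size, latency, ...)
    }

    // One possible binding, backed by Coda Hale (Yammer) metrics-core 2.x.
    public class CodaHaleRecorder implements MetricRecorder {
        public void count(String name) {
            com.yammer.metrics.Metrics.newMeter(
                new com.yammer.metrics.core.MetricName("kafka", "facade", name),
                "events", java.util.concurrent.TimeUnit.SECONDS).mark();
        }

        public void record(String name, double value) {
            com.yammer.metrics.Metrics.newHistogram(
                new com.yammer.metrics.core.MetricName("kafka", "facade", name), true)
                .update((long) value);
        }
    }

A different backend (or a no-op stub, as in the message at the top of the thread) would implement the same two methods without dragging in the metrics-core jar.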
On 22 Feb 2014, at 01:06, Jay Kreps <jay.kr...@gmail.com> wrote:

Hey guys,

Just picking up this thread again. I do want to drive a conclusion, as I will run out of work to do on the producer soon and will need to add metrics of some sort. We can vote on it, but I'm not sure we actually got everything discussed.

Joel, I wasn't fully sure how to interpret your comment. I think you are saying you are cool with the new metrics package as long as it really is better. Do you have any comment on whether you think the benefits I outlined are worth it? I agree with you that we could hold off on a second repo until someone else would actually want to use our code.

Jun, I'm not averse to doing a sampling-based histogram and doing some comparison between the two approaches if you think this approach is otherwise better.

Sriram, originally I thought you preferred just sticking to Coda Hale, but after your follow-up email I wasn't really sure...

Joe/Clark, yes, this code allows pluggable reporting, so you could have a metrics reporter that just wraps each metric in a Coda Hale Gauge if that is useful. Though obviously if enough people were doing that, I would think it would be worth just using the Coda Hale package directly...

-Jay
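As a rough sketch of the reporter Jay mentions here -- the KafkaMetric shape is an assumption about the new library's API, not confirmed code -- each metric could be exposed as a read-only Coda Hale Gauge:

    import com.yammer.metrics.Metrics;
    import com.yammer.metrics.core.Gauge;
    import com.yammer.metrics.core.MetricName;

    public class CodaHaleGaugeReporter {
        // Assumed shape of a metric in the new library: a dotted name
        // plus a way to read its current value.
        public interface KafkaMetric {
            String name();   // e.g. "kafka.producer.message-size.avg"
            double value();
        }

        public void register(final KafkaMetric metric) {
            // Split "package.mbean.attribute" following the naming
            // convention Jay describes later in this thread.
            String name = metric.name();
            int lastDot = name.lastIndexOf('.');
            int secondDot = name.lastIndexOf('.', lastDot - 1);
            String group = name.substring(0, secondDot);          // "kafka.producer"
            String type = name.substring(secondDot + 1, lastDot); // "message-size"
            String attr = name.substring(lastDot + 1);            // "avg"
            Metrics.newGauge(new MetricName(group, type, attr),
                new Gauge<Double>() {
                    public Double value() {
                        return metric.value(); // read the current windowed value
                    }
                });
        }
    }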
On Thu, Feb 13, 2014 at 3:34 PM, Clark Breyman <cl...@breyman.com> wrote:

Not requiring the client to link Coda/Yammer metrics sounds like a compelling reason to pivot to new interfaces. If that's the agreed direction, I'm hoping that we'd get the choice of backend to provide (e.g. a facade on Yammer metrics for those with an investment in that) rather than being forced onto the new backend. Having a metrics factory seems better for this than directly instantiating the singleton registry.

On Thu, Feb 13, 2014 at 2:39 PM, Joe Stein <joe.st...@stealth.ly> wrote:

Can we leave metrics as they are and have multiple supported KafkaMetricsGroup implementations, including a Yammer-based implementation? ProducerRequestStats with your configured analytics group?

On Thu, Feb 13, 2014 at 11:37 AM, Jay Kreps <jay.kr...@gmail.com> wrote:

I think we discussed the Scala/Java stuff more fully previously. Essentially the client is embedded everywhere. Scala is very incompatible with itself, so this makes it very hard to use for people using anything else in Scala. Also, Scala stack traces are very confusing. Basically we thought plain Java code would be a lot easier for people to use. Even if Scala is more fun to write, that isn't really what we are optimizing for.

-Jay

On Thu, Feb 13, 2014 at 8:09 AM, S Ahmed <sahmed1...@gmail.com> wrote:

Jay, pretty impressive how you just write a 'quick version' like that :) Not to get off-topic, but why didn't you write this in Scala?

On Wed, Feb 12, 2014 at 6:54 PM, Joel Koshy <jjkosh...@gmail.com> wrote:

I have not had a chance to review the new metrics code and its features carefully (apart from your write-up), but here are my general thoughts:

Implementing a metrics package correctly is difficult; more so for people like me, because I'm not a statistician. However, if this new package (i) functions correctly (and we need to define and prove correctness), (ii) is easy to use, (iii) serves all our current and anticipated monitoring needs, and (iv) is not so complex that it becomes a burden to maintain and we'd be better off with an available library, then I think it makes sense to embed it and use it within the Kafka code. The main wins are: (i) predictability (no changing APIs and intimate knowledge of the code) and (ii) control with respect to both functionality (e.g., there are hard-coded decay constants in metrics-core 2.x) and correctness (i.e., with an external library, if we find a bug in the metrics package we have to submit a pull request and wait for it to become mainstream). I'm not sure it would help very much to pull it into a separate repo, because that could potentially annul these benefits.

Joel
On Wed, Feb 12, 2014 at 02:50:43PM -0800, Jay Kreps wrote:

Sriram,

Makes sense. I am cool moving this stuff into its own repo if people think that is better. I'm not sure it would get much contribution, but when I started messing with this I did have a lot of grand ideas of making the metrics attached to a sensor dynamic, so you could add more stuff in real time (via JMX, say) and/or externalize all your metrics and config to a separate file, like log4j, with only the points of instrumentation hard-coded.

-Jay

On Wed, Feb 12, 2014 at 2:07 PM, Sriram Subramanian <srsubraman...@linkedin.com> wrote:

I am actually neutral to this change. I found the replies were more towards the implementation and features so far. I would like the community to think about the questions below before making a decision. My opinion is that it has the potential to be its own project, and it would attract developers who are specifically interested in contributing to metrics. I am skeptical that the Kafka contributors would focus on improving this library (apart from bug fixes) instead of developing/contributing to other core pieces. It would be useful to keep it decoupled from the rest of Kafka (if it resides in the Kafka code base) so that we can move it out to its own project at any time.

On 2/12/14 1:21 PM, "Jay Kreps" <jay.kr...@gmail.com> wrote:

Hey Sriram,

Not sure if these are actually meant as questions or more veiled comments. In any case I tried to give my 2 cents inline.

On Tue, Feb 11, 2014 at 11:12 PM, Sriram Subramanian <srsubraman...@linkedin.com> wrote:

> I think answering the questions below would help to make a better decision. I am all for writing better code and having superior functionality, but it is worth thinking about stuff outside just code in this case:
>
> 1. Do metrics form a core piece of Kafka? Do they help Kafka greatly in providing better core functionality? I would always like a project to do one thing really well. Metrics is a non-trivial amount of code.

Metrics are obviously important, and obviously improving our metrics system would be good. That said, this may or may not be better, and even if it is better, that betterness might not outweigh other considerations. That is what we are discussing.

> 2. Does it make sense to be part of Kafka or its own project? If this metrics library has the potential to be better than metrics-core, I would be interested in seeing other projects take advantage of it.

It could be either.

> 3. Can Kafka maintain this library as new members join and old members leave? Would this be a piece of code that no one (in Kafka) spends time improving in the future if the original author left?

I am not going anywhere in the near term, but if I did, yes, this would be like any other code we have. As with Yammer metrics or any other code, at that point we would either use it as-is or someone would improve it.

> 4. Does it affect the schedule of the producer rewrite? This needs its own stabilization, and modification of existing metric dashboards if the format is changed. Many times such costs are not factored in, and a project loses time before realizing the extra effort required to make a library like this operational.

Probably not. The metrics are going to change regardless of whether we use the same library or not. If we think this is better, I don't mind putting in a little extra effort to get there.

Irrespective, I think this is probably not the right thing to optimize for.

> I am sure we can do better when we write code for a specific use case (in this case, Kafka) rather than building a generic library that suits all (metrics-core), but I would like us to have answers to the questions above and be prepared before we proceed to support this with the producer rewrite.

Naturally we are all considering exactly these things; that is exactly the reason I started the thread.

-Jay
On 2/11/14 6:28 PM, "Jun Rao" <jun...@gmail.com> wrote:

Thanks for the detailed write-up. It's well thought through. A few comments:

1. I have a couple of concerns about the percentiles. The first issue is that they require the user to know the value range. Since the range for things like message size (in the millions) is quite different from things like request time (less than 100), it's going to be hard to pick a good global default range. Different apps could be dealing with different message sizes, so they will probably have to customize the range. Another issue is that they can only report values at the bucket boundaries. So, if you have 1000 buckets and a value range of 1 million, you will only see 1000 possible values as the quantile, which is probably too sparse. The implementation of the histogram in metrics-core keeps a fixed-size set of samples, which avoids both issues.

2. We need to document the 3-part metric names better, since it's not obvious what the convention is. Also, currently the name of a sensor and the names of the metrics defined in it are independent. Would it make sense to have the sensor name be a prefix of the metric name?

Overall, this approach seems to be cleaner than metrics-core by decoupling measuring and reporting. The main benefit of metrics-core seems to be the existing reporters. Since not that many people voted for metrics-core, I am ok with going with the new implementation. My only recommendation is to address the concern on percentiles.

Thanks,

Jun

On Thu, Feb 6, 2014 at 12:51 PM, Jay Kreps <jay.kr...@gmail.com> wrote:

Hey guys,

I wanted to kick off a quick discussion of metrics with respect to the new producer and consumer (and potentially the server).

At a high level, I think there are three approaches we could take:
1. Plain vanilla JMX
2. Use Coda Hale (AKA Yammer) Metrics
3. Do our own metrics (with JMX as one output)

1. This has the advantage that JMX is the most commonly used Java thing and plugs in reasonably to most metrics systems. JMX is included in the JDK, so it doesn't impose any additional dependencies on clients. It has the disadvantage that plain vanilla JMX is a pain to use. We would need a bunch of helper code for maintaining counters to make this reasonable.

2. Coda Hale metrics is pretty good and broadly used. It supports JMX output as well as direct output to many other types of systems. The primary downside we have had with Coda Hale has to do with client and library incompatibilities. We are currently on an older, more popular version. The newer version is a rewrite of the APIs and is incompatible. Originally these were totally incompatible and people had to choose one or the other. I think that has been improved, so now the new version is a totally different package. But even in this case you end up with both versions if you use Kafka and we are on a different version than you, which is going to be pretty inconvenient.

3. Doing our own has the downside of potentially reinventing the wheel, and potentially needing to work out any bugs in our code. The upside would depend on how good the reinvention was. As it happens, I did a quick (~900 loc) version of a metrics library that is under kafka.common.metrics. I think it has some advantages over the Yammer metrics package for our usage, beyond just not causing incompatibilities. I will describe this code so we can discuss the pros and cons. Although I favor this approach, I have no emotional attachment and wouldn't be too sad if I ended up deleting it. Here are javadocs for this code, though I haven't written much documentation yet since I might end up deleting it:

Here is a quick overview of this library.
There are three main public interfaces:
  Metrics - a repository of metrics being tracked
  Metric - a single, named numerical value being measured (i.e. a counter)
  Sensor - a thing that records values and updates zero or more metrics

So let's say we want to track four values about message sizes; specifically, say we want to record the average, the maximum, the total rate of bytes being sent, and a count of messages. Then we would do something like this:

    // setup code
    Metrics metrics = new Metrics(); // this is a global "singleton"
    Sensor sensor = metrics.sensor("kafka.producer.message.sizes");
    sensor.add("kafka.producer.message-size.avg", new Avg());
    sensor.add("kafka.producer.message-size.max", new Max());
    sensor.add("kafka.producer.bytes-sent-per-sec", new Rate());
    sensor.add("kafka.producer.message-count", new Count());

    // now when we get a message we do this
    sensor.record(messageSize);

The above code creates the global metrics repository, creates a single Sensor, and defines four named metrics that are updated by that Sensor.

Like Yammer Metrics (YM), I allow you to plug in "reporters", including a JMX reporter. Unlike the Coda Hale JMX reporter, the reporter I have keys off the metric names, not the Sensor names, which I think is an improvement -- I just use the convention that the last portion of the name is the attribute name, the second to last is the mbean name, and the rest is the package. So in the above example there is a message-size mbean (in the kafka.producer package) that has avg and max attributes, and a producer mbean (in the kafka package) that has bytes-sent-per-sec and message-count attributes. This is nice because you can logically group the values reported irrespective of where in the program they are computed -- that is, an mbean can logically group attributes computed off different sensors. This means you can report values by logical subsystem.

I also allow the concept of hierarchical Sensors, which I think is a good convenience. I have noticed a common pattern in systems where you need to roll up the same values along different dimensions. A simple example is metrics about qps, data rate, etc. on the broker. These we want to capture in aggregate, but also broken down by topic-id. You can do this purely by defining the sensor hierarchy:

    Sensor allSizes = metrics.sensor("kafka.producer.sizes");
    Sensor topicSizes = metrics.sensor("kafka.producer." + topic + ".sizes", allSizes);

Now each actual update will go to the appropriate topicSizes sensor (based on the topic name), but the allSizes metrics will get updated too. I also support multiple parents for each sensor, as well as multiple layers of hierarchy, so you can define a more elaborate DAG of sensors. An example of how this would be useful is if you wanted to record your metrics broken down by topic AND client id as well as the global aggregate.
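To make that multi-parent case concrete, here is a sketch of the topic-and-client-id roll-up -- assuming, as the single-parent example above suggests, that metrics.sensor() accepts parent sensors as trailing arguments:

    // Global aggregate at the root of the DAG.
    Sensor allSizes = metrics.sensor("kafka.producer.sizes");

    // One roll-up per dimension, each feeding the global aggregate.
    Sensor topicSizes = metrics.sensor("kafka.producer." + topic + ".sizes", allSizes);
    Sensor clientSizes = metrics.sensor("kafka.producer." + clientId + ".sizes", allSizes);

    // The leaf sensor has two parents: recording here updates the topic
    // roll-up, the client roll-up, and through them the global aggregate.
    Sensor topicAndClientSizes =
        metrics.sensor("kafka.producer." + topic + "." + clientId + ".sizes",
                       topicSizes, clientSizes);

    topicAndClientSizes.record(messageSize);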
Each metric can take a configurable Quota value, which allows us to limit the maximum value of that metric. This is intended for use on the server as part of our quota implementation. The way this works is that you record metrics as usual:

    mySensor.record(42.0);

However, if this event occurrence causes one of the metrics to exceed its maximum allowable value (the quota), this call will throw a QuotaViolationException. The cool thing about this is that it means we can define quotas on anything we capture metrics for, which I think is pretty cool.
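A sketch of what consuming that exception on the server might look like (throttleClient and the sensor name are hypothetical; only record() and QuotaViolationException are described above):

    try {
        requestRateSensor.record(requestSizeInBytes);
    } catch (QuotaViolationException e) {
        // A metric bound to this sensor exceeded its quota: delay or
        // reject the request instead of processing it immediately.
        throttleClient(clientId);
    }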
Another question is how to handle windowing of the values. Metrics want to record the "current" value, but the definition of current is inherently nebulous. One of the obvious gotchas is that if you define "current" to be a number of events, you can end up measuring an arbitrarily long window of time if the event rate is low (e.g. you think you are getting 50 messages/sec because that was the rate yesterday, when all events stopped).

Here is how I approach this. All the metrics use the same windowing approach. We define a single window by a length of time or a number of values (you can use either or both -- if both, the window ends when *either* the time bound or the event bound is hit). The typical problem with hard window boundaries is that at the beginning of the window you have no data and the first few samples are too small to be a valid sample. (Consider if you were keeping an avg and the first value in the window happens to be very, very high; if you check the avg at this exact time you will conclude the avg is very high, but on a sample size of one.) One simple fix would be to always report the last complete window, but this is not appropriate here because (1) we want to drive quotas off it, so it needs to be current, and (2) since this is for monitoring, you kind of care more about the current state. The ideal solution would be to define a backwards-looking sliding window from the present, but many statistics are actually very hard to compute in this model without retaining all the values, which would be hopelessly inefficient. My solution is to keep a configurable number of windows (the default is two) and combine them for the estimate. So in the two-sample case, depending on when you ask, you have between one and two complete samples worth of data to base the answer on. Provided the sample window is large enough to get a valid result, this satisfies both of my criteria: incorporating the most recent data and having reasonable variance at all times.
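A minimal sketch of the combined-windows idea for an average -- illustrative only, not the actual kafka.common.metrics code:

    // Values are recorded into the current window; reads combine the
    // current and previous windows, so the answer is always backed by
    // between one and two windows' worth of data.
    public class TwoWindowAvg {
        private final long windowMs;
        private long windowStart;
        private double curSum, prevSum;
        private long curCount, prevCount;

        public TwoWindowAvg(long windowMs, long now) {
            this.windowMs = windowMs;
            this.windowStart = now;
        }

        public void record(double value, long now) {
            maybeRoll(now);
            curSum += value;
            curCount++;
        }

        public double measure(long now) {
            maybeRoll(now);
            long n = curCount + prevCount;
            return n == 0 ? Double.NaN : (curSum + prevSum) / n;
        }

        // When the current window expires it becomes the "previous"
        // window and a fresh current window starts.
        private void maybeRoll(long now) {
            if (now - windowStart >= 2 * windowMs) {
                // Both windows expired with no activity: start fresh.
                prevSum = 0.0; prevCount = 0;
                curSum = 0.0; curCount = 0;
                windowStart = now;
            } else if (now - windowStart >= windowMs) {
                prevSum = curSum; prevCount = curCount;
                curSum = 0.0; curCount = 0;
                windowStart = now;
            }
        }
    }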
Another approach is to use an exponential weighting scheme to combine all history but emphasize the recent past. I have not done this, as it has a lot of issues for practical operational metrics. I'd be happy to elaborate on this if anyone cares...

The window size for metrics has a global default which can be overridden at either the sensor or individual metric level.

In addition to these time-series values, the user can directly expose some method of their choosing, JMX-style, by implementing the Measurable interface and registering that value. E.g.:

    metrics.addMetric("my.metric", new Measurable() {
        public double measure(MetricConfig config, long now) {
            return calculateValueToExpose();
        }
    });

This is useful for exposing things like the accumulator free memory.

The set of metrics is extensible; new metrics can be added by just implementing the appropriate interfaces and registering with a sensor. I implement the following metrics:
  total - the sum of all values from the given sensor
  count - a windowed count of values from the sensor
  avg - the sample average within the windows
  max - the max over the windows
  min - the min over the windows
  rate - the rate in the windows (e.g. the total or count divided by the elapsed time)
  percentiles - a collection of percentiles computed over the window

My approach to percentiles is a little different from the Yammer metrics package. My complaint about the Yammer metrics approach is that it uses rather expensive sampling and uses kind of a lot of memory to get a reasonable sample. This is problematic for per-topic measurements.

Instead, I use a fixed range for the histogram (e.g. 0.0 to 30000.0), which directly allows you to specify the desired memory use. Any value below the minimum is recorded as -Infinity and any value above the maximum as +Infinity. I think this is okay, as all metrics have an expected range except for latency, which can be arbitrarily large; but for very high latency there is no need to model it exactly (e.g. 30 seconds+ really is effectively infinite). Within the range, values are recorded in buckets which can be either fixed width or increasing width. The increasing width is analogous to the idea of significant figures: if your value is in the range 0-10 you might want to be accurate to within 1ms, but if it is 20000 there is no need to be so accurate. I implemented a linear bucket sizing where the Nth bucket has width proportional to N. An exponential bucket sizing would also be sensible and could likely be derived directly from the floating-point representation of the value.
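To make the linear scheme concrete, a sketch of the bucket math (illustrative, not the actual implementation): if bucket N has width proportional to N, the upper boundary of bucket N grows like N(N+1)/2, so the value-to-bucket lookup just inverts that quadratic.

    // Bucket n has width proportional to n, so absolute precision
    // degrades gracefully as values get larger.
    public class LinearBuckets {
        private final int buckets;
        private final double max;
        private final double unit; // width of the first bucket

        public LinearBuckets(double max, int buckets) {
            this.buckets = buckets;
            this.max = max;
            // Widths 1, 2, ..., buckets (in units) must sum to max:
            // unit * buckets * (buckets + 1) / 2 == max.
            this.unit = max / (buckets * (buckets + 1) / 2.0);
        }

        // Map a value to its bucket index by inverting the quadratic
        // boundary formula: bucket k starts at unit * k * (k + 1) / 2.
        public int toBucket(double value) {
            if (value < 0) return -1;              // below range: -Infinity
            if (value >= max) return buckets - 1;  // above range: +Infinity
            double t = value / unit;               // largest k with k(k+1)/2 <= t
            int k = (int) ((Math.sqrt(1 + 8 * t) - 1) / 2.0);
            return Math.min(k, buckets - 1);
        }
    }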
I'd like to get some feedback on this metrics code and make a decision on whether we want to use it before I actually go ahead and add all the instrumentation in the code (otherwise I'll have to redo it if we switch approaches). So the next topic of discussion will be which actual metrics to add.

-Jay