Jun, Thank you, your suggestions sound good. I have updated the KIP.
Regards, Rajini On Tue, Aug 29, 2017 at 9:12 PM, Jun Rao <j...@confluent.io> wrote: > Hi, Rajini, > > Thanks for the updated KIP. I agree that those additional metrics can be > useful. I was thinking what would an admin do if the value of one of > those metrics is abnormal. An admin probably want to determine which client > causes the abnormaly. So, a couple of more comments below. > > 10. About FetchDownConversionsMs. Should we model it at the topic level or > at the request level as a new latency component like messageConversionTime > in addition to the existing localTime, requestQueueTime, etc? The benefit > of the latter is that we can include it in the request log and use it to > figure out which client is doing the conversion. Also, should we also track > the conversion time in the producer? > > 11. About ProduceBatchSize. Currently, the largest chunk of memory is > allocated at the request level (mostly produce request). So, instead > of ProduceBatchSize > , perhaps we can add a request size metric for each type of request and > also include it in the request log. This way, if there is a memory issue, > we can trace it back to a particular client. Similarly, for > ProduceUncompressedBatchSize, > would it be better to track it at the request level as something like > temporaryMemorySize and include it in the request log? > > Thanks, > > Jun > > On Tue, Aug 29, 2017 at 10:07 AM, Roger Hoover <roger.hoo...@gmail.com> > wrote: > > > Great suggestions, Ismael and thanks for incorporating them, Rajini. > > > > Tracking authentication success and failures (#3) across listeners seems > > very useful for cluster administrators to identify misconfigured client > or > > bad actors, especially until all clients implement KIP-152 which will add > > an explicit error code for authentication failures. Currently, clients > > just get disconnected so it's hard to distinguish authentication failures > > from any other error that can cause disconnect. This broker-side metric > is > > useful regardless but can help fill this gap until all clients support > KIP > > 152. > > > > Just to be clear, the ones called `successful-authentication-rate` and > > `failed-authentication-rate` will also have failed-authentication-count > > and successful-authentication-count to match KIP 187? > > > > On Tue, Aug 29, 2017 at 7:26 AM, Rajini Sivaram <rajinisiva...@gmail.com > > > > wrote: > > > > > Hi Ismael, > > > > > > Thank you for the suggestions. The additional metrics sound very useful > > and > > > I have added them to the KIP. > > > > > > Regards, > > > > > > Rajini > > > > > > On Tue, Aug 29, 2017 at 5:34 AM, Ismael Juma <ism...@juma.me.uk> > wrote: > > > > > > > Hi Rajini, > > > > > > > > There are a few other metrics that could potentially be useful. I'd > be > > > > interested in what you and the community thinks: > > > > > > > > 1. The KIP currently includes `FetchDownConversionsPerSec`, which is > > > > useful. In the common case, one would want to avoid down conversion > by > > > > using the lower message format supported by most of the consumers. > > > However, > > > > there are good reasons to use a newer message format even if there > are > > > some > > > > legacy consumers around. It would be good to quantify the cost of > these > > > > consumers a bit more clearly. Looking at the request metric > > `LocalTimeMs` > > > > provides a hint, but it may be useful to have a dedicated > > > > `FetchDownConversionsMs` metric. > > > > > > > > 2. Large messages can cause GC issues (it's particularly bad if fetch > > > down > > > > conversion takes place). One can currently configure the max message > > > batch > > > > size per topic to keep this under control, but that is the size after > > > > compression. However, we decompress the batch to validate produce > > > requests > > > > and we decompress and recompress during fetch downconversion). It > would > > > be > > > > helpful to have topic metrics for the produce message batch size > after > > > > decompression (and perhaps compressed as well as that would help > > > understand > > > > the compression ratio). > > > > > > > > 3. Authentication success/failures per second. This is helpful to > > > > understand if some clients are misconfigured or if bad actors are > > trying > > > to > > > > authenticate. > > > > > > > > Thoughts? > > > > > > > > Ismael > > > > > > > > > > > > > > > > On Wed, Aug 23, 2017 at 2:53 AM, Jun Rao <j...@confluent.io> wrote: > > > > > > > > > Hi, Rajini, > > > > > > > > > > Yes, if those error metrics are registered dynamically, we could > > worry > > > > > about expiration later. > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Fri, Aug 18, 2017 at 1:55 AM, Rajini Sivaram < > > > rajinisiva...@gmail.com > > > > > > > > > > wrote: > > > > > > > > > > > Perhaps we could register dynamically for now. If we find that > the > > > cost > > > > > of > > > > > > retaining these is high, we can add the code to expire them > later. > > Is > > > > > that > > > > > > ok? > > > > > > > > > > > > Regards, > > > > > > > > > > > > Rajini > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 18, 2017 at 9:41 AM, Ismael Juma <ism...@juma.me.uk> > > > > wrote: > > > > > > > > > > > > > Can we quantify the cost of having these metrics around if they > > are > > > > > > > dynamically registered? Given that the maximum is bounded at > > > > > development > > > > > > > time, is it really worth all the extra code? > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > On Fri, Aug 18, 2017 at 9:34 AM, Rajini Sivaram < > > > > > rajinisiva...@gmail.com > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > Jun, > > > > > > > > > > > > > > > > It feels more consistent to add errors as yammer metrics > > similar > > > to > > > > > > other > > > > > > > > request metrics. Perhaps we could add some code to track and > > > remove > > > > > > these > > > > > > > > if unused? It is a bit more work, but it would keep the > > externals > > > > > > > > consistent. > > > > > > > > > > > > > > > > Ismael/Manikumar, > > > > > > > > > > > > > > > > Agree that version as a String attribute makes more sense. > > > > > > Unfortunately, > > > > > > > > the whole KafkaMetric implementation is written around a > single > > > > > > "double" > > > > > > > > type, so introducing a new type is a big change. But I > suppose > > it > > > > can > > > > > > be > > > > > > > > done. I have updated the KIP. > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > Rajini > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 18, 2017 at 7:42 AM, Manikumar < > > > > > manikumar.re...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > I agree it will be good if we can add "commit id/version" > as > > > an > > > > > > > > > attribute value. > > > > > > > > > It will be easy to parse. But as of now, KafkaMetric > supports > > > > only > > > > > > > > > numerical values. > > > > > > > > > > > > > > > > > > On Fri, Aug 18, 2017 at 5:49 AM, Ismael Juma < > > > ism...@juma.me.uk> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Hi Rajini, > > > > > > > > > > > > > > > > > > > > About the gauges, I was thinking that the attribute would > > be > > > > the > > > > > > > value > > > > > > > > > > (i.e. commit id or version). I understand that Kafka > > Metrics > > > > > > doesn't > > > > > > > > > > support this (unlike Yammer Metrics), but would it make > > sense > > > > to > > > > > > add? > > > > > > > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > > > > > > > On Thu, Aug 17, 2017 at 2:54 PM, Rajini Sivaram < > > > > > > > > rajinisiva...@gmail.com > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hi Ismael, > > > > > > > > > > > > > > > > > > > > > > Thank you for the review. > > > > > > > > > > > > > > > > > > > > > > 1. Agree on keeping it simple with dynamic registration > > and > > > > no > > > > > > > > expiry. > > > > > > > > > > Will > > > > > > > > > > > wait for Jun's feedback before updating KIP. > > > > > > > > > > > 2. I have switched to two metrics for commit-id and > > version > > > > > (not > > > > > > > sure > > > > > > > > > if > > > > > > > > > > it > > > > > > > > > > > matches what you meant). I also added the client-id tag > > > which > > > > > is > > > > > > > used > > > > > > > > > in > > > > > > > > > > > all metrics from clients. > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > > > Rajini > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 17, 2017 at 10:55 AM, Ismael Juma < > > > > > ism...@juma.me.uk > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP, Rajini. I think this is helpful > > too. > > > A > > > > > few > > > > > > > > minor > > > > > > > > > > > > comments. > > > > > > > > > > > > > > > > > > > > > > > > 1. About the number of metrics and expiration, if we > > > > > > dynamically > > > > > > > > > > register > > > > > > > > > > > > metrics for the error codes, the number is likely to > be > > > > much > > > > > > > lower > > > > > > > > > than > > > > > > > > > > > > 30*30, probably less than 100. If we were using Kafka > > > > Metrics > > > > > > for > > > > > > > > > this, > > > > > > > > > > > we > > > > > > > > > > > > could easily add a long expiration period to be > > > > conservative, > > > > > > > but I > > > > > > > > > am > > > > > > > > > > > not > > > > > > > > > > > > sure this is supported by Yammer Metrics. If it is > not, > > > > > there's > > > > > > > an > > > > > > > > > > > argument > > > > > > > > > > > > for keeping it simple. > > > > > > > > > > > > > > > > > > > > > > > > 2. Would it make sense to use 2 gauges for the > version > > > and > > > > > > commit > > > > > > > > id? > > > > > > > > > > It > > > > > > > > > > > > seems more intuitive than having those values as > tags. > > > > > > > > > > > > > > > > > > > > > > > > Ismael > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 17, 2017 at 10:19 AM, Rajini Sivaram < > > > > > > > > > > > rajinisiva...@gmail.com> > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > Hi Jun, > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for the review. > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Makes sense. I have updated the KIP. > > > > > > > > > > > > > 2. Moved to a new group ZooKeeperClient > > > > > > > > > > > > > 3. It is a gauge, so it will have a single > attribute > > > > called > > > > > > > Value > > > > > > > > > > with > > > > > > > > > > > a > > > > > > > > > > > > > constant value of 1. > > > > > > > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > > > > > > > > > > > > > > > > > Rajini > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Aug 17, 2017 at 3:16 AM, Jun Rao < > > > > j...@confluent.io > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi, Rajini, > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks for the KIP. A few comments. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. We have 30+ requests and 30+ error code and > > > growing. > > > > > So, > > > > > > > the > > > > > > > > > > > > > combination > > > > > > > > > > > > > > can be large. Perhaps it's useful to expire an > > error > > > > > metric > > > > > > > if > > > > > > > > > it's > > > > > > > > > > > no > > > > > > > > > > > > > > longer updated after some time? We did something > > > > similar > > > > > > for > > > > > > > > the > > > > > > > > > > > quota > > > > > > > > > > > > > > metric. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. It's a bit weird to put the ZK latency metric > > > under > > > > > > > > > > > > > > type=SessionExpireListener. > > > > > > > > > > > > > > Perhaps it's more intuitive to put it in a > separate > > > > type. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. For the client version metric, since we > > > representing > > > > > > > > commit_id > > > > > > > > > > and > > > > > > > > > > > > > > version as tags in the metric name. So the mbean > > will > > > > > have > > > > > > no > > > > > > > > > > > > attributes? > > > > > > > > > > > > > > > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 4:05 PM, Roger Hoover < > > > > > > > > > > > roger.hoo...@gmail.com> > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I think it would useful to make clear somewhere > > for > > > > > each > > > > > > > > > metric, > > > > > > > > > > > the > > > > > > > > > > > > > > level > > > > > > > > > > > > > > > at which it's counted. I don't know all the > > > details > > > > of > > > > > > the > > > > > > > > > Kafka > > > > > > > > > > > > > > protocol > > > > > > > > > > > > > > > but it might be something like > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ProduceRequest, Fetch Request - counted at > > > > > per-partition > > > > > > > > level > > > > > > > > > > > > > > > All other requests are 1:1 with client > requests? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cheers, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Roger > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 4:02 PM, Roger Hoover < > > > > > > > > > > > > roger.hoo...@gmail.com> > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Rajini, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for the KIP. These are very > helpful > > > > > > additions. > > > > > > > > > One > > > > > > > > > > > > > question > > > > > > > > > > > > > > > on > > > > > > > > > > > > > > > > the error code metrics: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Will the total error counting happen at the > the > > > > level > > > > > > of > > > > > > > > > topic > > > > > > > > > > > > > > partition? > > > > > > > > > > > > > > > > For example, if a single ProduceRequest > > contains > > > > > > messages > > > > > > > > to > > > > > > > > > > > append > > > > > > > > > > > > > to > > > > > > > > > > > > > > 3 > > > > > > > > > > > > > > > > partitions and say all 3 appends are > > successful, > > > > the > > > > > > > > counter > > > > > > > > > > > > > > > > for kafka.network:type=RequestMetrics,name= > > > > > > > > > > ErrorsPerSec,request= > > > > > > > > > > > > > > > ProduceRequest,error=0 > > > > > > > > > > > > > > > > will be incremented by 3? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Roger > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, Aug 16, 2017 at 12:10 PM, Rajini > > Sivaram > > > < > > > > > > > > > > > > > > > rajinisiva...@gmail.com> > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >> I have created a KIP to add some additional > > > > metrics > > > > > to > > > > > > > > > support > > > > > > > > > > > > > health > > > > > > > > > > > > > > > >> checks: > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> https://cwiki.apache.org/ > > > > > > confluence/display/KAFKA/KIP- > > > > > > > > > 188+-+ > > > > > > > > > > > > > > > >> Add+new+metrics+to+support+health+checks > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> Feedback and suggestions are welcome. > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> Regards, > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> Rajini > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >