On Wed, May 3, 2017 at 10:54 AM, Onur Karaman <onurkaraman.apa...@gmail.com> wrote:
> Regarding the ControllerState and the potential for overlap, I think it
> depends on our definition of controller state. While KAFKA-5028 allows
> only a single ControllerEvent to be processed at a time, it still allows
> interleavings for long-lasting actions like partition reassignment and
> topic deletion. For instance, a topic can get created while another
> topic is undergoing partition reassignment. In that sense, there is
> overlap. However, in the sense of the ControllerEvents being processed,
> there can be no overlap.

Yes - that is roughly what I was thinking (although deletes are no longer
long-running). Also, what is the "steady-state" controller state? Idle?
What about a broker that is not the controller? Would you need a separate
idle-not-controller state? Given that most of the state changes are
short, we would just see blips in the best case and nothing in the worst
case (depending on how often metrics get sampled). It would only help if
you want to visually detect any transitions that are taking an inordinate
duration.

> 1. Yes, the long-term goal is to migrate the metrics on the broker to
> kafka-metrics. Since many people are using Yammer reporters, we
> probably need to support a few popular ones in kafka-metrics before
> migrating. Until that happens, we probably want to stick with the
> Yammer metrics on the server side unless we depend on features from
> kafka-metrics (e.g., quotas).

Ok - my thought was that since we are already using kafka-metrics for
quotas and selector metrics, we could just do the same for this (and any
*new* metrics on the broker).

> 4. Metrics #2 and #3. The issue with relying on metric #1 is that the
> latter is sensitive to the frequency of metric collection. For example,
> if the starting of the controller takes 30 secs and the metric is only
> collected once a minute, one may not know the latency with just metric
> #1, but will know the latency with metrics #2 and #3. Are you concerned
> about the memory overhead of histograms? It doesn't seem that a couple
> more histograms will hurt.

No, I don't have concerns about the histograms - just wondering if they
are useful enough to have in the first place, but your summary makes
sense.

Joel

>> > Hi, Ismael,
>> >
>> > Thanks for the proposal. A couple more comments:
>> >
>> > 10. It would be useful to add a new metric for the controller queue
>> > size:
>> > kafka.controller:type=ControllerStats,name=QueueSize
>> >
>> > 11. It would also be useful to know how long an event waits in the
>> > controller queue before being processed. Perhaps we can add a
>> > histogram metric like the following:
>> > kafka.controller:type=ControllerStats,name=QueueTimeMs
>> >
>> > Jun
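For concreteness, here is a rough sketch of what Jun's items 10 and 11
could look like, together with a ControllerState gauge that answers the
steady-state question above with an explicit Idle value (0), reported
identically by a broker that is not the controller. It assumes the
single-threaded event queue from KAFKA-5028 and uses the existing
KafkaMetricsGroup wrapper around Yammer metrics; all names and types
here are illustrative, not part of the KIP:

    import java.util.concurrent.LinkedBlockingQueue
    import com.yammer.metrics.core.Gauge
    import kafka.metrics.KafkaMetricsGroup

    object ControllerState extends Enumeration {
      // Idle is the steady state; a broker that is not the controller
      // also reports Idle, so no separate idle-not-controller state is
      // needed.
      val Idle, ControllerChange, TopicChange, PartitionReassignment,
          TopicDeletion = Value
    }

    trait ControllerEvent {
      // Recorded when the event is created/enqueued, so queue time can
      // be computed when the event is dequeued.
      val enqueueTimeMs: Long = System.currentTimeMillis()
      def state: ControllerState.Value
      def process(): Unit
    }

    class ControllerEventManager extends KafkaMetricsGroup {
      private val queue = new LinkedBlockingQueue[ControllerEvent]()
      @volatile private var state = ControllerState.Idle

      // 10. Current size of the controller event queue.
      newGauge("QueueSize", new Gauge[Int] { def value: Int = queue.size })

      // Metric #1: the state the controller is currently in, exposed as
      // the ordinal of the enum value (Idle == 0).
      newGauge("ControllerState", new Gauge[Int] { def value: Int = state.id })

      // 11. How long an event waited in the queue before being processed.
      private val queueTimeHist = newHistogram("QueueTimeMs")

      def put(event: ControllerEvent): Unit = queue.put(event)

      // Called in a loop by the single event-handling thread.
      def processNext(): Unit = {
        val event = queue.take()
        queueTimeHist.update(System.currentTimeMillis() - event.enqueueTimeMs)
        state = event.state
        try event.process()
        finally state = ControllerState.Idle
      }
    }

Sampling ControllerState would then show Idle (0) most of the time,
which matches the point above that short transitions may only show up
as blips, if at all.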
>> > On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy <jjkosh...@gmail.com>
>> > wrote:
>> >
>> > > Thanks for the KIP - a couple of comments:
>> > > - Do you intend to actually use yammer metrics, or use
>> > > kafka-metrics and split the timer into an explicit rate and time?
>> > > I think long term we ought to move off yammer and use
>> > > kafka-metrics only. Actually, either is fine, but we should
>> > > ideally use only one in the long term - and I thought the plan
>> > > was to use kafka-metrics.
>> > > - Metric #9 appears to be redundant since we already have per-API
>> > > request rate and time metrics.
>> > > - Same for metrics #4 and #5 (as there are request stats for
>> > > DeleteTopicRequest - although it is possible for users to trigger
>> > > deletes via ZK).
>> > > - Metrics #2 and #3 are potentially useful, but a bit overkill
>> > > for a histogram. An alternative is to stick to the last known
>> > > value, but that doesn't play well with alerts if a high value
>> > > isn't reset/decayed. Perhaps metric #1 would be sufficient to
>> > > gauge slow start/resignation transitions.
>> > > - Metric #1 - some of the states may actually overlap.
>> > > - I don't actually understand the semantics of metric #6. Is it
>> > > the rate of partition reassignment triggers? Does the number of
>> > > partitions matter?
>> > >
>> > > Joel
>> > >
>> > > On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford
>> > > <tcrayf...@heroku.com> wrote:
>> > >
>> > >> Ismael,
>> > >>
>> > >> Great, that sounds lovely.
>> > >>
>> > >> I'd like a `Timer` (using yammer metrics parlance) over how long
>> > >> it took to process the event, so we can get at p99 and max times
>> > >> spent processing things. Maybe we could even log at warning
>> > >> level if event processing takes over some timeout?
>> > >>
>> > >> Thanks
>> > >>
>> > >> Tom
>> > >>
>> > >> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma <ism...@juma.me.uk>
>> > >> wrote:
>> > >>
>> > >> > Hi Tom,
>> > >> >
>> > >> > Yes, the plan is to merge KAFKA-5028 first and then use a
>> > >> > lock-free approach for the new metrics. I considered
>> > >> > mentioning that in the KIP given KAFKA-5120, but didn't in the
>> > >> > end. I'll add it to make it clear.
>> > >> >
>> > >> > Regarding locks, they are removed by KAFKA-5028, as you say.
>> > >> > So, if I understand correctly, you are suggesting an event
>> > >> > processing rate metric with the event type as a tag? Onur and
>> > >> > Jun, what do you think?
>> > >> >
>> > >> > Ismael
>> > >> >
>> > >> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford
>> > >> > <tcrayf...@heroku.com> wrote:
>> > >> >
>> > >> > > Hi,
>> > >> > >
>> > >> > > We (Heroku) are very excited about this KIP, as we've
>> > >> > > struggled a bit with controller stability recently. Having
>> > >> > > these additional metrics would be wonderful.
>> > >> > >
>> > >> > > I'd like to ensure polling these metrics *doesn't* hold any
>> > >> > > locks etc., because, as noted in
>> > >> > > https://issues.apache.org/jira/browse/KAFKA-5120, that lock
>> > >> > > can be held for quite some time. This may no longer be an
>> > >> > > issue as of KAFKA-5028, though.
>> > >> > >
>> > >> > > Lastly, I'd love to see some metrics around how long the
>> > >> > > controller spends inside its lock. We've been tracking an
>> > >> > > issue (https://issues.apache.org/jira/browse/KAFKA-5116)
>> > >> > > where it can hold the lock for many, many minutes in a zk
>> > >> > > client listener thread when responding to a single request.
>> > >> > > I'm not sure how that plays into
>> > >> > > https://issues.apache.org/jira/browse/KAFKA-5028 (which I
>> > >> > > assume will land before this metrics patch), but it feels
>> > >> > > like there will be equivalent problems ("how long does it
>> > >> > > spend processing any individual message from the queue,
>> > >> > > broken down by message type").
>> > >> > >
>> > >> > > These are minor improvements though; the addition of more
>> > >> > > metrics to the controller is already going to be very
>> > >> > > helpful.
>> > >> > >
>> > >> > > Thanks
>> > >> > >
>> > >> > > Tom Crayford
>> > >> > > Heroku Kafka
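A similarly rough sketch of the per-event Timer Tom asks for, with a
warn-level log when processing exceeds a timeout, plus a per-event-type
rate meter along the lines Ismael raises. The threshold and all names
are made up for illustration, and ControllerEvent is the trait from the
sketch upthread:

    import java.util.concurrent.TimeUnit
    import scala.collection.mutable
    import com.yammer.metrics.core.Meter
    import kafka.metrics.KafkaMetricsGroup
    import kafka.utils.Logging

    class ControllerEventProcessor extends KafkaMetricsGroup with Logging {
      // Yammer Timer giving a rate plus a latency histogram (p99, max, ...).
      private val processTimer =
        newTimer("EventProcessTimeMs", TimeUnit.MILLISECONDS, TimeUnit.SECONDS)

      // Hypothetical threshold; would presumably be made configurable.
      private val warnThresholdMs = 5000L

      // One meter per event type, with the type exposed as a tag.
      // Only touched from the single event-handling thread.
      private val eventTypeMeters = mutable.Map.empty[String, Meter]

      def process(event: ControllerEvent): Unit = {
        val startNs = System.nanoTime()
        try event.process()
        finally {
          val elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNs)
          processTimer.update(elapsedMs, TimeUnit.MILLISECONDS)
          val eventType = event.getClass.getSimpleName
          eventTypeMeters
            .getOrElseUpdate(eventType,
              newMeter("EventRate", "events", TimeUnit.SECONDS,
                Map("eventType" -> eventType)))
            .mark()
          if (elapsedMs > warnThresholdMs)
            warn(s"Controller event $eventType took ${elapsedMs}ms to process")
        }
      }
    }

Breaking the timer itself down by event type would multiply histograms,
so this only tags the rate meter; whether that split is worth it is the
open question above.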
>> > >> > > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma
>> > >> > > <ism...@juma.me.uk> wrote:
>> > >> > >
>> > >> > > > Hi all,
>> > >> > > >
>> > >> > > > We've posted "KIP-143: Controller Health Metrics" for
>> > >> > > > discussion:
>> > >> > > >
>> > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-143%3A+Controller+Health+Metrics
>> > >> > > >
>> > >> > > > Please take a look. Your feedback is appreciated.
>> > >> > > >
>> > >> > > > Thanks,
>> > >> > > > Ismael