Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-11 Thread Becket Qin
Hi Ismael, Yes, a follow up KIP after the controller code settles down sounds good. Thanks, Jiangjie (Becket) Qin On Wed, May 10, 2017 at 6:11 PM, Ismael Juma wrote: > Thanks Jun. Discussed this offline with Onur and Jun and I believe there's > agreement so updated the KIP: > > https://cwiki.

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-10 Thread Ismael Juma
Thanks Jun. Discussed this offline with Onur and Jun and I believe there's agreement so updated the KIP: https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action? pageId=69407758&selectedPageVersions=8&selectedPageVersions=7 Ismael On Wed, May 10, 2017 at 4:46 PM, Jun Rao wrote: > H

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-10 Thread Jun Rao
Hi, Onur, We probably don't want to do the 1-to-1 mapping from the event type to the controller state since some of the event types are implementation details. How about the following mapping? 0 - idle 1 - controller change (Startup, ControllerChange, Reelect) 2 - broker change (BrokerChange) 3 -

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-09 Thread Onur Karaman
@Ismael, Jun After bringing up an earlier point twice now, it still doesn't feel like it's been commented on/addressed, so I'm going to give it one more shot: Assuming that ControllerState should reflect the current event being processed, the KIP is missing states. The controller currently has 14

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-09 Thread Ismael Juma
Becket, Are you OK with extending the metrics via a subsequent KIP (assuming that what we're doing here doesn't prevent that)? The KIP freeze is tomorrow (although I will give it an extra day or two as many in the community have been attending the Kafka Summit this week), so we should avoid increa

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-09 Thread Jun Rao
Hi, Becket, q10. The reason why there is not a timer metric for broker change event is that the controller currently always has a LeaderElectionRateAndTimeMs timer metric (in ControllerStats). q11. I agree that that it's useful to know the queue time in the controller event queue and suggested th

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Becket Qin
@Ismael, About the stage and event type. Yes, I think each event handling should have those stages covered. It is similar to what we are doing for the requests on the broker side. We have benefited from such systematic metric structure a lot so I think it would be worth following the same way in t

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Onur Karaman
I had a similar comment to Becket but accidentally posted it on the vote thread last Friday. From that thread: "I noticed that both the ControllerState metric and the *RateAndTimeMs metrics only cover a subset of the controller event types. Was this intentional?" I think it makes most sense to jus

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Ismael Juma
Hi Becket, Thanks for the feedback. Comments inline. On Tue, May 9, 2017 at 12:19 AM, Becket Qin wrote: > > q10. With event loop based controller design, it seems natrural to have the > processing time for each controller event type. In that case, the current > metrics seem not covering all the

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Becket Qin
Hi Ismael and Jun, Thanks for the KIP. It is very useful. A couple of comments: q10. With event loop based controller design, it seems natrural to have the processing time for each controller event type. In that case, the current metrics seem not covering all the event processing time? e.g. (brok

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-08 Thread Joel Koshy
Hi Ismael, > > What about a broker that is not the controller? Would you need a separate > > idle-not-controller state? > > > Do we need a separate state or can users just use the ActiveControllerCount > metric to check if the broker is the controller? > Sure - the ACC metric should be sufficien

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-04 Thread Ismael Juma
Hi Joel, Thanks for the feedback. Comments inline. On Wed, May 3, 2017 at 7:52 PM, Joel Koshy wrote: > Yes - that is roughly what I was thinking (although deletes are no longer > long running). Also, what is the "steady-state" controller state? Idle? > Yes. > What about a broker that is not

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-04 Thread Ismael Juma
Hi Onur, Comments inline. On Wed, May 3, 2017 at 6:54 PM, Onur Karaman wrote: > Regarding the ControllerState and the potential for overlap, I think it > depends on our definition of controller state. While KAFKA-5028 allows only > a single ControllerEvent to be processed at a time, it still al

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-03 Thread Joel Koshy
On Wed, May 3, 2017 at 10:54 AM, Onur Karaman wrote: > Regarding the ControllerState and the potential for overlap, I think it > depends on our definition of controller state. While KAFKA-5028 allows only > a single ControllerEvent to be processed at a time, it still allows > interleavings for lo

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-03 Thread Onur Karaman
Regarding the ControllerState and the potential for overlap, I think it depends on our definition of controller state. While KAFKA-5028 allows only a single ControllerEvent to be processed at a time, it still allows interleavings for long-lasting actions like partition reassignment and topic deleti

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-03 Thread Ismael Juma
Thanks for the feedback Tom, Joel and Jun. I updated the KIP in the following way: 1. Removed ControlledShutdownRateAndTimeMs 2. Added QueueSize and QueueTimeMs 3. Renamed FailedIsrUpdateRate to FailedIsrUpdatesPerSec for consistency with other metrics in the Partition class 4. Mentioned that Yam

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-05-01 Thread Jun Rao
Hi, Joel, 1. Yes, the long term goal is to migrate the metrics on the broker to kafka-metrics. Since many people are using Yammer reporters, we probably need to support a few popular ones in kafka-metrics before migrating. Until that happens, we probably want to stick with the Yammer metrics on th

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Joel Koshy
Thanks for the KIP - couple of comments: - Do you intend to actually use yammer metrics? or use kafka-metrics and split the timer into an explicit rate and time? I think long term we ought to move off yammer and use kafka-metrics only. Actually either is fine, but we should ideally use only one in

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Tom Crayford
Ismael, Great, that sounds lovely. I'd like a `Timer` (using yammer metrics parlance) over how long it took to process the event, so we can get at p99 and max times spent processing things. Maybe we could even do a log at warning level if event processing takes over some timeout? Thanks Tom On

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Ismael Juma
Hi Tom, Yes, the plan is to merge KAFKA-5028 first and then use a lock-free approach for the new metrics. I considered mentioning that in the KIP given KAFKA-5120, but didn't in the end. I'll add it to make it clear. Regarding locks, they are removed by KAFKA-5028, as you say. So, if I understan

Re: [DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Tom Crayford
Hi, We (Heroku) are very excited about this KIP, as we've struggled a bit with controller stability recently. Having these additional metrics would be wonderful. I'd like to ensure polling these metrics *doesn't* hold any locks etc, because, as noted in https://issues.apache.org/jira/browse/KAFKA

[DISCUSS] KIP-143: Controller Health Metrics

2017-04-27 Thread Ismael Juma
Hi all, We've posted "KIP-143: Controller Health Metrics" for discussion: https://cwiki.apache.org/confluence/display/KAFKA/KIP-143%3A+Controller+Health+Metrics Please take a look. Your feedback is appreciated. Thanks, Ismael