Re: [DISCUSS] KIP-143: Controller Health Metrics

Onur Karaman Wed, 03 May 2017 10:54:48 -0700

Regarding the ControllerState and the potential for overlap, I think it
depends on our definition of controller state. While KAFKA-5028 allows only
a single ControllerEvent to be processed at a time, it still allows
interleavings for long-lasting actions like partition reassignment and
topic deletion. For instance, a topic can get created while another topic
is undergoing partition reassignment. In that sense, there is overlap.
However, in the sense of the ControllerEvents being processed, there can be
no overlap.


I also think adding the QueueSize and QueueTimeMs metrics could be a
premature move. I completely agree that these metrics would be valuable
given KAFKA-5028. However, I'm not sure whether the controller event queue
and controller thread as implemented today is actually here to stay or if
it's merely a first step in the controller redesign. Especially when
considering the possibility of moving away from the simple synchronous
zookeeper apis and having better control over handling zookeeper
disconnects and session expirations, it's possible that the queue and
thread could actually get ripped out of the controller and being part of
something more general, making these two controller-level metrics invalid.

On Wed, May 3, 2017 at 7:55 AM, Ismael Juma <[email protected]> wrote:

> Thanks for the feedback Tom, Joel and Jun.
>
> I updated the KIP in the following way:
>
> 1. Removed ControlledShutdownRateAndTimeMs
> 2. Added QueueSize and QueueTimeMs
> 3. Renamed FailedIsrUpdateRate to FailedIsrUpdatesPerSec for consistency
> with other metrics in the Partition class
> 4. Mentioned that Yammer metrics will be used
> 5. Noted that no Controller locks are acquired when retrieving metric
> values
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?
> pageId=69407758&selectedPageVersions=6&selectedPageVersions=4
>
> If there are no additional concerns, I will start a vote tomorrow.
>
> Ismael
>
> On Tue, May 2, 2017 at 2:35 AM, Jun Rao <[email protected]> wrote:
>
> > Hi, Joel,
> >
> > 1. Yes, the long term goal is to migrate the metrics on the broker to
> > kafka-metrics. Since many people are using Yammer reporters, we probably
> > need to support a few popular ones in kafka-metrics before migrating.
> Until
> > that happens, we probably want to stick with the Yammer metrics on the
> > server side unless we depend on features from kafka-metrics (e.g, quota).
> >
> > 2. Thanks for Onur, we now have moved to a single threaded model. So,
> none
> > of the event in metric #1 will overlap.
> >
> > 3. Metric #6 just track the rate/latency each time the controller is
> called
> > to initiate or resume the processing of a reassginment request.
> >
> > 4. Metrics #2 and #3. The issue with relying on metric #1 is that the
> > latter is sensitive to the frequency of metric collection. For example,
> if
> > the starting of the controller takes 30 secs and the metric is only
> > collected once a minute, one may not know the latency with just metric
> #1,
> > but will know the latency with metrics #2 and #3. Are you concerned about
> > the memory overhead of histograms? It doesn't seem that a couple of more
> > histograms will hurt.
> >
> > 5. Metric #9. Agreed. After KAFKA-5028, this will be reflected in the
> > remoteTimeMs of the controlled shutdown request.
> >
> > 6. Metrics #4 and #5. They are actually a bit different. The local time
> of
> > createTopic/deleteTopic just includes the time to add/delete the topic
> path
> > in ZK. The remote time includes the time that the controller processes
> the
> > request plus the time for the metadata to be propagated back to the
> > controller. So, knowing just the portion of the time spent in the
> > controller can still be useful.
> >
> > Hi, Isamel,
> >
> > Thanks the for proposal. A couple of more comments.,
> >
> > 10. It would be useful to add a new metrics for the controller queue
> size.
> > kafka.controller:type=ControllerStats,name=QueueSize
> >
> > 11. It would also be useful to know how long an event is waiting in the
> > controller queue before being processing. Perhaps, we can add a histogram
> > metric like the following.
> > kafka.controller:type=ControllerStats,name=QueueTimeMs
> >
> > Jun
> >
> > On Thu, Apr 27, 2017 at 11:39 AM, Joel Koshy <[email protected]>
> wrote:
> >
> > > Thanks for the KIP - couple of comments:
> > > - Do you intend to actually use yammer metrics? or use kafka-metrics
> and
> > > split the timer into an explicit rate and time? I think long term we
> > ought
> > > to move off yammer and use kafka-metrics only. Actually either is fine,
> > but
> > > we should ideally use only one in the long term - and I thought the
> plan
> > > was to use kafka-metrics.
> > > - metric #9 appears to be redundant since we already have per-API
> request
> > > rate and time metrics.
> > > - Same for metric #4, #5 (as there are request stats for
> > > DeleteTopicRequest - although it is possible for users to trigger
> deletes
> > > via ZK)
> > > - metric #2, #3 are potentially useful, but a bit overkill for a
> > > histogram. Alternative is to stick to last known value, but that
> doesn't
> > > play well with alerts if a high value isn't reset/decayed. Perhaps
> metric
> > > #1 would be sufficient to gauge slow start/resignation transitions.
> > > - metric #1 - some of the states may actually overlap
> > > - I don't actually understand the semantics of metric #6. Is it rate of
> > > partition reassignment triggers? does the number of partitions matter?
> > >
> > > Joel
> > >
> > > On Thu, Apr 27, 2017 at 8:04 AM, Tom Crayford <[email protected]>
> > > wrote:
> > >
> > >> Ismael,
> > >>
> > >> Great, that sounds lovely.
> > >>
> > >> I'd like a `Timer` (using yammer metrics parlance) over how long it
> took
> > >> to
> > >> process the event, so we can get at p99 and max times spent processing
> > >> things. Maybe we could even do a log at warning level if event
> > processing
> > >> takes over some timeout?
> > >>
> > >> Thanks
> > >>
> > >> Tom
> > >>
> > >> On Thu, Apr 27, 2017 at 3:59 PM, Ismael Juma <[email protected]>
> wrote:
> > >>
> > >> > Hi Tom,
> > >> >
> > >> > Yes, the plan is to merge KAFKA-5028 first and then use a lock-free
> > >> > approach for the new  metrics. I considered mentioning that in the
> KIP
> > >> > given KAFKA-5120, but didn't in the end. I'll add it to make it
> clear.
> > >> >
> > >> > Regarding locks, they are removed by KAFKA-5028, as you say. So, if
> I
> > >> > understand correctly, you are suggesting an event processing rate
> > metric
> > >> > with event type as a tag? Onur and Jun, what do you think?
> > >> >
> > >> > Ismael
> > >> >
> > >> > On Thu, Apr 27, 2017 at 3:47 PM, Tom Crayford <[email protected]
> >
> > >> > wrote:
> > >> >
> > >> > > Hi,
> > >> > >
> > >> > > We (Heroku) are very excited about this KIP, as we've struggled a
> > bit
> > >> > with
> > >> > > controller stability recently. Having these additional metrics
> would
> > >> be
> > >> > > wonderful.
> > >> > >
> > >> > > I'd like to ensure polling these metrics *doesn't* hold any locks
> > etc,
> > >> > > because, as noted in https://issues.apache.org/
> > jira/browse/KAFKA-5120
> > >> ,
> > >> > > that
> > >> > > lock can be held for quite some time. This may become not an issue
> > as
> > >> of
> > >> > > KAFKA-5028 though.
> > >> > >
> > >> > > Lastly, I'd love to see some metrics around how long the
> controller
> > >> > spends
> > >> > > inside its lock. We've been tracking an issue (
> > >> > > https://issues.apache.org/jira/browse/KAFKA-5116) where it can
> hold
> > >> the
> > >> > > lock for many, many minutes in a zk client listener thread when
> > >> > responding
> > >> > > to a single request. I'm not sure how that plays into
> > >> > > https://issues.apache.org/jira/browse/KAFKA-5028 (which I assume
> > will
> > >> > land
> > >> > > before this metrics patch), but it feels like there will be
> > equivalent
> > >> > > problems ("how long does it spend processing any individual
> message
> > >> from
> > >> > > the queue, broken down by message type").
> > >> > >
> > >> > > These are minor improvements though, the addition of more metrics
> to
> > >> the
> > >> > > controller is already going to be very helpful.
> > >> > >
> > >> > > Thanks
> > >> > >
> > >> > > Tom Crayford
> > >> > > Heroku Kafka
> > >> > >
> > >> > > On Thu, Apr 27, 2017 at 3:10 PM, Ismael Juma <[email protected]>
> > >> wrote:
> > >> > >
> > >> > > > Hi all,
> > >> > > >
> > >> > > > We've posted "KIP-143: Controller Health Metrics" for
> discussion:
> > >> > > >
> > >> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >> > > > 143%3A+Controller+Health+Metrics
> > >> > > >
> > >> > > > Please take a look. Your feedback is appreciated.
> > >> > > >
> > >> > > > Thanks,
> > >> > > > Ismael
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: [DISCUSS] KIP-143: Controller Health Metrics

Reply via email to