Re: [DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Viktor Somogyi-Vass Thu, 14 Mar 2019 11:05:26 -0700

Hey Stanislav,

Sorry for the delay on this. In the meantime I realized that the dead
fetchers won't be removed from the fetcher map, so it's very easy to figure
out how many dead and alive there are. I can collect them on broker level
which I think gives a good enough information if there is a problem with a
given broker. You do raise a good point with your idea that it's helpful to
know which fetcher is acting up, although I find that solution less
feasible due to the number of metrics generated. For this I'd use a JMX
method that'd return a list of problematic fetcher threads but I'm not sure
if we have to extend the scope of this KIP that much.


Best,
Viktor

On Mon, Mar 4, 2019 at 7:22 PM Stanislav Kozlovski <stanis...@confluent.io>
wrote:

> Hey Viktor,
>
> > however displaying the thread count (be it alive or dead) would still add
> extra information regarding the failure, that a thread died during cleanup.
>
> I agree, I think it's worth adding.
>
>
> > Doing this on the replica fetchers though would be a bit harder as the
> number of replica fetchers is the (brokers-to-fetch-from *
> fetchers-per-broker) and we don't really maintain the capacity information
> or any kind of cluster information and I'm not sure we should.
>
> Perhaps we could split the metric per broker that is being fetched from?
> i.e each replica fetcher would have a `dead-fetcher-threads` metric that
> has the broker-id it's fetching from as a tag?
> It would solve an observability question which I think is very important -
> are we replicating from this broker at all?
> On the other hand, this could potentially produce a lot of metric data with
> a big cluster, so that is definitely something to consider as well.
>
> All in all, I think this is a great KIP and very much needed in my opinion.
> I can't wait to see this roll out
>
> Best,
> Stanislav
>
> On Mon, Feb 25, 2019 at 10:29 AM Viktor Somogyi-Vass <
> viktorsomo...@gmail.com> wrote:
>
> > Hi Stanislav,
> >
> > Thanks for the feedback and sharing that discussion thread.
> >
> > I read your KIP and the discussion on it too and it seems like that'd
> cover
> > the same motivation I had with the log-cleaner-thread-count metric. This
> > supposed to tell the count of the alive threads which might differ from
> the
> > config (I could've used a better name :) ). Now I'm thinking that using
> > uncleanable-bytes, uncleanable-partition-count together with
> > time-since-last-run would mostly cover the motivation I have in this KIP,
> > however displaying the thread count (be it alive or dead) would still add
> > extra information regarding the failure, that a thread died during
> cleanup.
> >
> > You had a very good idea about instead of the alive threads, display the
> > dead ones! That way we wouldn't need log-cleaner-current-live-thread-rate
> > just a "dead-log-cleaner-thread-count" as it it would make easy to
> trigger
> > warnings based on that (if it's even > 0 then we can say there's a
> > potential problem).
> > Doing this on the replica fetchers though would be a bit harder as the
> > number of replica fetchers is the (brokers-to-fetch-from *
> > fetchers-per-broker) and we don't really maintain the capacity
> information
> > or any kind of cluster information and I'm not sure we should. It would
> add
> > too much responsibility to the class and wouldn't be a rock-solid
> solution
> > but I guess I have to look into it more.
> >
> > I don't think that restarting the cleaner threads would be helpful as the
> > problems I've seen mostly are non-recoverable and requires manual user
> > intervention and partly I agree what Colin said on the KIP-346 discussion
> > thread about the problems experienced with HDFS.
> >
> > Best,
> > Viktor
> >
> >
> > On Fri, Feb 22, 2019 at 5:03 PM Stanislav Kozlovski <
> > stanis...@confluent.io>
> > wrote:
> >
> > > Hey Viktor,
> > >
> > > First off, thanks for the KIP! I think that it is almost always a good
> > idea
> > > to have more metrics. Observability never hurts.
> > >
> > > In regards to the LogCleaner:
> > > * Do we need to know log-cleaner-thread-count? That should always be
> > equal
> > > to "log.cleaner.threads" if I'm not mistaken.
> > > * log-cleaner-current-live-thread-rate -  We already have the
> > > "time-since-last-run-ms" metric which can let you know if something is
> > > wrong with the log cleaning
> > > As you said, we would like to have these two new metrics in order to
> > > understand when a partial failure has happened - e.g only 1/3 log
> cleaner
> > > threads are alive. I'm wondering if it may make more sense to either:
> > > a) restart the threads when they die
> > > b) add a metric which shows the dead thread count. You should probably
> > > always have a low-level alert in the case that any threads have died
> > >
> > > We had discussed a similar topic about thread revival and metrics in
> > > KIP-346. Have you had a chance to look over that discussion? Here is
> the
> > > mailing discussion for that -
> > >
> > >
> >
> http://mail-archives.apache.org/mod_mbox/kafka-dev/201807.mbox/%3ccanzzngyr_22go9swl67hedcm90xhvpyfzy_tezhz1mrizqk...@mail.gmail.com%3E
> > >
> > > Best,
> > > Stanislav
> > >
> > >
> > >
> > > On Fri, Feb 22, 2019 at 11:18 AM Viktor Somogyi-Vass <
> > > viktorsomo...@gmail.com> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I'd like to start a discussion about exposing count gauge metrics for
> > the
> > > > replica fetcher and log cleaner thread counts. It isn't a long KIP
> and
> > > the
> > > > motivation is very simple: monitoring the thread counts in these
> cases
> > > > would help with the investigation of various issues and might help in
> > > > preventing more serious issues when a broker is in a bad state. Such
> a
> > > > scenario that we seen with users is that their disk fills up as the
> log
> > > > cleaner died for some reason and couldn't recover (like log
> > corruption).
> > > In
> > > > this case an early warning would help in the root cause analysis
> > process
> > > as
> > > > well as enable detecting and resolving the problem early on.
> > > >
> > > > The KIP is here:
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics
> > > >
> > > > I'd be happy to receive any feedback on this.
> > > >
> > > > Regards,
> > > > Viktor
> > > >
> > >
> > >
> > > --
> > > Best,
> > > Stanislav
> > >
> >
>
>
> --
> Best,
> Stanislav
>

Re: [DISCUSS] KIP-434: Add Replica Fetcher and Log Cleaner Count Metrics

Reply via email to