Hey Viktor,

> however, displaying the thread count (be it alive or dead) would still add
> extra information about the failure, namely that a thread died during
> cleanup.

I agree, I think it's worth adding.
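
For illustration, here's a rough sketch of how such a dead-thread gauge
could look with the Yammer metrics library the broker already uses (all of
the names here are made up by me, not taken from the KIP):

    import com.yammer.metrics.Metrics
    import com.yammer.metrics.core.{Gauge, MetricName}

    // Rough sketch: count cleaner threads that are no longer alive.
    // "DeadThreadCount" and the group/type strings are placeholders.
    class CleanerDeadThreadGauge(cleanerThreads: Seq[Thread]) {
      private val name =
        new MetricName("kafka.log", "LogCleaner", "DeadThreadCount")
      Metrics.defaultRegistry.newGauge(name, new Gauge[Int] {
        // A thread that has exited (isAlive == false) counts as dead.
        override def value: Int = cleanerThreads.count(!_.isAlive)
      })
    }

Alerting on value > 0 would then be trivial on the monitoring side.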


> Doing this on the replica fetchers, though, would be a bit harder, as the
> number of replica fetchers is (brokers-to-fetch-from * fetchers-per-broker)
> and we don't really maintain the capacity information or any kind of
> cluster information there, and I'm not sure we should.

Perhaps we could split the metric per broker that is being fetched from?
I.e. each replica fetcher would expose a `dead-fetcher-threads` metric
tagged with the broker id it's fetching from.
It would answer an observability question which I think is very important:
are we replicating from this broker at all?
On the other hand, this could produce a lot of metric data in a big
cluster, so that is definitely something to consider as well.
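
To make the idea a bit more concrete, a rough sketch of the tagged variant
(again, the names and the per-broker map are my assumptions, not the
KIP's):

    import com.yammer.metrics.Metrics
    import com.yammer.metrics.core.{Gauge, MetricName}

    // Rough sketch: one dead-fetcher-threads gauge per source broker.
    // The broker id is carried in the metric scope here; in the broker
    // proper it would presumably be a metric tag. All names are
    // placeholders.
    class FetcherDeadThreadGauges(fetchersByBroker: Map[Int, Seq[Thread]]) {
      fetchersByBroker.foreach { case (brokerId, threads) =>
        val name = new MetricName("kafka.server", "ReplicaFetcherManager",
          "dead-fetcher-threads", s"broker-id.$brokerId")
        Metrics.defaultRegistry.newGauge(name, new Gauge[Int] {
          override def value: Int = threads.count(!_.isAlive)
        })
      }
    }

The gauges would of course need to be (re)registered whenever fetchers are
added or removed, which is part of the fan-out cost I mentioned above.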

All in all, I think this is a great KIP and very much needed. I can't wait
to see it roll out.

Best,
Stanislav

On Mon, Feb 25, 2019 at 10:29 AM Viktor Somogyi-Vass <
viktorsomo...@gmail.com> wrote:

> Hi Stanislav,
>
> Thanks for the feedback and sharing that discussion thread.
>
> I read your KIP and the discussion on it too, and it seems like that would
> cover the same motivation I had with the log-cleaner-thread-count metric.
> That metric is supposed to report the count of the alive threads, which
> might differ from the config (I could've used a better name :) ). Now I'm
> thinking that using uncleanable-bytes and uncleanable-partition-count
> together with time-since-last-run would mostly cover the motivation I have
> in this KIP; however, displaying the thread count (be it alive or dead)
> would still add extra information about the failure, namely that a thread
> died during cleanup.
>
> You had a very good idea: instead of the alive threads, display the dead
> ones! That way we wouldn't need log-cleaner-current-live-thread-rate, just
> a "dead-log-cleaner-thread-count", as it would make it easy to trigger
> warnings based on that (if it's ever > 0 then we can say there's a
> potential problem).
> Doing this on the replica fetchers, though, would be a bit harder, as the
> number of replica fetchers is (brokers-to-fetch-from * fetchers-per-broker)
> and we don't really maintain the capacity information or any kind of
> cluster information there, and I'm not sure we should. It would add too
> much responsibility to the class and wouldn't be a rock-solid solution,
> but I guess I have to look into it more.
>
> I don't think that restarting the cleaner threads would be helpful, as the
> problems I've seen are mostly non-recoverable and require manual user
> intervention, and I partly agree with what Colin said on the KIP-346
> discussion thread about the problems experienced with HDFS.
>
> Best,
> Viktor
>
>
> On Fri, Feb 22, 2019 at 5:03 PM Stanislav Kozlovski <
> stanis...@confluent.io>
> wrote:
>
> > Hey Viktor,
> >
> > First off, thanks for the KIP! I think that it is almost always a good
> > idea to have more metrics. Observability never hurts.
> >
> > In regards to the LogCleaner:
> > * Do we need to know log-cleaner-thread-count? That should always be
> > equal to "log.cleaner.threads" if I'm not mistaken.
> > * log-cleaner-current-live-thread-rate - We already have the
> > "time-since-last-run-ms" metric, which can let you know if something is
> > wrong with log cleaning.
> > As you said, we would like to have these two new metrics in order to
> > understand when a partial failure has happened - e.g. only 1/3 of the
> > log cleaner threads are alive. I'm wondering if it may make more sense
> > to either:
> > a) restart the threads when they die
> > b) add a metric which shows the dead thread count. You should probably
> > always have a low-level alert in case any threads have died.
> >
> > We had discussed a similar topic about thread revival and metrics in
> > KIP-346. Have you had a chance to look over that discussion? Here is the
> > mailing list discussion for that:
> >
> > http://mail-archives.apache.org/mod_mbox/kafka-dev/201807.mbox/%3ccanzzngyr_22go9swl67hedcm90xhvpyfzy_tezhz1mrizqk...@mail.gmail.com%3E
> >
> > Best,
> > Stanislav
> >
> >
> >
> > On Fri, Feb 22, 2019 at 11:18 AM Viktor Somogyi-Vass <
> > viktorsomo...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I'd like to start a discussion about exposing count gauge metrics for
> > > the replica fetcher and log cleaner thread counts. It isn't a long KIP
> > > and the motivation is very simple: monitoring the thread counts in
> > > these cases would help with the investigation of various issues and
> > > might help in preventing more serious issues when a broker is in a bad
> > > state. One scenario we have seen with users is that their disk fills
> > > up because the log cleaner died for some reason and couldn't recover
> > > (e.g. due to log corruption). In this case an early warning would help
> > > in the root cause analysis process as well as enable detecting and
> > > resolving the problem early on.
> > >
> > > The KIP is here:
> > >
> > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics
> > >
> > > I'd be happy to receive any feedback on this.
> > >
> > > Regards,
> > > Viktor
> > >
> >
> >
> > --
> > Best,
> > Stanislav
> >
>


-- 
Best,
Stanislav
