Hey Stanislav, Sorry for the delay on this. In the meantime I realized that the dead fetchers won't be removed from the fetcher map, so it's very easy to figure out how many dead and alive there are. I can collect them on broker level which I think gives a good enough information if there is a problem with a given broker. You do raise a good point with your idea that it's helpful to know which fetcher is acting up, although I find that solution less feasible due to the number of metrics generated. For this I'd use a JMX method that'd return a list of problematic fetcher threads but I'm not sure if we have to extend the scope of this KIP that much.
Best, Viktor On Mon, Mar 4, 2019 at 7:22 PM Stanislav Kozlovski <stanis...@confluent.io> wrote: > Hey Viktor, > > > however displaying the thread count (be it alive or dead) would still add > extra information regarding the failure, that a thread died during cleanup. > > I agree, I think it's worth adding. > > > > Doing this on the replica fetchers though would be a bit harder as the > number of replica fetchers is the (brokers-to-fetch-from * > fetchers-per-broker) and we don't really maintain the capacity information > or any kind of cluster information and I'm not sure we should. > > Perhaps we could split the metric per broker that is being fetched from? > i.e each replica fetcher would have a `dead-fetcher-threads` metric that > has the broker-id it's fetching from as a tag? > It would solve an observability question which I think is very important - > are we replicating from this broker at all? > On the other hand, this could potentially produce a lot of metric data with > a big cluster, so that is definitely something to consider as well. > > All in all, I think this is a great KIP and very much needed in my opinion. > I can't wait to see this roll out > > Best, > Stanislav > > On Mon, Feb 25, 2019 at 10:29 AM Viktor Somogyi-Vass < > viktorsomo...@gmail.com> wrote: > > > Hi Stanislav, > > > > Thanks for the feedback and sharing that discussion thread. > > > > I read your KIP and the discussion on it too and it seems like that'd > cover > > the same motivation I had with the log-cleaner-thread-count metric. This > > supposed to tell the count of the alive threads which might differ from > the > > config (I could've used a better name :) ). Now I'm thinking that using > > uncleanable-bytes, uncleanable-partition-count together with > > time-since-last-run would mostly cover the motivation I have in this KIP, > > however displaying the thread count (be it alive or dead) would still add > > extra information regarding the failure, that a thread died during > cleanup. > > > > You had a very good idea about instead of the alive threads, display the > > dead ones! That way we wouldn't need log-cleaner-current-live-thread-rate > > just a "dead-log-cleaner-thread-count" as it it would make easy to > trigger > > warnings based on that (if it's even > 0 then we can say there's a > > potential problem). > > Doing this on the replica fetchers though would be a bit harder as the > > number of replica fetchers is the (brokers-to-fetch-from * > > fetchers-per-broker) and we don't really maintain the capacity > information > > or any kind of cluster information and I'm not sure we should. It would > add > > too much responsibility to the class and wouldn't be a rock-solid > solution > > but I guess I have to look into it more. > > > > I don't think that restarting the cleaner threads would be helpful as the > > problems I've seen mostly are non-recoverable and requires manual user > > intervention and partly I agree what Colin said on the KIP-346 discussion > > thread about the problems experienced with HDFS. > > > > Best, > > Viktor > > > > > > On Fri, Feb 22, 2019 at 5:03 PM Stanislav Kozlovski < > > stanis...@confluent.io> > > wrote: > > > > > Hey Viktor, > > > > > > First off, thanks for the KIP! I think that it is almost always a good > > idea > > > to have more metrics. Observability never hurts. > > > > > > In regards to the LogCleaner: > > > * Do we need to know log-cleaner-thread-count? That should always be > > equal > > > to "log.cleaner.threads" if I'm not mistaken. > > > * log-cleaner-current-live-thread-rate - We already have the > > > "time-since-last-run-ms" metric which can let you know if something is > > > wrong with the log cleaning > > > As you said, we would like to have these two new metrics in order to > > > understand when a partial failure has happened - e.g only 1/3 log > cleaner > > > threads are alive. I'm wondering if it may make more sense to either: > > > a) restart the threads when they die > > > b) add a metric which shows the dead thread count. You should probably > > > always have a low-level alert in the case that any threads have died > > > > > > We had discussed a similar topic about thread revival and metrics in > > > KIP-346. Have you had a chance to look over that discussion? Here is > the > > > mailing discussion for that - > > > > > > > > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201807.mbox/%3ccanzzngyr_22go9swl67hedcm90xhvpyfzy_tezhz1mrizqk...@mail.gmail.com%3E > > > > > > Best, > > > Stanislav > > > > > > > > > > > > On Fri, Feb 22, 2019 at 11:18 AM Viktor Somogyi-Vass < > > > viktorsomo...@gmail.com> wrote: > > > > > > > Hi All, > > > > > > > > I'd like to start a discussion about exposing count gauge metrics for > > the > > > > replica fetcher and log cleaner thread counts. It isn't a long KIP > and > > > the > > > > motivation is very simple: monitoring the thread counts in these > cases > > > > would help with the investigation of various issues and might help in > > > > preventing more serious issues when a broker is in a bad state. Such > a > > > > scenario that we seen with users is that their disk fills up as the > log > > > > cleaner died for some reason and couldn't recover (like log > > corruption). > > > In > > > > this case an early warning would help in the root cause analysis > > process > > > as > > > > well as enable detecting and resolving the problem early on. > > > > > > > > The KIP is here: > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics > > > > > > > > I'd be happy to receive any feedback on this. > > > > > > > > Regards, > > > > Viktor > > > > > > > > > > > > > -- > > > Best, > > > Stanislav > > > > > > > > -- > Best, > Stanislav >