Hey Folks,

Since there have been no discussions on this in the past two weeks, I'll create a VOTE thread soon.
Thanks,
Viktor

On Thu, Mar 14, 2019 at 7:05 PM Viktor Somogyi-Vass <viktorsomo...@gmail.com> wrote:

> Hey Stanislav,
>
> Sorry for the delay on this. In the meantime I realized that the dead
> fetchers won't be removed from the fetcher map, so it's very easy to figure
> out how many dead and alive ones there are. I can collect them on the broker
> level, which I think gives good enough information about whether there is a
> problem with a given broker. You do raise a good point with your idea that
> it's helpful to know which fetcher is acting up, although I find that
> solution less feasible due to the number of metrics generated. For that I'd
> use a JMX method that returns a list of problematic fetcher threads, but I'm
> not sure we have to extend the scope of this KIP that much.
>
> Best,
> Viktor
>
> On Mon, Mar 4, 2019 at 7:22 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
>
>> Hey Viktor,
>>
>> > however displaying the thread count (be it alive or dead) would still
>> > add extra information regarding the failure, namely that a thread died
>> > during cleanup.
>>
>> I agree, I think it's worth adding.
>>
>> > Doing this on the replica fetchers though would be a bit harder as the
>> > number of replica fetchers is (brokers-to-fetch-from * fetchers-per-broker)
>> > and we don't really maintain the capacity information or any kind of
>> > cluster information, and I'm not sure we should.
>>
>> Perhaps we could split the metric per broker that is being fetched from?
>> I.e. each replica fetcher would have a `dead-fetcher-threads` metric that
>> has the broker-id it's fetching from as a tag. It would answer an
>> observability question which I think is very important - are we replicating
>> from this broker at all?
>> On the other hand, this could potentially produce a lot of metric data
>> with a big cluster, so that is definitely something to consider as well.
>>
>> All in all, I think this is a great KIP and very much needed in my
>> opinion. I can't wait to see this roll out.
>>
>> Best,
>> Stanislav
>>
>> On Mon, Feb 25, 2019 at 10:29 AM Viktor Somogyi-Vass <viktorsomo...@gmail.com> wrote:
>>
>> > Hi Stanislav,
>> >
>> > Thanks for the feedback and for sharing that discussion thread.
>> >
>> > I read your KIP and the discussion on it too, and it seems like it would
>> > cover the same motivation I had with the log-cleaner-thread-count metric.
>> > That metric is supposed to tell the count of the alive threads, which
>> > might differ from the config (I could've used a better name :) ). Now I'm
>> > thinking that using uncleanable-bytes and uncleanable-partition-count
>> > together with time-since-last-run would mostly cover the motivation I have
>> > in this KIP, however displaying the thread count (be it alive or dead)
>> > would still add extra information regarding the failure, namely that a
>> > thread died during cleanup.
>> >
>> > You had a very good idea: instead of the alive threads, display the dead
>> > ones! That way we wouldn't need log-cleaner-current-live-thread-rate, just
>> > a "dead-log-cleaner-thread-count", and it would make it easy to trigger
>> > warnings based on that (if it's ever > 0 then we can say there's a
>> > potential problem).
>> > Doing this on the replica fetchers though would be a bit harder as the
>> > number of replica fetchers is (brokers-to-fetch-from * fetchers-per-broker)
>> > and we don't really maintain the capacity information or any kind of
>> > cluster information, and I'm not sure we should. It would add too much
>> > responsibility to the class and wouldn't be a rock-solid solution, but I
>> > guess I have to look into it more.
>> >
>> > I don't think that restarting the cleaner threads would be helpful, as
>> > the problems I've seen are mostly non-recoverable and require manual user
>> > intervention, and I partly agree with what Colin said on the KIP-346
>> > discussion thread about the problems experienced with HDFS.
>> >
>> > Best,
>> > Viktor
>> >
>> >
>> > On Fri, Feb 22, 2019 at 5:03 PM Stanislav Kozlovski <stanis...@confluent.io> wrote:
>> >
>> > > Hey Viktor,
>> > >
>> > > First off, thanks for the KIP! I think that it is almost always a good
>> > > idea to have more metrics. Observability never hurts.
>> > >
>> > > In regards to the LogCleaner:
>> > > * Do we need to know log-cleaner-thread-count? That should always be
>> > > equal to "log.cleaner.threads" if I'm not mistaken.
>> > > * log-cleaner-current-live-thread-rate - we already have the
>> > > "time-since-last-run-ms" metric which can let you know if something is
>> > > wrong with the log cleaning.
>> > > As you said, we would like to have these two new metrics in order to
>> > > understand when a partial failure has happened - e.g. only 1/3 of the
>> > > log cleaner threads are alive. I'm wondering if it may make more sense
>> > > to either:
>> > > a) restart the threads when they die
>> > > b) add a metric which shows the dead thread count. You should probably
>> > > always have a low-level alert in case any threads have died.
>> > >
>> > > We had discussed a similar topic about thread revival and metrics in
>> > > KIP-346. Have you had a chance to look over that discussion? Here is the
>> > > mailing discussion for that -
>> > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201807.mbox/%3ccanzzngyr_22go9swl67hedcm90xhvpyfzy_tezhz1mrizqk...@mail.gmail.com%3E
>> > >
>> > > Best,
>> > > Stanislav
>> > >
>> > >
>> > > On Fri, Feb 22, 2019 at 11:18 AM Viktor Somogyi-Vass <viktorsomo...@gmail.com> wrote:
>> > >
>> > > > Hi All,
>> > > >
>> > > > I'd like to start a discussion about exposing count gauge metrics for
>> > > > the replica fetcher and log cleaner thread counts. It isn't a long KIP
>> > > > and the motivation is very simple: monitoring the thread counts in
>> > > > these cases would help with the investigation of various issues and
>> > > > might help in preventing more serious issues when a broker is in a bad
>> > > > state. One scenario that we've seen with users is that their disk
>> > > > fills up because the log cleaner died for some reason and couldn't
>> > > > recover (for instance after log corruption). In this case an early
>> > > > warning would help in the root cause analysis process as well as
>> > > > enable detecting and resolving the problem early on.
>> > > >
>> > > > The KIP is here:
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics
>> > > >
>> > > > I'd be happy to receive any feedback on this.
>> > > >
>> > > > Regards,
>> > > > Viktor
>> > >
>> > > --
>> > > Best,
>> > > Stanislav
>>
>> --
>> Best,
>> Stanislav
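
For readers who want a concrete picture of the gauges being discussed, here is a minimal, hypothetical sketch in Scala against the Yammer metrics 2.x API that Kafka brokers have historically used for broker-side metrics. The class name ThreadCountMetrics, the metric names "dead-log-cleaner-thread-count" and "dead-fetcher-threads", and the way the thread lists are obtained are all illustrative assumptions, not the names or wiring defined by KIP-434 or the Kafka code base.

import com.yammer.metrics.Metrics
import com.yammer.metrics.core.{Gauge, MetricName}

// Illustrative sketch only: class, metric and parameter names are assumptions,
// not taken from KIP-434 or the Kafka code base.
class ThreadCountMetrics(cleanerThreads: () => Seq[Thread],
                         fetcherThreadsByBroker: () => Map[Int, Seq[Thread]]) {

  // Broker-level gauge for dead log cleaner threads; alerting on any value
  // greater than zero gives the early warning discussed in the thread.
  Metrics.defaultRegistry().newGauge(
    new MetricName("kafka.log", "LogCleaner", "dead-log-cleaner-thread-count"),
    new Gauge[Int] {
      override def value(): Int = cleanerThreads().count(t => !t.isAlive)
    })

  // One gauge per source broker for dead replica fetcher threads, using the
  // MetricName scope field to carry the broker id as a tag-like qualifier.
  // This is the per-broker split suggested above; note that it emits one
  // metric per fetched-from broker, which adds up in a large cluster.
  def registerFetcherGauges(): Unit =
    fetcherThreadsByBroker().keys.foreach { brokerId =>
      Metrics.defaultRegistry().newGauge(
        new MetricName("kafka.server", "ReplicaFetcherManager",
          "dead-fetcher-threads", s"brokerId-$brokerId"),
        new Gauge[Int] {
          override def value(): Int =
            fetcherThreadsByBroker()
              .getOrElse(brokerId, Seq.empty[Thread])
              .count(t => !t.isAlive)
        })
    }
}

A monitoring system would then alert whenever either gauge reports a non-zero value, matching the "if it's ever > 0 there's a potential problem" rule of thumb from the thread.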