Hey Folks,

Since there has been no discussion on this in the past two weeks, I'll
create a VOTE thread soon.

Thanks,
Viktor

On Thu, Mar 14, 2019 at 7:05 PM Viktor Somogyi-Vass <viktorsomo...@gmail.com>
wrote:

> Hey Stanislav,
>
> Sorry for the delay on this. In the meantime I realized that the dead
> fetchers won't be removed from the fetcher map, so it's very easy to figure
> out how many of them are dead and how many are alive. I can collect this at
> the broker level, which I think gives good enough information to tell
> whether there is a problem with a given broker. You do raise a good point
> with your idea that it's helpful to know which fetcher is acting up,
> although I find that solution less feasible due to the number of metrics it
> would generate. For this I'd rather use a JMX method that'd return a list
> of problematic fetcher threads, but I'm not sure if we have to extend the
> scope of this KIP that much.
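>
> To make that concrete, here is a rough, hypothetical sketch (FetcherKey,
> the map layout and the method names are my assumptions, not the actual
> AbstractFetcherManager internals): since dead fetcher threads stay in the
> map, a broker-level dead count and a list of the problematic threads can
> both be derived from it.
>
> import scala.collection.Map
>
> // Hypothetical key: which broker we fetch from and which fetcher slot.
> case class FetcherKey(sourceBrokerId: Int, fetcherId: Int)
>
> class FetcherHealth(fetcherThreadMap: Map[FetcherKey, Thread]) {
>
>   // Broker-level gauge value: fetcher threads that have terminated.
>   def deadFetcherThreadCount: Int =
>     fetcherThreadMap.values.count(t => !t.isAlive)
>
>   // Candidate JMX operation: names of the problematic fetcher threads, so
>   // an operator can see which fetcher is acting up without registering one
>   // metric per thread.
>   def problematicFetcherThreads: Seq[String] =
>     fetcherThreadMap.collect { case (_, t) if !t.isAlive => t.getName }.toSeq
> }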
>
> Best,
> Viktor
>
> On Mon, Mar 4, 2019 at 7:22 PM Stanislav Kozlovski <stanis...@confluent.io>
> wrote:
>
>> Hey Viktor,
>>
>> > however displaying the thread count (be it alive or dead) would still
>> > add extra information regarding the failure, namely that a thread died
>> > during cleanup.
>>
>> I agree, I think it's worth adding.
>>
>>
>> > Doing this for the replica fetchers, though, would be a bit harder, as
>> > the number of replica fetchers is (brokers-to-fetch-from *
>> > fetchers-per-broker), and we don't really maintain that capacity
>> > information or any kind of cluster information; I'm not sure we should.
>>
>> Perhaps we could split the metric per broker that is being fetched from?
>> i.e. each replica fetcher would have a `dead-fetcher-threads` metric that
>> has the broker-id it's fetching from as a tag?
>> It would solve an observability question which I think is very important:
>> are we replicating from this broker at all?
>> On the other hand, this could potentially produce a lot of metric data in
>> a big cluster, so that is definitely something to consider as well.
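>>
>> As a rough illustration of that split (the metric name, the map shape and
>> FetcherKey below are my assumptions, not anything from the KIP), the
>> per-broker values backing such tagged gauges could be derived like this:
>>
>> // Sketch: one dead-fetcher-threads value per source broker, keyed by the
>> // broker-id that would go into the metric tag. The number of gauges grows
>> // with the number of brokers fetched from, which is the cardinality
>> // concern mentioned above.
>> case class FetcherKey(sourceBrokerId: Int, fetcherId: Int)
>>
>> def deadFetcherThreadsByBroker(
>>     fetcherThreadMap: Map[FetcherKey, Thread]): Map[Int, Int] =
>>   fetcherThreadMap
>>     .groupBy { case (key, _) => key.sourceBrokerId }
>>     .map { case (brokerId, fetchers) =>
>>       brokerId -> fetchers.values.count(t => !t.isAlive)
>>     }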
>>
>> All in all, I think this is a great KIP and very much needed in my
>> opinion.
>> I can't wait to see this roll out.
>>
>> Best,
>> Stanislav
>>
>> On Mon, Feb 25, 2019 at 10:29 AM Viktor Somogyi-Vass <
>> viktorsomo...@gmail.com> wrote:
>>
>> > Hi Stanislav,
>> >
>> > Thanks for the feedback and sharing that discussion thread.
>> >
>> > I read your KIP and the discussion on it too, and it seems like that'd
>> > cover the same motivation I had with the log-cleaner-thread-count metric.
>> > That metric is supposed to tell the count of the alive threads, which
>> > might differ from the config (I could've used a better name :) ). Now I'm
>> > thinking that using uncleanable-bytes and uncleanable-partition-count
>> > together with time-since-last-run would mostly cover the motivation I
>> > have in this KIP; however, displaying the thread count (be it alive or
>> > dead) would still add extra information regarding the failure, namely
>> > that a thread died during cleanup.
>> >
>> > You had a very good idea: instead of the alive threads, display the dead
>> > ones! That way we wouldn't need log-cleaner-current-live-thread-rate,
>> > just a "dead-log-cleaner-thread-count", as it would make it easy to
>> > trigger warnings based on that (if it's ever > 0 then we can say there's
>> > a potential problem).
>> > Doing this for the replica fetchers, though, would be a bit harder, as
>> > the number of replica fetchers is (brokers-to-fetch-from *
>> > fetchers-per-broker), and we don't really maintain that capacity
>> > information or any kind of cluster information; I'm not sure we should.
>> > It would add too much responsibility to the class and wouldn't be a
>> > rock-solid solution, but I guess I have to look into it more.
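>> >
>> > Just to illustrate the dead-count idea (the class and field names below
>> > are made up for illustration and not the actual LogCleaner internals;
>> > "log.cleaner.threads" is the existing config), the value could simply be
>> > the configured thread count minus the threads still alive:
>> >
>> > // Hypothetical sketch: any value > 0 means at least one cleaner thread
>> > // died during cleanup and is worth alerting on.
>> > class LogCleanerHealth(configuredThreads: Int, cleanerThreads: Seq[Thread]) {
>> >   def deadLogCleanerThreadCount: Int =
>> >     configuredThreads - cleanerThreads.count(_.isAlive)
>> > }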
>> >
>> > I don't think that restarting the cleaner threads would be helpful, as
>> > the problems I've seen are mostly non-recoverable and require manual user
>> > intervention, and I partly agree with what Colin said on the KIP-346
>> > discussion thread about the problems experienced with HDFS.
>> >
>> > Best,
>> > Viktor
>> >
>> >
>> > On Fri, Feb 22, 2019 at 5:03 PM Stanislav Kozlovski <
>> > stanis...@confluent.io>
>> > wrote:
>> >
>> > > Hey Viktor,
>> > >
>> > > First off, thanks for the KIP! I think that it is almost always a good
>> > > idea to have more metrics. Observability never hurts.
>> > >
>> > > In regards to the LogCleaner:
>> > > * Do we need to know log-cleaner-thread-count? That should always be
>> > > equal to "log.cleaner.threads" if I'm not mistaken.
>> > > * log-cleaner-current-live-thread-rate - We already have the
>> > > "time-since-last-run-ms" metric which can let you know if something is
>> > > wrong with the log cleaning.
>> > > As you said, we would like to have these two new metrics in order to
>> > > understand when a partial failure has happened - e.g. only 1/3 of the
>> > > log cleaner threads are alive. I'm wondering if it may make more sense
>> > > to either:
>> > > a) restart the threads when they die
>> > > b) add a metric which shows the dead thread count. You should probably
>> > > always have a low-level alert in case any threads have died (a rough
>> > > sketch of such a check is below).
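>> > >
>> > > Purely as an illustration (the MBean name below is my guess at how such
>> > > a gauge might be registered, not something defined in the KIP), an alert
>> > > check could read the value over JMX:
>> > >
>> > > import java.lang.management.ManagementFactory
>> > > import javax.management.ObjectName
>> > >
>> > > // Reads a hypothetical dead-thread-count gauge from the local MBean
>> > > // server and complains if any cleaner thread has died.
>> > > val mbeanServer = ManagementFactory.getPlatformMBeanServer
>> > > val gauge = new ObjectName("kafka.log:type=LogCleaner,name=DeadThreadCount")
>> > > val dead = mbeanServer.getAttribute(gauge, "Value").asInstanceOf[Int]
>> > > if (dead > 0)
>> > >   System.err.println(s"ALERT: $dead log cleaner thread(s) have died")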
>> > >
>> > > We had discussed a similar topic about thread revival and metrics in
>> > > KIP-346. Have you had a chance to look over that discussion? Here is
>> > > the mailing list discussion for that:
>> > > http://mail-archives.apache.org/mod_mbox/kafka-dev/201807.mbox/%3ccanzzngyr_22go9swl67hedcm90xhvpyfzy_tezhz1mrizqk...@mail.gmail.com%3E
>> > >
>> > > Best,
>> > > Stanislav
>> > >
>> > >
>> > >
>> > > On Fri, Feb 22, 2019 at 11:18 AM Viktor Somogyi-Vass <
>> > > viktorsomo...@gmail.com> wrote:
>> > >
>> > > > Hi All,
>> > > >
>> > > > I'd like to start a discussion about exposing gauge metrics for the
>> > > > replica fetcher and log cleaner thread counts. It isn't a long KIP and
>> > > > the motivation is very simple: monitoring the thread counts in these
>> > > > cases would help with the investigation of various issues and might
>> > > > help in preventing more serious issues when a broker is in a bad
>> > > > state. One scenario we've seen with users is that their disk fills up
>> > > > because the log cleaner died for some reason and couldn't recover
>> > > > (e.g. due to log corruption). In this case an early warning would help
>> > > > in the root cause analysis process as well as enable detecting and
>> > > > resolving the problem early on.
>> > > >
>> > > > The KIP is here:
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-434%3A+Add+Replica+Fetcher+and+Log+Cleaner+Count+Metrics
>> > > >
>> > > > I'd be happy to receive any feedback on this.
>> > > >
>> > > > Regards,
>> > > > Viktor
>> > > >
>> > >
>> > >
>> > > --
>> > > Best,
>> > > Stanislav
>> > >
>> >
>>
>>
>> --
>> Best,
>> Stanislav
>>
>
