Nikolay, I agree: a user should be able to disable both the thread liveness check and the checkpoint read lock timeout check, either from the config or via a system property.
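To make the proposal concrete, below is a rough sketch of how such an opt-out could look from user code. The failure-handler part relies on the per-failure-type suppression Andrey mentions below; the exact setter and enum names (setIgnoredFailureTypes, SYSTEM_WORKER_BLOCKED) should be double-checked against master, and the JVM property name in the comment is only a suggestion, not an existing switch.

-------------
import java.util.Collections;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class LivenessCheckOptOut {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Keep the usual failure handler, but tell it to ignore
        // "system worker blocked" failures, so the liveness watchdog
        // never stops the node on its own.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
        hnd.setIgnoredFailureTypes(Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));
        cfg.setFailureHandler(hnd);

        // Proposed alternative: a plain JVM switch, e.g.
        // -DIGNITE_DISABLE_CRITICAL_WORKER_LIVENESS_CHECK=true
        // (hypothetical property name, not implemented yet).

        Ignition.start(cfg);
    }
}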
On Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <nizhi...@apache.org> wrote:

> Hello, Igniters.
>
> I found that this feature can't be disabled from config. The only way to
> disable it is from the JMX bean.
>
> I think it is very dangerous: if we have some corner case or a bug in this
> watchdog, it can make Ignite unusable. I propose to implement the possibility
> to disable this feature both from config and from JVM options.
>
> What do you think?
>
> On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > Maxim,
> >
> > Thanks for being attentive! It's definitely a typo. Could you please
> > create an issue?
> >
> > On Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <maxmu...@gmail.com> wrote:
> > > Folks,
> > >
> > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
> > > the exchange future wrapped with a double `blockingSectionEnd` call. Is it
> > > correct? I just want to understand this change and how I should use it in
> > > the future.
> > >
> > > Should I file a new issue to fix this? I think the `blockingSectionBegin`
> > > method should be used here.
> > >
> > > -------------
> > > blockingSectionEnd();
> > >
> > > try {
> > >     resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > } finally {
> > >     blockingSectionEnd();
> > > }
> > >
> > > [1]
> > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > >
> > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <daradu...@gmail.com> wrote:
> > > > Andrey Gura, thank you for the answer!
> > > >
> > > > I agree that wrapping of the 'init' method reduces the profit of the
> > > > watchdog service in case of the PME worker, but in other cases we should
> > > > wrap all possible long sections of GridDhtPartitionExchangeFuture. For
> > > > example, the 'onCacheChangeRequest' method, or
> > > > 'cctx.affinity().onCacheChangeRequest' inside it, because it may take
> > > > significant time (reproducer attached).
> > > >
> > > > I only want to point out a possible issue which may allow an end user to
> > > > halt the Ignite cluster accidentally.
> > > >
> > > > I'm sure that PME experts know how to fix this issue properly.
> > > >
> > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <ag...@apache.org> wrote:
> > > > > Vyacheslav,
> > > > >
> > > > > The exchange worker is strongly tied to
> > > > > GridDhtPartitionExchangeFuture#init and it is ok. The exchange worker
> > > > > also shouldn't be blocked for a long time, but in reality it happens.
> > > > > It also means that your change doesn't make sense.
> > > > >
> > > > > What actually makes sense is identification of the places which are
> > > > > intentionally blocking. Maybe some places/actions should be braced by
> > > > > blocking guards.
> > > > >
> > > > > If you have failing tests, please make sure that your failureHandler
> > > > > is NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > [CRITICAL_WORKER_BLOCKED].
> > > > >
> > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur
> > > > > <daradu...@gmail.com> wrote:
> > > > > > Hi Igniters!
> > > > > >
> > > > > > Thank you for this important improvement!
> > > > > >
> > > > > > I've looked through the implementation and noticed that
> > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > > > > blocked section. This means it is easy to halt the node in case of
> > > > > > long-running actions during PME, for example when we create a cache
> > > > > > with a StoreFactory which connects to a 3rd party DB.
> > > > > >
> > > > > > I'm not sure that it is the right behavior.
> > > > > >
> > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and a
> > > > > > possible fix.
> > > > > >
> > > > > > Andrey, could you please take a look and confirm that it makes sense?
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > >
> > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <stku...@gmail.com>
> > > > > > wrote:
> > > > > > > Denis,
> > > > > > >
> > > > > > > I've created the ticket [1] with a short description of the
> > > > > > > functionality.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > >
> > > > > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda <dma...@apache.org> wrote:
> > > > > > > > Andrey K. and G.,
> > > > > > > >
> > > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
> > > > > > > > can help with the documentation.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Denis
> > > > > > > >
> > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <ag...@apache.org>
> > > > > > > > wrote:
> > > > > > > > > Andrey,
> > > > > > > > >
> > > > > > > > > finally your change is merged to the master branch.
> > > > > > > > > Congratulations and thank you very much! :)
> > > > > > > > >
> > > > > > > > > I think that the next step is a feature that will allow
> > > > > > > > > signalling about blocked threads to the monitoring tools via an
> > > > > > > > > MXBean.
> > > > > > > > >
> > > > > > > > > I hope you will continue development of this feature and provide
> > > > > > > > > your vision in a new JIRA issue.
> > > > > > > > >
> > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov
> > > > > > > > > <stku...@gmail.com> wrote:
> > > > > > > > > > David, Maxim!
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all
> > > > > > > > > > of them right now: the scope is much broader than the scope of
> > > > > > > > > > the change I implement. I have had a talk with a group of
> > > > > > > > > > Ignite committers, and we agreed to complete the change as
> > > > > > > > > > follows.
> > > > > > > > > > - Blocking instructions in system-critical threads which may
> > > > > > > > > >   reasonably last long should be explicitly excluded from the
> > > > > > > > > >   monitoring.
> > > > > > > > > > - Failure handlers should have a setting to suppress some
> > > > > > > > > >   failures on a per-failure-type basis.
> > > > > > > > > > According to this I have updated the implementation: [1]
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > >
> > > > > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey
> > > > > > > > > > <syssoft...@gmail.com> wrote:
> > > > > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > > > > thread, and kill the node running that. From a language
> > > > > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > > > > distributed system is which thread started the work it has
> > > > > > > > > > > not completed the earliest, and when did that thread last
> > > > > > > > > > > make forward progress. You don't want to kill a node because
> > > > > > > > > > > a thread is waiting on a lock held by a thread that went
> > > > > > > > > > > off-node and has not gotten a response. If you don't
> > > > > > > > > > > understand the dependency relationships, you will make
> > > > > > > > > > > incorrect recovery decisions.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov
> > > > > > > > > > > <maxmu...@gmail.com> wrote:
> > > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > >
> > > > > > > > > > > > First,
> > > > > > > > > > > > - Ignore uninterruptable actions (e.g. worker\service
> > > > > > > > > > > >   shutdown)
> > > > > > > > > > > > - Long I/O operations (should be a configurable timeout for
> > > > > > > > > > > >   each type of usage)
> > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked
> > > > > > > > > > > >   threads, excluding I/O)
> > > > > > > > > > > >
> > > > > > > > > > > > Second,
> > > > > > > > > > > > - The working queue is without progress (e.g. disco,
> > > > > > > > > > > >   exchange queues)
> > > > > > > > > > > > - Work hasn't been completed since the last heartbeat
> > > > > > > > > > > >   (checking milestones)
> > > > > > > > > > > > - Too many system resources used by a thread for a long
> > > > > > > > > > > >   period of time (allocated memory, CPU)
> > > > > > > > > > > > - Timing fields associated with each thread status exceeded
> > > > > > > > > > > >   a maximum time limit.
> > > > > > > > > > > >
> > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > - `log everything` should be the default behaviour in all
> > > > > > > > > > > >   these cases, since it may be difficult to find the cause
> > > > > > > > > > > >   after the restart.
> > > > > > > > > > > > - Wait some interval of time and kill the hanging node
> > > > > > > > > > > >   (the cluster should be configured to be stable enough)
> > > > > > > > > > > >
> > > > > > > > > > > > Questions,
> > > > > > > > > > > > - Not sure, but can workers miss their heartbeat deadlines
> > > > > > > > > > > >   if the CPU loads up to 80%-90%? Bursts of momentary
> > > > > > > > > > > >   overloads can be expected behaviour as a normal part of
> > > > > > > > > > > >   system operations.
> > > > > > > > > > > > - Why do we decide that critical threads should monitor
> > > > > > > > > > > >   each other? For instance, if all the tasks were blocked
> > > > > > > > > > > >   and unable to run, a node reset would never occur. As for
> > > > > > > > > > > >   me, a better solution is to use a separate monitor thread
> > > > > > > > > > > >   or pool (maybe both with software and hardware checks)
> > > > > > > > > > > >   that not only checks heartbeats but monitors the rest of
> > > > > > > > > > > >   the system as well.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey
> > > > > > > > > > > > <syssoft...@gmail.com> wrote:
> > > > > > > > > > > > > It would be safer to restart the entire cluster than to
> > > > > > > > > > > > > remove the last node for a cache that should be
> > > > > > > > > > > > > redundant.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura
> > > > > > > > > > > > > <ag...@apache.org> wrote:
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I agree with Yakov that we can provide some option that
> > > > > > > > > > > > > > manages the worker liveness checker behavior in case of
> > > > > > > > > > > > > > observing that some worker is blocked for too long. At
> > > > > > > > > > > > > > least it will be some workaround for cases when the
> > > > > > > > > > > > > > node failure is too annoying.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Backups count threshold sounds good, but I don't
> > > > > > > > > > > > > > understand how it will help in case of cluster hanging.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The simplest solution here is an alert in case of
> > > > > > > > > > > > > > blocking of some critical worker (we can improve
> > > > > > > > > > > > > > WorkersRegistry for this purpose and expose the list of
> > > > > > > > > > > > > > blocked workers) and optionally a call to the system
> > > > > > > > > > > > > > configured failure processor. BTW, the failure
> > > > > > > > > > > > > > processor can be extended in order to perform any
> > > > > > > > > > > > > > checks (e.g. backup count) and decide whether it should
> > > > > > > > > > > > > > stop the node or not.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov
> > > > > > > > > > > > > > <stku...@gmail.com> wrote:
> > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness
> > > > > > > > > > > > > > > checks deal with _critical_ conditions, i.e. when
> > > > > > > > > > > > > > > such a condition is met we conclude that the node is
> > > > > > > > > > > > > > > totally broken, and there is no sense in keeping it
> > > > > > > > > > > > > > > alive regardless of the data it contains. If we want
> > > > > > > > > > > > > > > to give it a chance, then the condition (long fsync
> > > > > > > > > > > > > > > etc.) should not be considered critical at all.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov
> > > > > > > > > > > > > > > <yzhda...@apache.org> wrote:
> > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to
> > > > > > > > > > > > > > > > set a backups count threshold (at runtime also!)
> > > > > > > > > > > > > > > > that will not allow any automatic stop if there
> > > > > > > > > > > > > > > > would be a data loss. Andrey, what do you think?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Andrey Kuznetsov.
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Andrey Kuznetsov.
> > > > > >
> > > > > > --
> > > > > > Best Regards, Vyacheslav D.
> > > >
> > > > --
> > > > Best Regards, Vyacheslav D.
> > >
> > > --
> > > Maxim Muzafarov
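P.S. Regarding the snippet Maxim quotes above: the fix he suggests would presumably just replace the first call, along these lines (the variable names are taken from the quoted fragment, so treat this as an illustration rather than the actual patch):

-------------
// Mark the blocking wait on the exchange future as an intentional
// blocking section, so the worker liveness watchdog does not treat
// the wait as a hang.
blockingSectionBegin();

try {
    resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
} finally {
    blockingSectionEnd();
}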