When I've done this before, I've needed to find the oldest thread and
kill the node running it. From a language standpoint, Maxim's "without
progress" is better than "heartbeat". For example, what I'm most
interested in on a distributed system is which thread started the
earliest piece of work it has not yet completed, and when that thread
last made forward progress. You don't want to kill a node because a
thread is waiting on a lock held by a thread that went off-node and has
not gotten a response. If you don't understand the dependency
relationships, you will make incorrect recovery decisions.
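
To make that concrete, here is a minimal Java sketch of the bookkeeping
I mean (all names are hypothetical; none of this is Ignite API): track,
per worker, when its current task started and when it last made
progress, then pick the worker whose unfinished work is oldest.

    import java.util.Comparator;
    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    /** Illustrative sketch only: per-worker progress bookkeeping. */
    final class ProgressRegistry {
        static final class WorkerState {
            final String name;
            volatile long workStartedNanos;   // when the current task began
            volatile long lastProgressNanos;  // last observed forward progress

            WorkerState(String name, long now) {
                this.name = name;
                workStartedNanos = now;
                lastProgressNanos = now;
            }
        }

        private final Map<String, WorkerState> workers = new ConcurrentHashMap<>();

        void onTaskStarted(String worker) {
            workers.put(worker, new WorkerState(worker, System.nanoTime()));
        }

        void onProgress(String worker) {
            WorkerState s = workers.get(worker);
            if (s != null)
                s.lastProgressNanos = System.nanoTime();
        }

        void onTaskFinished(String worker) {
            workers.remove(worker);
        }

        /** The worker whose still-unfinished task started earliest: the
         *  primary suspect when deciding whether a node must be killed. */
        Optional<WorkerState> oldestInFlight() {
            return workers.values().stream()
                .min(Comparator.comparingLong(s -> s.workStartedNanos));
        }
    }

Crucially, before acting on oldestInFlight(), a real implementation
would also need the dependency information above: if that worker is
blocked waiting on another node's response, killing this node is the
wrong recovery decision.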

On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov <maxmu...@gmail.com> wrote:

> I think we should find exact answers to these questions:
>  1. What exactly is a `critical` issue?
>  2. How can we find critical issues?
>  3. How can we handle critical issues?
>
> First,
>  - Ignore uninterruptible actions (e.g. worker/service shutdown)
>  - Long I/O operations (there should be a configurable timeout for each type
> of usage)
>  - Infinite loops
>  - Stalled/deadlocked threads (and/or too many parked threads, excluding I/O)
>
> Second,
>  - The working queue is without progress (e.g. disco, exchange queues); see
> the sketch below
>  - Work hasn't been completed since the last heartbeat (checking
> milestones)
>  - Too many system resources used by a thread for a long period of time
> (allocated memory, CPU)
>  - Timing fields associated with each thread's status exceed a maximum time
> limit.
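>
> As an illustration of the first item above, a hedged sketch
> (hypothetical names, not actual Ignite code) of detecting a queue that
> makes no progress between monitor passes:
>
>     import java.util.Queue;
>
>     /** Illustrative sketch only: flags a queue whose head element has
>      *  not changed across successive monitor passes. */
>     final class QueueProgressCheck {
>         private final Queue<?> queue;
>         private Object lastHead;
>         private int stalledPasses;
>
>         QueueProgressCheck(Queue<?> queue) {
>             this.queue = queue;
>         }
>
>         /** Call periodically from a monitor thread. The identity
>          *  comparison is deliberate: the same element still at the head
>          *  means no item has been consumed since the last pass. */
>         boolean isStalled(int maxStalledPasses) {
>             Object head = queue.peek();
>             stalledPasses = (head != null && head == lastHead)
>                 ? stalledPasses + 1 : 0;
>             lastHead = head;
>             return stalledPasses >= maxStalledPasses;
>         }
>     }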
>
> Third (not too many options here),
>  - `log everything` should be the default behaviour in all these cases,
> since it may be difficult to find the cause after a restart.
>  - Wait for some interval of time, then kill the hanging node (the cluster
> should be configured to be stable enough to tolerate this)
>
> Questions,
>  - Not sure, but can workers miss their heartbeat deadlines if CPU load
> rises to 80-90%? Bursts of momentary overload can be expected behaviour, a
> normal part of system operation.
>  - Why have we decided that critical threads should monitor each other? For
> instance, if all the tasks were blocked and unable to run, a node reset
> would never occur. To my mind, a better solution is a separate monitor
> thread or pool (maybe both, with software and hardware checks) that not
> only checks heartbeats but monitors the rest of the system as well (a
> sketch follows).
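>
> A sketch of such a separate monitor (again hypothetical, just to show
> the shape of the idea): one daemon pool that periodically runs all
> registered checks, so a node reset can still happen even if every
> worker is blocked.
>
>     import java.util.List;
>     import java.util.concurrent.CopyOnWriteArrayList;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.ScheduledExecutorService;
>     import java.util.concurrent.TimeUnit;
>
>     /** Illustrative sketch only: a dedicated liveness monitor pool,
>      *  independent of the worker threads it watches. */
>     final class LivenessMonitor {
>         private final List<Runnable> checks = new CopyOnWriteArrayList<>();
>
>         private final ScheduledExecutorService pool =
>             Executors.newSingleThreadScheduledExecutor(r -> {
>                 Thread t = new Thread(r, "liveness-monitor");
>                 t.setDaemon(true); // must never block node shutdown
>                 return t;
>             });
>
>         void register(Runnable check) {
>             checks.add(check);
>         }
>
>         void start(long periodMillis) {
>             pool.scheduleAtFixedRate(
>                 () -> checks.forEach(Runnable::run),
>                 periodMillis, periodMillis, TimeUnit.MILLISECONDS);
>         }
>     }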
>
> On Mon, 10 Sep 2018 at 00:07 David Harvey <syssoft...@gmail.com> wrote:
>
> > It would be safer to restart the entire cluster than to remove the last
> > node for a cache that should be redundant.
> >
> > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura <ag...@apache.org> wrote:
> >
> > > Hi,
> > >
> > > I agree with Yakov that we can provide some option that manages the
> > > worker liveness checker's behavior when it observes that some worker
> > > has been blocked too long.
> > > At least it will provide some workaround for cases when node failure
> > > is too annoying.
> > >
> > > A backup count threshold sounds good, but I don't understand how it
> > > will help in the case of a cluster hang.
> > >
> > > The simplest solution here is to raise an alert when some critical
> > > worker is blocked (we can improve WorkersRegistry for this purpose and
> > > expose the list of blocked workers) and optionally call a
> > > system-configured failure processor. BTW, the failure processor can be
> > > extended in order to perform any checks (e.g. backup count) and decide
> > > whether it should stop the node or not; see the sketch below.
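> > >
> > > A hedged sketch of such an extension, assuming Ignite's
> > > FailureHandler contract (boolean onFailure(Ignite, FailureContext));
> > > the backup-count check itself is deliberately simplified and
> > > hypothetical:
> > >
> > >     import org.apache.ignite.Ignite;
> > >     import org.apache.ignite.configuration.CacheConfiguration;
> > >     import org.apache.ignite.failure.FailureContext;
> > >     import org.apache.ignite.failure.FailureHandler;
> > >
> > >     /** Illustrative sketch only: log the failure, then decide whether
> > >      *  stopping this node is safe from a redundancy standpoint. */
> > >     public class BackupAwareFailureHandler implements FailureHandler {
> > >         private final int minBackups; // hypothetical safety threshold
> > >
> > >         public BackupAwareFailureHandler(int minBackups) {
> > >             this.minBackups = minBackups;
> > >         }
> > >
> > >         @Override public boolean onFailure(Ignite ignite, FailureContext ctx) {
> > >             // `log everything` first: the cause is hard to find after restart.
> > >             ignite.log().error("Critical failure: " + ctx.type(), ctx.error());
> > >
> > >             // Simplified check: allow the stop only if every cache keeps
> > >             // at least minBackups backup copies of its data.
> > >             boolean safeToStop = ignite.cacheNames().stream()
> > >                 .map(name -> ignite.cache(name)
> > >                     .getConfiguration(CacheConfiguration.class))
> > >                 .allMatch(cfg -> cfg.getBackups() >= minBackups);
> > >
> > >             return safeToStop; // true => the node gets invalidated and stopped
> > >         }
> > >     }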
> > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com> wrote:
> > > >
> > > > David, Yakov, I understand your fears. But liveness checks deal with
> > > > _critical_ conditions, i.e. when such a condition is met we consider
> > > > the node totally broken, and there is no sense in keeping it alive
> > > > regardless of the data it contains. If we want to give it a chance,
> > > > then the condition (long fsync etc.) should not be considered as
> > > > critical at all.
> > > >
> > > > Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org>:
> > > >
> > > > > Agree with David. We need to have an opportunity to set a backup
> > > > > count threshold (at runtime also!) that will not allow any
> > > > > automatic stop if it would cause data loss. Andrey, what do you
> > > > > think?
> > > > >
> > > > > --Yakov
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >   Andrey Kuznetsov.
> > >
> >
> --
> Maxim Muzafarov
>
