Hi,

I agree with Yakov that we can provide an option that controls the
worker liveness checker's behavior when it observes that some worker
has been blocked for too long.
At least it will be a workaround for cases when node failure is too
annoying.

A backup count threshold sounds good, but I don't understand how it
will help if the cluster hangs.

The simplest solution here is to raise an alert when some critical
worker is blocked (we can improve WorkersRegistry for this purpose and
expose the list of blocked workers) and, optionally, to call the
system-configured failure processor. BTW, the failure processor can be
extended to perform arbitrary checks (e.g. backup count) and decide
whether it should stop the node or not.
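
To make the idea concrete, here is a rough sketch of what I mean. It is
not the real Ignite code: LivenessSketch, WorkerRegistry, FailureHandler,
BackupAwareHandler and the check() method are all hypothetical names I
made up for illustration, and the backup count is assumed to come from
some caller-supplied source.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntSupplier;

// Hypothetical sketch only; none of these types are actual Ignite classes.
final class LivenessSketch {
    /** Decides what to do about blocked workers; returns true if the node should stop. */
    interface FailureHandler {
        boolean onBlockedWorkers(List<String> blockedWorkers);
    }

    /** Tracks the last heartbeat of each critical worker (stand-in for an extended WorkersRegistry). */
    static final class WorkerRegistry {
        private final Map<String, Long> lastHeartbeatMs = new ConcurrentHashMap<>();

        void heartbeat(String worker, long nowMs) {
            lastHeartbeatMs.put(worker, nowMs);
        }

        /** Returns the sorted names of workers silent for longer than timeoutMs. */
        List<String> blockedWorkers(long nowMs, long timeoutMs) {
            List<String> blocked = new ArrayList<>();
            for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
                if (nowMs - e.getValue() > timeoutMs)
                    blocked.add(e.getKey());
            }
            Collections.sort(blocked);
            return blocked;
        }
    }

    /** Handler that refuses to stop the node when too few backups remain. */
    static final class BackupAwareHandler implements FailureHandler {
        private final IntSupplier availableBackups; // assumed to be supplied by the caller
        private final int minBackups;

        BackupAwareHandler(IntSupplier availableBackups, int minBackups) {
            this.availableBackups = availableBackups;
            this.minBackups = minBackups;
        }

        @Override public boolean onBlockedWorkers(List<String> blockedWorkers) {
            // Stop only if enough backups remain to avoid data loss.
            return availableBackups.getAsInt() >= minBackups;
        }
    }

    /** One liveness check: alert first, then optionally consult the handler. */
    static boolean check(WorkerRegistry reg, long nowMs, long timeoutMs, FailureHandler handler) {
        List<String> blocked = reg.blockedWorkers(nowMs, timeoutMs);
        if (blocked.isEmpty())
            return false;
        System.err.println("Blocked critical workers: " + blocked); // the alert
        return handler != null && handler.onBlockedWorkers(blocked);
    }
}
```

With a null handler the check only alerts and never stops the node, which
would be the "workaround" mode; plugging in something like
BackupAwareHandler covers the backup count threshold case.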

On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com> wrote:
>
> David, Yakov, I understand your fears. But liveness checks deal with
> _critical_ conditions, i.e. when such a condition is met we consider the
> node totally broken, and there is no sense in keeping it alive regardless
> of the data it contains. If we want to give it a chance, then the condition
> (long fsync etc.) should not be considered critical at all.
>
> On Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org> wrote:
>
> > Agree with David. We need to have an opportunity to set a backup count
> > threshold (at runtime also!) that will not allow any automatic stop if
> > it would cause data loss. Andrey, what do you think?
> >
> > --Yakov
> >
>
>
> --
> Best regards,
>   Andrey Kuznetsov.
