Hi,

I agree with Yakov that we can provide an option that controls the worker liveness checker's behavior when it observes that some worker has been blocked for too long. At the very least, it will serve as a workaround for cases where failing the node is too annoying.
A backups count threshold sounds good, but I don't understand how it will help when the cluster hangs. The simplest solution here is to raise an alert when some critical worker is blocked (we can improve WorkersRegistry for this purpose and expose the list of blocked workers) and optionally call the system-configured failure processor. BTW, the failure processor can be extended to perform arbitrary checks (e.g. backup count) and decide whether it should stop the node or not.

On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov <stku...@gmail.com> wrote:
>
> David, Yakov, I understand your fears. But liveness checks deal with
> _critical_ conditions, i.e. when such a condition is met we consider the
> node totally broken, and there is no sense in keeping it alive regardless of
> the data it contains. If we want to give it a chance, then the condition
> (long fsync etc.) should not be considered critical at all.
>
> Sat, Sep 8, 2018 at 15:18, Yakov Zhdanov <yzhda...@apache.org>:
>
> > Agree with David. We need to have an opportunity to set a backups count threshold
> > (at runtime also!) that will not allow any automatic stop if there would be
> > data loss. Andrey, what do you think?
> >
> > --Yakov
>
> --
> Best regards,
> Andrey Kuznetsov.
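
To make the failure processor idea from the reply above a bit more concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `BlockedWorker`, `BackupAwareFailureProcessor`, `onBlockedWorker` and the `availableBackups` argument are hypothetical names, not existing Ignite classes or APIs. It only shows the proposed policy: always record an alert for the blocked critical worker, and let the node be stopped only when enough backups remain that no data would be lost.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical record of a critical worker that has been blocked for too long. */
final class BlockedWorker {
    final String name;
    final long blockedMillis;

    BlockedWorker(String name, long blockedMillis) {
        this.name = name;
        this.blockedMillis = blockedMillis;
    }
}

/** Hypothetical failure processor: alerts first, stops the node only if backups allow it. */
final class BackupAwareFailureProcessor {
    /** Threshold is volatile so it can be changed at runtime. */
    private volatile int minBackups;

    /** Exposed list of alerts, standing in for a "list of blocked workers" view. */
    private final List<String> alerts = new CopyOnWriteArrayList<>();

    BackupAwareFailureProcessor(int minBackups) {
        this.minBackups = minBackups;
    }

    void setMinBackups(int minBackups) {
        this.minBackups = minBackups;
    }

    List<String> alerts() {
        return alerts;
    }

    /**
     * Called when a critical worker is detected as blocked.
     * availableBackups is assumed to be supplied by the caller; how it is
     * computed is outside this sketch.
     * Returns true if the node may be stopped, false if it should stay alive.
     */
    boolean onBlockedWorker(BlockedWorker worker, int availableBackups) {
        // Always alert, regardless of the stop decision.
        alerts.add("Critical worker blocked: " + worker.name
            + " for " + worker.blockedMillis + " ms");

        // Stop only when enough backups remain, so that stopping loses no data.
        return availableBackups >= minBackups;
    }
}
```

With a threshold of 1, a blocked worker on a node holding the only copy of some data (0 backups available) would produce an alert but no stop, while the same event with 1 backup available would allow the stop.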