Nikolay, I agree: a user should be able to disable both the thread liveness check and the checkpoint read lock timeout check, either from the config or via a system property.
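To make the proposal concrete, below is a rough sketch of how such an opt-out could look from user code. The failure-handler part relies on the per-failure-type suppression Andrey mentions below; the exact setter and enum names (setIgnoredFailureTypes, SYSTEM_WORKER_BLOCKED) should be double-checked against master, and the JVM property name in the comment is only a suggestion, not an existing switch.

-------------
import java.util.Collections;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.FailureType;
import org.apache.ignite.failure.StopNodeOrHaltFailureHandler;

public class LivenessCheckOptOut {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Keep the usual failure handler, but tell it to ignore
        // "system worker blocked" failures, so the liveness watchdog
        // never stops the node on its own.
        StopNodeOrHaltFailureHandler hnd = new StopNodeOrHaltFailureHandler();
        hnd.setIgnoredFailureTypes(Collections.singleton(FailureType.SYSTEM_WORKER_BLOCKED));
        cfg.setFailureHandler(hnd);

        // Proposed alternative: a plain JVM switch, e.g.
        // -DIGNITE_DISABLE_CRITICAL_WORKER_LIVENESS_CHECK=true
        // (hypothetical property name, not implemented yet).

        Ignition.start(cfg);
    }
}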
On Fri, 28 Sep 2018 at 11:30, Nikolay Izhikov <nizhi...@apache.org> wrote:

> Hello, Igniters.
>
> I found that this feature can't be disabled from config. The only way to
> disable it is from the JMX bean.
>
> I think it is very dangerous: if we have some corner case or a bug in this
> watchdog, it can make Ignite unusable. I propose to implement the possibility
> to disable this feature both from config and from JVM options.
>
> What do you think?
>
> On Thu, 27/09/2018 at 16:14 +0300, Andrey Kuznetsov wrote:
> > Maxim,
> >
> > Thanks for being attentive! It's definitely a typo. Could you please
> > create an issue?
> >
> > On Thu, 27 Sep 2018 at 16:00, Maxim Muzafarov <maxmu...@gmail.com> wrote:
> > > Folks,
> > >
> > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
> > > the exchange future wrapped with a double `blockingSectionEnd` call. Is it
> > > correct? I just want to understand this change and how I should use it in
> > > the future.
> > >
> > > Should I file a new issue to fix this? I think the `blockingSectionBegin`
> > > method should be used here.
> > >
> > > -------------
> > > blockingSectionEnd();
> > >
> > > try {
> > >     resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > } finally {
> > >     blockingSectionEnd();
> > > }
> > >
> > > [1]
> > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > >
> > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <daradu...@gmail.com> wrote:
> > > > Andrey Gura, thank you for the answer!
> > > >
> > > > I agree that wrapping of the 'init' method reduces the profit of the
> > > > watchdog service in case of the PME worker, but in other cases we should
> > > > wrap all possible long sections of GridDhtPartitionExchangeFuture. For
> > > > example, the 'onCacheChangeRequest' method, or
> > > > 'cctx.affinity().onCacheChangeRequest' inside it, because it may take
> > > > significant time (reproducer attached).
> > > >
> > > > I only want to point out a possible issue which may allow an end user to
> > > > halt the Ignite cluster accidentally.
> > > >
> > > > I'm sure that PME experts know how to fix this issue properly.
> > > >
> > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <ag...@apache.org> wrote:
> > > > > Vyacheslav,
> > > > >
> > > > > The exchange worker is strongly tied to
> > > > > GridDhtPartitionExchangeFuture#init and it is ok. The exchange worker
> > > > > also shouldn't be blocked for a long time, but in reality it happens.
> > > > > It also means that your change doesn't make sense.
> > > > >
> > > > > What actually makes sense is identification of the places which are
> > > > > intentionally blocking. Maybe some places/actions should be braced by
> > > > > blocking guards.
> > > > >
> > > > > If you have failing tests, please make sure that your failureHandler
> > > > > is NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > [CRITICAL_WORKER_BLOCKED].
> > > > >
> > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur
> > > > > <daradu...@gmail.com> wrote:
> > > > > > Hi Igniters!
> > > > > >
> > > > > > Thank you for this important improvement!
> > > > > >
> > > > > > I've looked through the implementation and noticed that
> > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in a
> > > > > > blocked section. This means it is easy to halt the node in case of
> > > > > > long-running actions during PME, for example when we create a cache
> > > > > > with a StoreFactory which connects to a 3rd party DB.
> > > > > >
> > > > > > I'm not sure that it is the right behavior.
> > > > > >
> > > > > > I filed the issue [1] and prepared the PR [2] with a reproducer and a
> > > > > > possible fix.
> > > > > >
> > > > > > Andrey, could you please take a look and confirm that it makes sense?
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > >
> > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <stku...@gmail.com>
> > > > > > wrote:
> > > > > > > Denis,
> > > > > > >
> > > > > > > I've created the ticket [1] with a short description of the
> > > > > > > functionality.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > >
> > > > > > > On Mon, 24 Sep 2018 at 17:46, Denis Magda <dma...@apache.org> wrote:
> > > > > > > > Andrey K. and G.,
> > > > > > > >
> > > > > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
> > > > > > > > can help with the documentation.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Denis
> > > > > > > >
> > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <ag...@apache.org>
> > > > > > > > wrote:
> > > > > > > > > Andrey,
> > > > > > > > >
> > > > > > > > > finally your change is merged to the master branch.
> > > > > > > > > Congratulations and thank you very much! :)
> > > > > > > > >
> > > > > > > > > I think that the next step is a feature that will allow
> > > > > > > > > signalling about blocked threads to the monitoring tools via an
> > > > > > > > > MXBean.
> > > > > > > > >
> > > > > > > > > I hope you will continue development of this feature and provide
> > > > > > > > > your vision in a new JIRA issue.
> > > > > > > > >
> > > > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov
> > > > > > > > > <stku...@gmail.com> wrote:
> > > > > > > > > > David, Maxim!
> > > > > > > > > >
> > > > > > > > > > Thanks a lot for your ideas. Unfortunately, I can't adopt all
> > > > > > > > > > of them right now: the scope is much broader than the scope of
> > > > > > > > > > the change I implement. I have had a talk with a group of
> > > > > > > > > > Ignite committers, and we agreed to complete the change as
> > > > > > > > > > follows.
> > > > > > > > > > - Blocking instructions in system-critical threads which may
> > > > > > > > > >   reasonably last long should be explicitly excluded from the
> > > > > > > > > >   monitoring.
> > > > > > > > > > - Failure handlers should have a setting to suppress some
> > > > > > > > > >   failures on a per-failure-type basis.
> > > > > > > > > > According to this I have updated the implementation: [1]
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > > > > >
> > > > > > > > > > On Mon, 10 Sep 2018 at 22:35, David Harvey
> > > > > > > > > > <syssoft...@gmail.com> wrote:
> > > > > > > > > > > When I've done this before, I've needed to find the oldest
> > > > > > > > > > > thread, and kill the node running that. From a language
> > > > > > > > > > > standpoint, Maxim's "without progress" is better than
> > > > > > > > > > > "heartbeat". For example, what I'm most interested in on a
> > > > > > > > > > > distributed system is which thread started the work it has
> > > > > > > > > > > not completed the earliest, and when did that thread last
> > > > > > > > > > > make forward progress. You don't want to kill a node because
> > > > > > > > > > > a thread is waiting on a lock held by a thread that went
> > > > > > > > > > > off-node and has not gotten a response. If you don't
> > > > > > > > > > > understand the dependency relationships, you will make
> > > > > > > > > > > incorrect recovery decisions.
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov
> > > > > > > > > > > <maxmu...@gmail.com> wrote:
> > > > > > > > > > > > I think we should find exact answers to these questions:
> > > > > > > > > > > > 1. What exactly is a `critical` issue?
> > > > > > > > > > > > 2. How can we find critical issues?
> > > > > > > > > > > > 3. How can we handle critical issues?
> > > > > > > > > > > >
> > > > > > > > > > > > First,
> > > > > > > > > > > > - Ignore uninterruptable actions (e.g. worker\service
> > > > > > > > > > > >   shutdown)
> > > > > > > > > > > > - Long I/O operations (should be a configurable timeout for
> > > > > > > > > > > >   each type of usage)
> > > > > > > > > > > > - Infinite loops
> > > > > > > > > > > > - Stalled\deadlocked threads (and\or too many parked
> > > > > > > > > > > >   threads, excluding I/O)
> > > > > > > > > > > >
> > > > > > > > > > > > Second,
> > > > > > > > > > > > - The working queue is without progress (e.g. disco,
> > > > > > > > > > > >   exchange queues)
> > > > > > > > > > > > - Work hasn't been completed since the last heartbeat
> > > > > > > > > > > >   (checking milestones)
> > > > > > > > > > > > - Too many system resources used by a thread for a long
> > > > > > > > > > > >   period of time (allocated memory, CPU)
> > > > > > > > > > > > - Timing fields associated with each thread status exceeded
> > > > > > > > > > > >   a maximum time limit.
> > > > > > > > > > > >
> > > > > > > > > > > > Third (not too many options here),
> > > > > > > > > > > > - `log everything` should be the default behaviour in all
> > > > > > > > > > > >   these cases, since it may be difficult to find the cause
> > > > > > > > > > > >   after the restart.
> > > > > > > > > > > > - Wait some interval of time and kill the hanging node
> > > > > > > > > > > >   (the cluster should be configured to be stable enough)
> > > > > > > > > > > >
> > > > > > > > > > > > Questions,
> > > > > > > > > > > > - Not sure, but can workers miss their heartbeat deadlines
> > > > > > > > > > > >   if the CPU loads up to 80%-90%? Bursts of momentary
> > > > > > > > > > > >   overloads can be expected behaviour as a normal part of
> > > > > > > > > > > >   system operations.
> > > > > > > > > > > > - Why do we decide that critical threads should monitor
> > > > > > > > > > > >   each other? For instance, if all the tasks were blocked
> > > > > > > > > > > >   and unable to run, a node reset would never occur. As for
> > > > > > > > > > > >   me, a better solution is to use a separate monitor thread
> > > > > > > > > > > >   or pool (maybe both with software and hardware checks)
> > > > > > > > > > > >   that not only checks heartbeats but monitors the rest of
> > > > > > > > > > > >   the system as well.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey
> > > > > > > > > > > > <syssoft...@gmail.com> wrote:
> > > > > > > > > > > > > It would be safer to restart the entire cluster than to
> > > > > > > > > > > > > remove the last node for a cache that should be
> > > > > > > > > > > > > redundant.
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura
> > > > > > > > > > > > > <ag...@apache.org> wrote:
> > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I agree with Yakov that we can provide some option that
> > > > > > > > > > > > > > manages the worker liveness checker behavior in case of
> > > > > > > > > > > > > > observing that some worker is blocked for too long. At
> > > > > > > > > > > > > > least it will be some workaround for cases when the
> > > > > > > > > > > > > > node failure is too annoying.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Backups count threshold sounds good, but I don't
> > > > > > > > > > > > > > understand how it will help in case of cluster hanging.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The simplest solution here is an alert in case of
> > > > > > > > > > > > > > blocking of some critical worker (we can improve
> > > > > > > > > > > > > > WorkersRegistry for this purpose and expose the list of
> > > > > > > > > > > > > > blocked workers) and optionally a call to the system
> > > > > > > > > > > > > > configured failure processor. BTW, the failure
> > > > > > > > > > > > > > processor can be extended in order to perform any
> > > > > > > > > > > > > > checks (e.g. backup count) and decide whether it should
> > > > > > > > > > > > > > stop the node or not.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov
> > > > > > > > > > > > > > <stku...@gmail.com> wrote:
> > > > > > > > > > > > > > > David, Yakov, I understand your fears. But liveness
> > > > > > > > > > > > > > > checks deal with _critical_ conditions, i.e. when
> > > > > > > > > > > > > > > such a condition is met we conclude that the node is
> > > > > > > > > > > > > > > totally broken, and there is no sense in keeping it
> > > > > > > > > > > > > > > alive regardless of the data it contains. If we want
> > > > > > > > > > > > > > > to give it a chance, then the condition (long fsync
> > > > > > > > > > > > > > > etc.) should not be considered critical at all.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sat, 8 Sep 2018 at 15:18, Yakov Zhdanov
> > > > > > > > > > > > > > > <yzhda...@apache.org> wrote:
> > > > > > > > > > > > > > > > Agree with David. We need to have an opportunity to
> > > > > > > > > > > > > > > > set a backups count threshold (at runtime also!)
> > > > > > > > > > > > > > > > that will not allow any automatic stop if there
> > > > > > > > > > > > > > > > would be a data loss. Andrey, what do you think?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > --Yakov
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > > Andrey Kuznetsov.
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Maxim Muzafarov
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Andrey Kuznetsov.
> > > > > > >
> > > > > > > --
> > > > > > > Best regards,
> > > > > > > Andrey Kuznetsov.
> > > > > >
> > > > > > --
> > > > > > Best Regards, Vyacheslav D.
> > > >
> > > > --
> > > > Best Regards, Vyacheslav D.
> > >
> > > --
> > > Maxim Muzafarov
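P.S. Regarding the snippet Maxim quotes above: the fix he suggests would presumably just replace the first call, along these lines (the variable names are taken from the quoted fragment, so treat this as an illustration rather than the actual patch):

-------------
// Mark the blocking wait on the exchange future as an intentional
// blocking section, so the worker liveness watchdog does not treat
// the wait as a hang.
blockingSectionBegin();

try {
    resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
} finally {
    blockingSectionEnd();
}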