Denis,

Yes, but can we look at the proposed API before we dig into the implementation?

On Tue, Nov 28, 2017 at 9:43 PM, Denis Magda <dma...@apache.org> wrote:

> I think the failure processing policy should be configured via
> IgniteConfiguration in a way similar to the segmentation policies.
>
> --
> Denis
>
> > On Nov 27, 2017, at 11:28 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> >
> > Dmitry,
> >
> > How will these policies be configured? Do you have any API in mind?
> >
> > On Thu, Nov 23, 2017 at 6:26 PM, Denis Magda <dma...@apache.org> wrote:
> >
> >> No objections here. Additional policies like EXEC might be added later
> >> depending on user needs.
> >>
> >> --
> >> Denis
> >>
> >>> On Nov 23, 2017, at 2:26 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
> >>>
> >>> Denis,
> >>> I propose starting with the first three policies (they are already
> >>> implemented and just await some code cleanup, commit & review).
> >>> As for the fourth policy (EXEC), I think it is rather an additional
> >>> property (some script path) than a policy.
> >>>
> >>> 2017-11-23 0:43 GMT+03:00 Denis Magda <dma...@apache.org>:
> >>>
> >>>> Just provide FailureProcessingPolicy with possible reactions:
> >>>> - NOOP - exceptions will be reported, metrics will be triggered, but
> >>>>   the affected Ignite process won't be touched.
> >>>> - HALT (or STOP or KILL) - all the actions of NOOP + Ignite process
> >>>>   termination.
> >>>> - RESTART - NOOP actions + process restart.
> >>>> - EXEC - execute a custom script provided by the user.
> >>>>
> >>>> If needed, the policy can be set per known failure, such as OOM or
> >>>> persistence errors, so that the user can act accordingly based on
> >>>> the context.
> >>>>
> >>>> --
> >>>> Denis
> >>>>
> >>>>> On Nov 21, 2017, at 11:43 PM, Vladimir Ozerov <voze...@gridgain.com> wrote:
> >>>>>
> >>>>> In the first iteration I would focus only on reporting facilities, to
> >>>>> let the administrator spot dangerous situations. And in the second
> >>>>> phase, when all reporting and metrics are ready, we can think about
> >>>>> some automatic actions.
> >>>>>
> >>>>> On Wed, Nov 22, 2017 at 10:39 AM, Mikhail Cherkasov <mcherka...@gridgain.com> wrote:
> >>>>>
> >>>>>> Hi Anton,
> >>>>>>
> >>>>>> I don't think that we should shut down a node in case of
> >>>>>> IgniteOOMException. If one node has no space, then the others
> >>>>>> probably don't have it either, so rebalancing will cause IgniteOOM
> >>>>>> on all other nodes and will kill the whole cluster. I think for some
> >>>>>> configurations the cluster should survive and allow the user to
> >>>>>> clean the cache and/or add more nodes.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Mikhail.
> >>>>>>
> >>>>>> On Nov 20, 2017, at 6:53 PM, "Anton Vinogradov" <avinogra...@gridgain.com> wrote:
> >>>>>>
> >>>>>>> Igniters,
> >>>>>>>
> >>>>>>> Internal problems may, and unfortunately do, cause unexpected
> >>>>>>> cluster behavior.
> >>>>>>> We should define the behavior in case any internal problem happens.
> >>>>>>>
> >>>>>>> Well-known internal problems can be split into:
> >>>>>>> 1) OOM or any other reason causing a node crash
> >>>>>>>
> >>>>>>> 2) Situations requiring graceful node shutdown with custom
> >>>>>>> notification
> >>>>>>> - IgniteOutOfMemoryException
> >>>>>>> - Persistence errors
> >>>>>>> - ExchangeWorker exits with error
> >>>>>>>
> >>>>>>> 3) Performance issues that should be covered by metrics
> >>>>>>> - GC STW duration
> >>>>>>> - Timed out tasks and jobs
> >>>>>>> - TX deadlock
> >>>>>>> - Hung Tx (waits for some service)
> >>>>>>> - Java deadlocks
> >>>>>>>
> >>>>>>> I created a special issue [1] to make sure all these metrics will
> >>>>>>> be presented in WebConsole or VisorConsole (which is preferred?)
> >>>>>>>
> >>>>>>> 4) Situations requiring an external monitoring implementation
> >>>>>>> - GC STW duration exceeds the maximum possible length (the node
> >>>>>>>   should be stopped before STW finishes)
> >>>>>>>
> >>>>>>> All these problems were reported by different people some time ago,
> >>>>>>> so we should reanalyze each of them and, possibly, find better ways
> >>>>>>> to solve them than those described in the issues.
> >>>>>>>
> >>>>>>> P.S. IEP-7 [2] already contains 9 issues, feel free to mention
> >>>>>>> something else :)
> >>>>>>>
> >>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-6961
> >>>>>>> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-7%3A+Ignite+internal+problems+detection
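
To make the API question concrete, here is a rough sketch of the kind of thing that could be put up for review. It only models the three reactions discussed in the thread (EXEC is left out) and a per-failure override, so that, for example, OOM can stay NOOP while other failures stop the node. FailureType, FailureProcessingConfiguration and all method names are hypothetical and chosen for illustration; none of this is existing Ignite API, and in practice the setters would presumably live on IgniteConfiguration.

```java
import java.util.EnumMap;
import java.util.Map;

public class FailurePolicySketch {
    /** Reactions proposed in the thread; EXEC is intentionally left out. */
    enum FailureProcessingPolicy { NOOP, HALT, RESTART }

    /** Hypothetical failure categories, taken from the examples in the thread. */
    enum FailureType { OUT_OF_MEMORY, PERSISTENCE_ERROR, EXCHANGE_WORKER_ERROR }

    /** Hypothetical configuration holder: one default policy plus per-failure overrides. */
    static final class FailureProcessingConfiguration {
        private FailureProcessingPolicy dfltPlc = FailureProcessingPolicy.NOOP;

        private final Map<FailureType, FailureProcessingPolicy> overrides =
            new EnumMap<>(FailureType.class);

        FailureProcessingConfiguration setDefaultPolicy(FailureProcessingPolicy plc) {
            dfltPlc = plc;
            return this;
        }

        FailureProcessingConfiguration setPolicy(FailureType type, FailureProcessingPolicy plc) {
            overrides.put(type, plc);
            return this;
        }

        /** Returns the override for the given failure type, or the default policy. */
        FailureProcessingPolicy policyFor(FailureType type) {
            return overrides.getOrDefault(type, dfltPlc);
        }
    }

    public static void main(String[] args) {
        // Stop the node on any failure except OOM, which is only reported.
        FailureProcessingConfiguration cfg = new FailureProcessingConfiguration()
            .setDefaultPolicy(FailureProcessingPolicy.HALT)
            .setPolicy(FailureType.OUT_OF_MEMORY, FailureProcessingPolicy.NOOP);

        for (FailureType type : FailureType.values())
            System.out.println(type + " -> " + cfg.policyFor(type));
    }
}
```

The per-failure map is one possible answer to the concern that halting on OOM could take down the whole cluster; a single global policy would be simpler but could not express that case.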