Hi Dmitry,

I do not think it is a good idea to mix failures of different threads into a single event type.
- Practice shows that the most common source of problems is the exchange thread
- if the disco worker has died, the node will be excluded from the topology safely
- "nio-acceptor" can be spawned from multiple places where GridNioServer is started, and not all of them are critical
- "grid-nio-worker-tcp-comm" is an internal thread which doesn't do any complex processing, so the risk of its crash is minimal

We could track most of them, but the death of different threads may call for different actions on the user's side. So I propose to start with the exchange thread only for now.

Another important point is that FailureProcessingPolicy should get enough information about what happened in order to decide how to react. E.g., as I explained earlier, IgniteOutOfMemoryException *is not a critical error*. Nasty, but not deadly. And the node should not be stopped blindly in response to this event.

Vladimir.

On Fri, Dec 1, 2017 at 3:50 AM, Denis Magda <dma...@apache.org> wrote:

> Hi Dmitriy,
>
> I’m totally for the FailureProcessingPolicy addition to
> IgniteConfiguration.
>
> Apart from this, may I ask you to create corresponding documentation tickets
> for the 2.4 release and the “documentation” component? Only for the improvements
> that are getting into the next release. Basically, you can aggregate them if
> it helps. Feel free to assign the ticket to me right away.
>
> —
> Denis
>
> > On Nov 30, 2017, at 10:31 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com>
> > wrote:
> >
> > Hi, Igniters!
> >
> > We have a set of internal problems which require a graceful node shutdown
> > or another configured reaction (see discussion thread
> > http://apache-ignite-developers.2346864.n4.nabble.com/Ignite-Enhancement-Proposal-7-Internal-problems-detection-td24460.html
> > ):
> > - IgniteOutOfMemoryException -
> > https://issues.apache.org/jira/browse/IGNITE-6892
> > - Persistence errors - https://issues.apache.org/jira/browse/IGNITE-6891
> > - ExchangeWorker exits with error -
> > https://issues.apache.org/jira/browse/IGNITE-6890
> >
> > First, I propose to reconsider the 3rd problem as "System worker exit while node
> > still running (node stopping process has not been started)", because we
> > have at least 5 worker classes whose running is critical for the node to work.
> >
> > These workers are:
> > - partition-exchanger (ExchangeWorker)
> > - disco-event-worker
> > - nio-acceptor
> > - grid-nio-worker-tcp-comm-*
> > - grid-timeout-worker
> >
> > Second, I propose to use FailureProcessingPolicy (already implemented in the
> > scope of task IGNITE-6890) to define the reaction to the 1st and 2nd detected
> > problems too. This policy can be configured similarly to SegmentationPolicy
> > in IgniteConfiguration.
> >
> > Opinions?
> >
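
To make the point about the policy needing enough information more concrete, here is a minimal Java sketch. Everything in it is hypothetical and is not the code from IGNITE-6890: the names FailureKind, Action, FailureInfo, ExchangeOnlyPolicy and the onFailure signature are assumptions made for illustration. Only the three problem categories and the "partition-exchanger" worker name come from this thread.

```java
public class FailurePolicySketch {
    /** Hypothetical failure kinds, mirroring the three problems listed in the thread. */
    public enum FailureKind { OUT_OF_MEMORY, PERSISTENCE_ERROR, SYSTEM_WORKER_EXIT }

    /** Hypothetical reactions a policy may choose. */
    public enum Action { NOOP, STOP_NODE }

    /** Hypothetical context object: tells the policy what actually happened. */
    public static final class FailureInfo {
        public final FailureKind kind;
        public final String workerName; // null when no worker is involved
        public final Throwable cause;

        public FailureInfo(FailureKind kind, String workerName, Throwable cause) {
            this.kind = kind;
            this.workerName = workerName;
            this.cause = cause;
        }
    }

    /** Assumed shape of the policy callback; the real interface may differ. */
    public interface FailureProcessingPolicy {
        Action onFailure(FailureInfo info);
    }

    /** Illustrative policy: stop the node only when the exchange worker has died. */
    public static final class ExchangeOnlyPolicy implements FailureProcessingPolicy {
        @Override public Action onFailure(FailureInfo info) {
            boolean exchangeDied = info.kind == FailureKind.SYSTEM_WORKER_EXIT
                && info.workerName != null
                && info.workerName.startsWith("partition-exchanger");

            // Anything else, including an out-of-memory error, does not stop the node here.
            return exchangeDied ? Action.STOP_NODE : Action.NOOP;
        }
    }

    public static void main(String[] args) {
        FailureProcessingPolicy plc = new ExchangeOnlyPolicy();

        FailureInfo oom = new FailureInfo(
            FailureKind.OUT_OF_MEMORY, null, new RuntimeException("out of memory"));
        FailureInfo exch = new FailureInfo(
            FailureKind.SYSTEM_WORKER_EXIT, "partition-exchanger", new RuntimeException("worker exited"));

        System.out.println("OOM -> " + plc.onFailure(oom));
        System.out.println("Exchange worker exit -> " + plc.onFailure(exch));
    }
}
```

Because the policy receives the failure kind and the worker name instead of a bare "something failed" signal, a user could supply a different implementation that reacts to other workers or to persistence errors without changing the detection code.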