Dmitry,

It seems we have established that no single action fits all cases,
so it is a good idea to let the user decide what to do.
We should introduce something like

interface IgniteFailureHandler {
   IgniteFailureAction onFailure(IgniteFailureCause cause);
}

public enum IgniteFailureAction {
    RESTART_JVM,
    STOP,
    NOOP;
}

and the ability to set it in IgniteConfiguration.
We should also provide a default implementation of IgniteFailureHandler, which
is enabled by default and can be replaced by user code.
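
A minimal, self-contained sketch of what such a default handler might look
like is below. Only IgniteFailureHandler, IgniteFailureAction and the name
IgniteFailureCause come from this thread; the FailureType enum, the fields of
IgniteFailureCause, the DefaultFailureHandler class and the chosen reactions
are assumptions made for illustration, not existing Ignite API.

```java
import java.util.Objects;

/** Hypothetical failure categories, named after the problems listed in this thread. */
enum FailureType {
    OUT_OF_MEMORY,      // IgniteOutOfMemoryException (IGNITE-6892)
    PERSISTENCE_ERROR,  // persistence errors (IGNITE-6891)
    SYSTEM_WORKER_EXIT  // a system worker exited with an error (IGNITE-6890)
}

/** Assumed shape of the cause object; the thread does not define its fields. */
final class IgniteFailureCause {
    private final FailureType type;
    private final String workerName; // null when no worker is involved
    private final Throwable error;   // may be null

    IgniteFailureCause(FailureType type, String workerName, Throwable error) {
        this.type = Objects.requireNonNull(type, "type");
        this.workerName = workerName;
        this.error = error;
    }

    FailureType type() { return type; }
    String workerName() { return workerName; }
    Throwable error() { return error; }
}

enum IgniteFailureAction { RESTART_JVM, STOP, NOOP }

interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureCause cause);
}

/** Illustrative default policy; the reactions chosen here are assumptions. */
final class DefaultFailureHandler implements IgniteFailureHandler {
    @Override
    public IgniteFailureAction onFailure(IgniteFailureCause cause) {
        switch (cause.type()) {
            case OUT_OF_MEMORY:
                // Nasty but not deadly: do not stop the node blindly.
                return IgniteFailureAction.NOOP;
            case SYSTEM_WORKER_EXIT:
                // Start with the exchange thread only, as proposed in the thread.
                String name = cause.workerName();
                return name != null && name.startsWith("partition-exchanger")
                    ? IgniteFailureAction.STOP
                    : IgniteFailureAction.NOOP;
            default:
                // Assumption: persistence errors stop the node.
                return IgniteFailureAction.STOP;
        }
    }
}

public class FailureHandlerSketch {
    public static void main(String[] args) {
        IgniteFailureHandler handler = new DefaultFailureHandler();
        IgniteFailureCause cause = new IgniteFailureCause(
            FailureType.SYSTEM_WORKER_EXIT, "partition-exchanger", new IllegalStateException("worker died"));
        System.out.println(handler.onFailure(cause));
    }
}
```

The point of passing a cause object rather than a bare event is that the
handler gets enough information to react differently to different failures;
a user could replace DefaultFailureHandler with a lambda or their own class.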

On Fri, Dec 1, 2017 at 4:27 PM, Vladimir Ozerov <voze...@gridgain.com>
wrote:

> Hi Dmitry,
>
> I do not think it is a good idea to mix failures of different threads into a
> single event type.
> - Practice shows that the most common source of problems is the exchange
> thread
> - if the disco worker has died, the node will be excluded from the topology
> safely
> - "nio-acceptor" can be spawned from multiple places where GridNioServer is
> started, and not all of them are critical
> - "grid-nio-worker-tcp-comm" is an internal thread which doesn't do any
> complex processing, so the risk of it crashing is minimal
>
> We could track most of them, but the death of different threads may call
> for different actions on the user's side. So I propose to start with the
> exchange thread only for now.
>
> Another important point is that FailureProcessingPolicy should get enough
> information about what happened in order to decide how to react. E.g., as I
> explained earlier, IgniteOutOfMemoryException *is not a critical error*.
> Nasty, but not deadly. And the node should not be stopped blindly in
> response to this event.
>
> Vladimir.
>
>
> On Fri, Dec 1, 2017 at 3:50 AM, Denis Magda <dma...@apache.org> wrote:
>
> > Hi Dmitriy,
> >
> > I’m totally for the FailureProcessingPolicy addition to
> > IgniteConfiguration.
> >
> > Apart from this, may I ask you to create the corresponding documentation
> > tickets for the 2.4 release and the “documentation” component? Only for the
> > improvements that are getting into the next release. Basically, you can
> > aggregate them if it helps. Feel free to assign the ticket to me right away.
> >
> > —
> > Denis
> >
> > > On Nov 30, 2017, at 10:31 AM, Дмитрий Сорокин <sbt.sorokin....@gmail.com> wrote:
> > >
> > > Hi, Igniters!
> > >
> > > We have a set of internal problems that require a graceful node shutdown
> > > or another configured reaction (see the discussion thread
> > > http://apache-ignite-developers.2346864.n4.nabble.com/Ignite-Enhancement-Proposal-7-Internal-problems-detection-td24460.html
> > > ):
> > > - IgniteOutOfMemoryException -
> > > https://issues.apache.org/jira/browse/IGNITE-6892
> > > - Persistence errors -
> > > https://issues.apache.org/jira/browse/IGNITE-6891
> > > - ExchangeWorker exits with error -
> > > https://issues.apache.org/jira/browse/IGNITE-6890
> > >
> > > First, I propose to reframe the 3rd problem as "System worker exits while
> > > the node is still running (the node stopping process has not been
> > > started)", because we have at least 5 worker classes that must keep
> > > running for the node to work.
> > >
> > > These workers are:
> > > - partition-exchanger (ExchangeWorker)
> > > - disco-event-worker
> > > - nio-acceptor
> > > - grid-nio-worker-tcp-comm-*
> > > - grid-timeout-worker
> > >
> > > Second, I propose to use FailureProcessingPolicy (already implemented in
> > > the scope of task IGNITE-6890) to define the reaction to the 1st and 2nd
> > > detected problems too. This policy can be configured in
> > > IgniteConfiguration, similarly to SegmentationPolicy.
> > >
> > > Opinions?
> >
> >
>
