Dmitriy,

Answering your questions:
1) the policy applies only to server nodes, because the coordinator never waits for clients in the PME protocol;
2) we cannot restart a hanging node automatically, only stop it. Restarting the node should be the responsibility of a monitoring system or the end user.

--
Thanks,
Sergey.

On Thu, Jun 21, 2018 at 4:21 PM Dmitriy Setrakyan <dsetrak...@apache.org> wrote:

> Hi Sergey,
>
> This is probably the most important IEP we have. I am assuming that after
> this gets fixed, Ignite cluster will never come to a freezing state.
>
> I propose to name the enum *PmeStopPolicy*. Here are my suggestions:
>
> - NONE - will result in logging the state
> - STOP_PRESERVE_PARTITIONS - nodes will be stopped, as long as every
>   partition has at least one copy in the cluster
> - STOP_ALL - all frozen nodes will be stopped, if partitions are lost,
>   cluster will enter read-only state and will not serve data for the lost
>   partitions.
>
> I also have some questions:
>
> - Does this policy apply only to the server nodes, or to client nodes as
>   well?
> - Can the nodes be automatically restarted?
>
> D.
>
> On Thu, Jun 21, 2018 at 5:14 AM, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
>
> > Igniters,
> >
> > I've created new IEP [1] to address important case when Partition Map
> > Exchange process (for more info on it refer to [2]) hangs for some reason.
> >
> > If this happens user now has to manually identify nodes causing PME to hang
> > and do necessary actions (usually it is enough to stop hanging nodes to
> > unblock PME).
> >
> > Identification and stopping of nodes blocking PME can be done automatically
> > by coordinator node, three scenarios are already described in corresponding
> > tickets on IEP page.
> > But when stopping nodes we should remember about chance of losing
> > partitions: if nodes identified to be blocking PME hold all copies of a
> > partition, partition will be lost if coordinator decides to stop all nodes
> > unconditionally.
> >
> > To give user a choice I propose to add to configuration new policy:
> > PMEHangResolvePolicy with three options:
> >
> > - LOG_NOTIFICATION: coordinator doesn't do any actions but logs clear
> >   message with information about hanging nodes and suggestions of how to
> >   fix the situation;
> > - STOP_NODES_PARTITION_LOSS_SAFE: coordinator stops hanging nodes only
> >   after it checks affinity distribution and makes sure no partitions will
> >   be lost;
> > - STOP_ALL_HANGING_NODES: coordinator stops all hanging nodes
> >   unconditionally not making any checks against affinity distribution, so
> >   partition loss may happen.
> >
> > What does community think of proposed change? Are there any additional
> > cases not covered by tickets or comments about new policy?
> >
> > Thanks,
> > Sergey.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
> >
> > [2]
> > https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
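
To illustrate the three options proposed in this thread, here is a minimal Java sketch of how a coordinator might decide which hanging nodes to stop. Everything in it is an assumption made for illustration: the class PmeHangResolveSketch, the methods nodesToStop and wouldLosePartition, the string node ids and the partition-to-owners map are hypothetical and are not taken from the Ignite code base. In particular, STOP_NODES_PARTITION_LOSS_SAFE is modeled as all-or-nothing (no node is stopped if any partition would lose its last copy), which the thread itself does not specify.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the Ignite code base. */
final class PmeHangResolveSketch {
    /** The three options proposed in the thread. */
    enum PmeHangResolvePolicy {
        LOG_NOTIFICATION,
        STOP_NODES_PARTITION_LOSS_SAFE,
        STOP_ALL_HANGING_NODES
    }

    /**
     * Returns the ids of the nodes the coordinator would stop.
     *
     * @param policy  configured policy
     * @param hanging ids of nodes identified as blocking PME
     * @param owners  partition id -> ids of nodes holding a copy of it
     */
    static Set<String> nodesToStop(PmeHangResolvePolicy policy,
                                   Set<String> hanging,
                                   Map<Integer, Set<String>> owners) {
        if (policy == PmeHangResolvePolicy.STOP_ALL_HANGING_NODES) {
            // No affinity check: partition loss may happen.
            return new HashSet<>(hanging);
        }

        if (policy == PmeHangResolvePolicy.STOP_NODES_PARTITION_LOSS_SAFE) {
            // Assumption: all-or-nothing. Stop the hanging nodes only if
            // every partition keeps at least one copy on a healthy node.
            if (!wouldLosePartition(hanging, owners))
                return new HashSet<>(hanging);

            return Collections.<String>emptySet();
        }

        // LOG_NOTIFICATION: the coordinator only logs, nothing is stopped.
        return Collections.<String>emptySet();
    }

    /** True if some partition has all of its copies on hanging nodes. */
    static boolean wouldLosePartition(Set<String> hanging,
                                      Map<Integer, Set<String>> owners) {
        for (Set<String> partOwners : owners.values()) {
            if (!partOwners.isEmpty() && hanging.containsAll(partOwners))
                return true;
        }

        return false;
    }
}
```

With partition 0 owned by nodes A and B and partition 1 owned by B and C, a hang on A alone is safe to resolve under every stopping policy, while a hang on both A and B would be resolved only under STOP_ALL_HANGING_NODES, at the cost of losing partition 0.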