Hi Sergey,

This is probably the most important IEP we have. I am assuming that once this is fixed, an Ignite cluster should never end up in a frozen state.

I propose to name the enum *PmeStopPolicy*. Here are my suggestions:

- NONE - will result in logging the state
- STOP_PRESERVE_PARTITIONS - nodes will be stopped, as long as every
  partition has at least one copy in the cluster
- STOP_ALL - all frozen nodes will be stopped; if partitions are lost,
  the cluster will enter read-only state and will not serve data for the
  lost partitions.

I also have some questions:

- Does this policy apply only to the server nodes, or to client nodes as
  well?
- Can the nodes be automatically restarted?

D.

On Thu, Jun 21, 2018 at 5:14 AM, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:

> Igniters,
>
> I've created new IEP [1] to address important case when Partition Map
> Exchange process (for more info on it refer to [2]) hangs for some reason.
>
> If this happens user now has to manually identify nodes causing PME to hang
> and do necessary actions (usually it is enough to stop hanging nodes to
> unblock PME).
>
> Identification and stopping of nodes blocking PME can be done automatically
> by coordinator node, three scenarios are already described in corresponding
> tickets on IEP page.
> But when stopping nodes we should remember about chance of losing
> partitions: if nodes identified to be blocking PME hold all copies of a
> partition, partition will be lost if coordinator decides to stop all nodes
> unconditionally.
>
> To give user a choice I propose to add to configuration new policy:
> PMEHangResolvePolicy with three options:
>
>    - LOG_NOTIFICATION: coordinator doesn't do any actions but logs clear
>    message with information about hanging nodes and suggestions of how to
>    fix the situation;
>    - STOP_NODES_PARTITION_LOSS_SAFE: coordinator stops hanging nodes only
>    after it checks affinity distribution and makes sure no partitions will
>    be lost;
>    - STOP_ALL_HANGING_NODES: coordinator stops all hanging nodes
>    unconditionally not making any checks against affinity distribution, so
>    partition loss may happen.
>
>
> What does community think of proposed change? Are there any additional
> cases not covered by tickets or comments about new policy?
>
> Thanks,
> Sergey.
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>
> [2]
> https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
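
For illustration only, the three PmeStopPolicy values suggested in the reply above could be sketched in Java roughly as follows. Only the enum name and its three constants come from this thread. The class PmeStopPolicySketch, the helpers nodesToStop and wouldLosePartition, and the use of plain node-id strings are hypothetical and are not existing Ignite code. The sketch also assumes an all-or-nothing reading of STOP_PRESERVE_PARTITIONS: either every hanging node is stopped, or none is.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

final class PmeStopPolicySketch {
    /** Policy names as proposed in this thread. */
    enum PmeStopPolicy { NONE, STOP_PRESERVE_PARTITIONS, STOP_ALL }

    /**
     * Hypothetical helper: decides which hanging nodes the coordinator would stop.
     *
     * @param policy          configured policy
     * @param hangingNodes    ids of the nodes identified as blocking PME
     * @param partitionOwners partition id -> ids of the nodes holding a copy
     */
    static Set<String> nodesToStop(PmeStopPolicy policy,
                                   Set<String> hangingNodes,
                                   Map<Integer, Set<String>> partitionOwners) {
        switch (policy) {
            case STOP_ALL:
                // Stop unconditionally; partitions may be lost.
                return hangingNodes;
            case STOP_PRESERVE_PARTITIONS:
                // Assumption: stop nothing if any partition would lose its last copy.
                return wouldLosePartition(hangingNodes, partitionOwners)
                    ? Collections.emptySet()
                    : hangingNodes;
            default:
                // NONE: only log, stop nothing.
                return Collections.emptySet();
        }
    }

    /** True if some partition has all of its current copies on hanging nodes. */
    static boolean wouldLosePartition(Set<String> hangingNodes,
                                      Map<Integer, Set<String>> partitionOwners) {
        for (Set<String> owners : partitionOwners.values()) {
            // Partitions that already have no owner are ignored here.
            if (!owners.isEmpty() && hangingNodes.containsAll(owners))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> owners =
            Map.of(0, Set.of("n1", "n2"), 1, Set.of("n2", "n3"));

        // Stopping n1 keeps a copy of every partition; stopping n2 and n3 would lose partition 1.
        System.out.println(nodesToStop(PmeStopPolicy.STOP_PRESERVE_PARTITIONS, Set.of("n1"), owners));
        System.out.println(nodesToStop(PmeStopPolicy.STOP_PRESERVE_PARTITIONS, Set.of("n2", "n3"), owners));
    }
}
```

Whether the safe policy should instead stop the largest subset of hanging nodes that still preserves every partition is an open design choice that this sketch does not try to settle.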