Igniters,

I've created a new IEP [1] to address an important case: the Partition Map Exchange (PME) process (for more info on it, refer to [2]) hanging for some reason.
If this happens, the user currently has to manually identify the nodes causing PME to hang and take the necessary actions (usually it is enough to stop the hanging nodes to unblock PME).

Identifying and stopping the nodes that block PME can be done automatically by the coordinator node; three scenarios are already described in the corresponding tickets on the IEP page.

But when stopping nodes we should keep in mind the chance of losing partitions: if the nodes identified as blocking PME hold all copies of a partition, that partition will be lost if the coordinator decides to stop all of them unconditionally.

To give the user a choice, I propose adding a new policy to the configuration, PMEHangResolvePolicy, with three options:
- LOG_NOTIFICATION: the coordinator takes no action but logs a clear message with information about the hanging nodes and suggestions on how to fix the situation;
- STOP_NODES_PARTITION_LOSS_SAFE: the coordinator stops hanging nodes only after it checks the affinity distribution and makes sure no partitions will be lost;
- STOP_ALL_HANGING_NODES: the coordinator stops all hanging nodes unconditionally, without any checks against the affinity distribution, so partition loss may happen.

What does the community think of the proposed change? Are there any additional cases not covered by the tickets, or comments about the new policy?

Thanks,
Sergey.

[1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
[2] https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
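
To make the semantics of the three options concrete, here is a rough, self-contained Java sketch of the decision the coordinator would make. Only the policy name and its three values come from the proposal; the class PmeHangSketch, the methods wouldLosePartitions and nodesToStop, and the representation of the affinity distribution as a map from partition number to owner node IDs are hypothetical and do not correspond to any existing Ignite API. The sketch also assumes that STOP_NODES_PARTITION_LOSS_SAFE is all-or-nothing: if stopping the hanging nodes would lose any partition, no node is stopped. The proposal does not specify this, and stopping a safe subset would be another valid reading.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class PmeHangSketch {
    /** Option names are taken from the proposal; everything else is illustrative. */
    enum PMEHangResolvePolicy {
        LOG_NOTIFICATION,
        STOP_NODES_PARTITION_LOSS_SAFE,
        STOP_ALL_HANGING_NODES
    }

    /**
     * Returns true if at least one partition has all of its owners among the hanging nodes.
     * Partitions with no owners at all are skipped: stopping nodes cannot make them worse.
     */
    static boolean wouldLosePartitions(Set<String> hanging, Map<Integer, Set<String>> owners) {
        for (Set<String> partOwners : owners.values()) {
            if (!partOwners.isEmpty() && hanging.containsAll(partOwners))
                return true;
        }
        return false;
    }

    /** Decides which of the hanging nodes the coordinator would stop under the given policy. */
    static Set<String> nodesToStop(PMEHangResolvePolicy policy,
                                   Set<String> hanging,
                                   Map<Integer, Set<String>> owners) {
        switch (policy) {
            case STOP_ALL_HANGING_NODES:
                // No affinity check: partition loss is possible.
                return hanging;
            case STOP_NODES_PARTITION_LOSS_SAFE:
                // Assumption: all-or-nothing. Stop nothing if any partition would be lost.
                return wouldLosePartitions(hanging, owners)
                    ? Collections.<String>emptySet()
                    : hanging;
            case LOG_NOTIFICATION:
            default:
                // Only a log message is produced; no node is stopped.
                return Collections.<String>emptySet();
        }
    }

    public static void main(String[] args) {
        // Partition 0 lives on nodes A and B, partition 1 on nodes B and C.
        Map<Integer, Set<String>> owners = new HashMap<>();
        owners.put(0, new HashSet<>(Arrays.asList("A", "B")));
        owners.put(1, new HashSet<>(Arrays.asList("B", "C")));

        Set<String> onlyA = new HashSet<>(Arrays.asList("A"));
        Set<String> aAndB = new HashSet<>(Arrays.asList("A", "B"));

        // Stopping A alone keeps a copy of every partition, so the safe policy allows it.
        System.out.println(nodesToStop(PMEHangResolvePolicy.STOP_NODES_PARTITION_LOSS_SAFE, onlyA, owners));
        // Stopping A and B would lose partition 0, so the safe policy stops nothing.
        System.out.println(nodesToStop(PMEHangResolvePolicy.STOP_NODES_PARTITION_LOSS_SAFE, aAndB, owners));
        // The unconditional policy stops both nodes anyway.
        System.out.println(nodesToStop(PMEHangResolvePolicy.STOP_ALL_HANGING_NODES, aAndB, owners));
    }
}
```

In this sketch, the difference between the two stopping options only shows up when the hanging nodes together hold every copy of some partition, which is exactly the case the policy is meant to let the user decide on.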