IEP-25: Partition Map Exchange hangs resolving

Sergey Chugunov Thu, 21 Jun 2018 05:15:31 -0700

Igniters,

I've created a new IEP [1] to address the important case when the Partition
Map Exchange process (for more info on it, refer to [2]) hangs for some reason.


If this happens, the user currently has to manually identify the nodes
causing PME to hang and take the necessary actions (usually it is enough to
stop the hanging nodes to unblock PME).

Identifying and stopping the nodes that block PME can be done automatically
by the coordinator node; three scenarios are already described in the
corresponding tickets on the IEP page.
But when stopping nodes we should keep in mind the chance of losing
partitions: if the nodes identified as blocking PME hold all copies of a
partition, that partition will be lost if the coordinator decides to stop
all of them unconditionally.

To give the user a choice, I propose adding a new policy to the
configuration, PMEHangResolvePolicy, with three options:

   - LOG_NOTIFICATION: the coordinator takes no action but logs a clear
   message with information about the hanging nodes and suggestions on how
   to fix the situation;
   - STOP_NODES_PARTITION_LOSS_SAFE: the coordinator stops the hanging nodes
   only after it checks the affinity distribution and makes sure no
   partitions will be lost;
   - STOP_ALL_HANGING_NODES: the coordinator stops all hanging nodes
   unconditionally, without any checks against the affinity distribution, so
   partition loss may happen.
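
To make the difference between the three options concrete, here is a rough
sketch in Java of how the coordinator's decision could look. Only the policy
name and its three constants come from the proposal above; the class
PmeHangResolver, both methods, the use of plain strings as node IDs and the
partition-to-owners map are hypothetical and are not existing Ignite API. It
also assumes that in the partition-loss-safe mode the coordinator stops
nothing if stopping all hanging nodes would lose at least one partition; the
IEP may well define this differently.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch only; none of these types exist in Ignite. */
final class PmeHangResolver {
    /** The three options proposed in this message. */
    enum PMEHangResolvePolicy {
        LOG_NOTIFICATION,
        STOP_NODES_PARTITION_LOSS_SAFE,
        STOP_ALL_HANGING_NODES
    }

    private PmeHangResolver() {
        // Static helpers only.
    }

    /**
     * Returns true if at least one partition has all of its owners
     * among the hanging nodes, i.e. stopping them would lose it.
     */
    static boolean wouldLosePartition(Set<String> hanging,
                                      Map<Integer, Set<String>> owners) {
        for (Set<String> partOwners : owners.values()) {
            if (!partOwners.isEmpty() && hanging.containsAll(partOwners))
                return true;
        }
        return false;
    }

    /** Picks the nodes the coordinator would stop under the given policy. */
    static Set<String> nodesToStop(PMEHangResolvePolicy plc,
                                   Set<String> hanging,
                                   Map<Integer, Set<String>> owners) {
        switch (plc) {
            case STOP_ALL_HANGING_NODES:
                // No affinity check: partition loss is possible.
                return hanging;
            case STOP_NODES_PARTITION_LOSS_SAFE:
                // Assumption: all-or-nothing, stop only if no partition is lost.
                return wouldLosePartition(hanging, owners)
                    ? Collections.<String>emptySet()
                    : hanging;
            case LOG_NOTIFICATION:
            default:
                // Only log; the user resolves the hang manually.
                return Collections.<String>emptySet();
        }
    }
}
```

For example, if partition 0 is owned only by nodes A and B and both are
hanging, the sketch stops nothing under STOP_NODES_PARTITION_LOSS_SAFE but
stops both under STOP_ALL_HANGING_NODES.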


What does the community think of the proposed change? Are there any
additional cases not covered by the tickets, or comments on the new policy?

Thanks,
Sergey.

[1]
https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving

[2]
https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
