Hi Sergey,

This is probably the most important IEP we have. I am assuming that once
this is fixed, an Ignite cluster will never end up in a frozen state.

I propose to name the enum *PmeStopPolicy*. Here are my suggestions:

   - NONE - no nodes are stopped; the state is only logged
   - STOP_PRESERVE_PARTITIONS - frozen nodes will be stopped, as long as
   every partition keeps at least one copy in the cluster
   - STOP_ALL - all frozen nodes will be stopped; if partitions are lost,
   the cluster will enter a read-only state and will not serve data for the
   lost partitions.
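
To make the difference between the three options concrete, here is a minimal
sketch in Java. It is only an illustration under stated assumptions: the enum
does not exist in Ignite, the nodesToStop and losesPartition helpers and the
map of partition owners are hypothetical, and I assume that
STOP_PRESERVE_PARTITIONS is all-or-nothing (either every frozen node is
stopped or none is).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class PmeStopPolicySketch {
    /** Policy names as proposed in this mail; not an existing Ignite type. */
    enum PmeStopPolicy { NONE, STOP_PRESERVE_PARTITIONS, STOP_ALL }

    /**
     * Returns true if stopping all frozen nodes would leave some partition
     * without any copy. A partition with no owners at all counts as lost.
     *
     * @param owners hypothetical input: partition id -> IDs of nodes holding a copy
     * @param frozen IDs of the nodes identified as blocking PME
     */
    static boolean losesPartition(Map<Integer, Set<String>> owners, Set<String> frozen) {
        for (Set<String> partitionOwners : owners.values()) {
            if (frozen.containsAll(partitionOwners))
                return true; // every copy of this partition lives on a frozen node
        }
        return false;
    }

    /** Decides which frozen nodes get stopped under the given policy. */
    static Set<String> nodesToStop(PmeStopPolicy policy,
                                   Map<Integer, Set<String>> owners,
                                   Set<String> frozen) {
        if (policy == PmeStopPolicy.STOP_ALL)
            return new TreeSet<>(frozen); // unconditional, partitions may be lost

        if (policy == PmeStopPolicy.STOP_PRESERVE_PARTITIONS && !losesPartition(owners, frozen))
            return new TreeSet<>(frozen); // safe: every partition keeps a copy

        return Collections.emptySet(); // NONE, or stopping would lose a partition
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> owners = new HashMap<>();
        owners.put(0, new HashSet<>(Arrays.asList("A", "B")));
        owners.put(1, new HashSet<>(Arrays.asList("B", "C")));

        Set<String> onlyA = new HashSet<>(Arrays.asList("A"));
        Set<String> aAndB = new HashSet<>(Arrays.asList("A", "B"));

        // Partition 0 still has a copy on B, so A can be stopped.
        System.out.println(nodesToStop(PmeStopPolicy.STOP_PRESERVE_PARTITIONS, owners, onlyA)); // [A]
        // Both copies of partition 0 are on frozen nodes, so nothing is stopped.
        System.out.println(nodesToStop(PmeStopPolicy.STOP_PRESERVE_PARTITIONS, owners, aAndB)); // []
        // STOP_ALL ignores the check.
        System.out.println(nodesToStop(PmeStopPolicy.STOP_ALL, owners, aAndB)); // [A, B]
    }
}
```

The sketch leaves out the logging for NONE and the switch to a read-only
state after STOP_ALL; it only shows which nodes each policy would pick.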

I also have some questions:

- Does this policy apply only to the server nodes, or to client nodes as
well?
- Can the nodes be automatically restarted?

D.


On Thu, Jun 21, 2018 at 5:14 AM, Sergey Chugunov <sergey.chugu...@gmail.com>
wrote:

> Igniters,
>
> I've created a new IEP [1] to address an important case: the Partition Map
> Exchange process (for more info on it, refer to [2]) hangs for some reason.
>
> If this happens, the user currently has to manually identify the nodes
> causing PME to hang and take the necessary actions (usually it is enough to
> stop the hanging nodes to unblock PME).
>
> Identifying and stopping the nodes that block PME can be done automatically
> by the coordinator node; three scenarios are already described in the
> corresponding tickets on the IEP page.
> But when stopping nodes we should remember the chance of losing
> partitions: if the nodes identified as blocking PME hold all copies of a
> partition, that partition will be lost if the coordinator decides to stop
> all nodes unconditionally.
>
> To give the user a choice, I propose adding a new policy to the
> configuration, PMEHangResolvePolicy, with three options:
>
>    - LOG_NOTIFICATION: coordinator doesn't take any action but logs a clear
>    message with information about the hanging nodes and suggestions on how
>    to fix the situation;
>    - STOP_NODES_PARTITION_LOSS_SAFE: coordinator stops hanging nodes only
>    after it checks affinity distribution and makes sure no partitions will
>    be lost;
>    - STOP_ALL_HANGING_NODES: coordinator stops all hanging nodes
>    unconditionally, without any checks against affinity distribution, so
>    partition loss may happen.
>
>
> What does the community think of the proposed change? Are there any
> additional cases not covered by the tickets, or comments about the new policy?
>
> Thanks,
> Sergey.
>
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>
> [2]
> https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
>
