Re: IEP-25: Partition Map Exchange hangs resolving

Sergey Chugunov Fri, 22 Jun 2018 06:14:50 -0700

Dmitriy,

Answering your questions:


1) the policy applies only to server nodes, as the coordinator never waits
for clients in the PME protocol;
2) we cannot restart a hanging node automatically, only stop it. Restarting
the node should be the responsibility of a monitoring system or the end user.
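
To illustrate how the three options from the proposal quoted below might
behave on the coordinator, here is a minimal Java sketch. It is not Ignite
code: the class name, the method names, and the representation of affinity
as a map from partition id to owner node ids are all assumptions made for
illustration. It also assumes the "safe" option is all-or-nothing (either
every hanging node is stopped or none is), which the proposal does not
specify.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; names and data layout are hypothetical. */
final class PmeHangResolveSketch {
    /** Mirrors the three options named in the proposal. */
    enum PMEHangResolvePolicy {
        LOG_NOTIFICATION,
        STOP_NODES_PARTITION_LOSS_SAFE,
        STOP_ALL_HANGING_NODES
    }

    /**
     * Returns true if every partition that currently has owners would still
     * have at least one owner after all hanging nodes are stopped.
     */
    static boolean isPartitionLossSafe(Map<Integer, Set<String>> owners,
                                       Set<String> hanging) {
        for (Set<String> nodes : owners.values()) {
            // A partition is lost if all of its current owners are hanging.
            if (!nodes.isEmpty() && hanging.containsAll(nodes))
                return false;
        }
        return true;
    }

    /** Decides which hanging nodes would be stopped under the given policy. */
    static Set<String> nodesToStop(PMEHangResolvePolicy policy,
                                   Map<Integer, Set<String>> owners,
                                   Set<String> hanging) {
        switch (policy) {
            case LOG_NOTIFICATION:
                // Only log; nothing is stopped.
                return Collections.emptySet();
            case STOP_NODES_PARTITION_LOSS_SAFE:
                // Assumption: stop all hanging nodes or none at all.
                return isPartitionLossSafe(owners, hanging)
                    ? hanging
                    : Collections.<String>emptySet();
            case STOP_ALL_HANGING_NODES:
                // Unconditional stop; partitions may be lost.
                return hanging;
            default:
                throw new AssertionError(policy);
        }
    }
}
```

In this sketch, clients never appear in the set of hanging nodes, which
matches point 1 above, and restart is left out entirely, matching point 2.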

--
Thanks,
Sergey.

On Thu, Jun 21, 2018 at 4:21 PM Dmitriy Setrakyan <dsetrak...@apache.org>
wrote:

> Hi Sergey,
>
> This is probably the most important IEP we have. I am assuming that after
> this gets fixed, an Ignite cluster will never end up in a frozen state.
>
> I propose to name the enum *PmeStopPolicy*. Here are my suggestions:
>
>    - NONE - will result in logging the state
>    - STOP_PRESERVE_PARTITIONS - nodes will be stopped, as long as every
>    partition has at least one copy in the cluster
>    - STOP_ALL - all frozen nodes will be stopped; if partitions are lost,
>    the cluster will enter a read-only state and will not serve data for the
>    lost partitions.
>
> I also have some questions:
>
> - Does this policy apply only to the server nodes, or to client nodes as
> well?
> - Can the nodes be automatically restarted?
>
> D.
>
>
> On Thu, Jun 21, 2018 at 5:14 AM, Sergey Chugunov <sergey.chugu...@gmail.com>
> wrote:
>
> > Igniters,
> >
> > I've created a new IEP [1] to address the important case when the
> > Partition Map Exchange process (for more info on it, refer to [2]) hangs
> > for some reason.
> >
> > If this happens, the user currently has to manually identify the nodes
> > causing PME to hang and take the necessary actions (usually it is enough
> > to stop the hanging nodes to unblock PME).
> >
> > Identifying and stopping the nodes that block PME can be done
> > automatically by the coordinator node; three scenarios are already
> > described in the corresponding tickets on the IEP page.
> > But when stopping nodes we should keep in mind the chance of losing
> > partitions: if the nodes identified as blocking PME hold all copies of a
> > partition, that partition will be lost if the coordinator decides to stop
> > all nodes unconditionally.
> >
> > To give the user a choice, I propose adding a new policy to the
> > configuration, PMEHangResolvePolicy, with three options:
> >
> >    - LOG_NOTIFICATION: the coordinator takes no action but logs a clear
> >    message with information about the hanging nodes and suggestions on
> >    how to fix the situation;
> >    - STOP_NODES_PARTITION_LOSS_SAFE: the coordinator stops hanging nodes
> >    only after it checks the affinity distribution and makes sure no
> >    partitions will be lost;
> >    - STOP_ALL_HANGING_NODES: the coordinator stops all hanging nodes
> >    unconditionally, without any checks against the affinity distribution,
> >    so partition loss may happen.
> >
> >
> > What does the community think of the proposed change? Are there any
> > additional cases not covered by the tickets, or comments about the new
> > policy?
> >
> > Thanks,
> > Sergey.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
> >
> > [2]
> > https://cwiki.apache.org/confluence/display/IGNITE/%28Partition+Map%29+Exchange+-+under+the+hood
> >
>
