Re: [DISCUSSION] Maintenance Mode feature

Vladislav Pyatkov Mon, 31 Aug 2020 13:42:16 -0700

Hi Sergey.

As I understand it, switching to or from MM is possible only through a
manual restart of the node.
But in your example these look like technical actions that are possible
only in that particular case.
Do you plan to provide a way for a client to make the decision without
manual intervention?


For example: start the node, manually confirm an option, and then have the
node automatically resolve the conflict and return to the topology as a
stable node.
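
For reference, the restart-driven flow from the proposal quoted below (a
record on disk decides the mode at startup, and removing it lets the next
restart proceed normally) could be sketched roughly as follows. This is only
an illustration under assumptions: the class, the enum, the method names and
the marker file name are all hypothetical and are not taken from Ignite or
from the PR.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: none of these names come from Ignite or from the PR.
class MaintenanceModeSketch {
    enum StartupMode { NORMAL, MAINTENANCE }

    // Assumed marker file name standing in for a Maintenance Record.
    static final String RECORD_FILE = "maintenance_record.marker";

    // The mode is decided once, at startup, by the presence of the record.
    static StartupMode decideStartupMode(Path workDir) {
        return Files.exists(workDir.resolve(RECORD_FILE))
            ? StartupMode.MAINTENANCE
            : StartupMode.NORMAL;
    }

    // A record may be written automatically (e.g. corruption detected)
    // or on user request; it takes effect only on the next startup.
    static void registerRecord(Path workDir, String reason) throws IOException {
        Files.write(workDir.resolve(RECORD_FILE),
            reason.getBytes(StandardCharsets.UTF_8));
    }

    // Removing the record lets the next startup proceed normally.
    static boolean clearRecord(Path workDir) throws IOException {
        return Files.deleteIfExists(workDir.resolve(RECORD_FILE));
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("mm-sketch");
        System.out.println(decideStartupMode(workDir)); // no record yet
        registerRecord(workDir, "potentially corrupted PDS");
        System.out.println(decideStartupMode(workDir)); // record present
        clearRecord(workDir);
        System.out.println(decideStartupMode(workDir)); // record removed
    }
}
```

In this sketch every mode change happens only at a restart, which is the
point of the question above: an automatic path would need something other
than a manual restart to trigger the re-evaluation.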

On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov <sergey.chugu...@gmail.com>
wrote:

> Hello Ivan,
>
> Thank you for raising the good question, I didn't think of Maintenance Mode
> from that perspective.
>
> In short, Maintenance Mode isn't related to Cluster States concept.
> According to javadoc documentation of ClusterState enum [1] it is solely
> about cache operations and to some extent doesn't affect other components
> of Ignite node.
> From APIs perspective putting the methods to manage Cluster State to
> IgniteCluster interface doesn't look ideal to me but it is as it is.
>
> On the other hand Maintenance Mode as I see it will be managed through
> different APIs than a ClusterState and this difference definitely will be
> reflected in the documentation of the feature.
>
> Ignite node is a complex piece of many components interacting with each
> other, they may have different lifecycles and states; states of different
> components cannot be reduced to the lowest common denominator.
>
> However, if you have an idea of a better name for the feature, one that
> lets the user distinguish it more easily from other similar features,
> please share it with us. Personally, I very much welcome any suggestions
> that make the design more intuitive and easy to use.
>
> Thanks!
>
> [1] https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java
>
> On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin <vololo...@gmail.com>
> wrote:
>
> > Hi Sergey,
> >
> > Thank you for bringing attention to that important subject!
> >
> > My note here is about one more cluster mode. As far as I know, we
> > currently already have 3 modes (inactive, read-only, read-write)
> > and the subject is about one more. At first glance it could be
> > hard for a user to understand and use all the modes properly. Do we
> > really need the whole spectrum? Could we simplify things somehow?
> >
> > 2020-08-27 15:59 GMT+03:00, Sergey Chugunov <sergey.chugu...@gmail.com>:
> > > Hello Nikolay,
> > >
> > > Created one, available by link [1]
> > >
> > > Initially there was an intention to develop it under IEP-47 [2] and there
> > > is even a separate section for Maintenance Mode there.
> > > But it looks like this feature is useful in more cases and deserves its
> > > own IEP.
> > >
> > > [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
> > > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> > >
> > > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov <nizhi...@apache.org>
> > > wrote:
> > >
> > >> Hello, Sergey!
> > >>
> > >> Thanks for the proposal.
> > >> Let's have an IEP for this feature.
> > >>
> > >> > On Aug 27, 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
> > >> >
> > >> > Hello Igniters,
> > >> >
> > >> > I want to start a discussion about new supporting feature that could be
> > >> > very useful in many scenarios where persistent storage is involved:
> > >> > Maintenance Mode.
> > >> >
> > >> > *Summary*
> > >> > Maintenance Mode (MM for short) is a special state of Ignite node when
> > >> > node doesn't serve user requests nor joins the cluster but waits for user
> > >> > commands or performs automatic actions for maintenance purposes.
> > >> >
> > >> > *Motivation*
> > >> > There are situations when node cannot participate in regular operations
> > >> > but at the same time should not be shut down.
> > >> >
> > >> > One example is a ticket [1] where I developed the first draft of
> > >> > Maintenance Mode.
> > >> > Here we get into a situation when node has potentially corrupted PDS thus
> > >> > cannot proceed with restore routine and join the cluster as usual.
> > >> > At the same time node should not fail nor be stopped for manual cleanup.
> > >> > Manual cleanup is not always an option (e.g. restricted access to file
> > >> > system); in managed environments failed node will be restarted
> > >> > automatically so user won't have time for performing necessary operations.
> > >> > Thus node needs to function in a special mode allowing user to connect to
> > >> > it and perform necessary actions.
> > >> >
> > >> > Another example is described in IEP-47 [2] where defragmentation is being
> > >> > developed. Node defragmenting its PDS should not join the cluster until
> > >> > the process is finished so it needs to enter Maintenance Mode as well.
> > >> >
> > >> > *Suggested design*
> > >> > I suggest MM to work as follows:
> > >> > 1. Node enters MM if special markers are found on disk. These markers
> > >> > called Maintenance Records could be created automatically (e.g. when
> > >> > storage component detects corrupted storage) or by user request (when
> > >> > user requests defragmentation of some caches). So entering MM requires
> > >> > node restart.
> > >> > 2. Started in MM node doesn't join the cluster but finishes startup
> > >> > routine so it is able to receive commands and provide metrics to the user.
> > >> > 3. When all necessary maintenance operations are finished, Maintenance
> > >> > Records for these operations are deleted from disk and node restarted
> > >> > again to enter normal service.
> > >> >
> > >> > *Example*
> > >> > To put it into a context let's consider an example of how I see the MM
> > >> > workflow in case of PDS corruption.
> > >> >
> > >> >   1. Node has failed in the middle of checkpoint when WAL is disabled for
> > >> >   a particular cache -> data files of the cache are potentially corrupted.
> > >> >   2. On next startup node detects this situation, creates Maintenance
> > >> >   Record on disk and shuts down.
> > >> >   3. On next startup node sees Maintenance Record, enters Maintenance Mode
> > >> >   and waits for user to do specific actions: clean potentially corrupted
> > >> >   PDS.
> > >> >   4. When user has done necessary actions he/she removes Maintenance
> > >> >   Record using Maintenance Mode API exposed via control.{sh|bat} script or
> > >> >   JMX.
> > >> >   5. On next startup node goes to normal operations as maintenance reason
> > >> >   is fixed.
> > >> >
> > >> >
> > >> > I prepared a PR [3] for ticket [1] with draft implementation. It is not
> > >> > ready to be merged to master branch but is already fully functional and
> > >> > can be reviewed.
> > >> >
> > >> > Hope you'll share your feedback on the feature and/or any thoughts on
> > >> > implementation.
> > >> >
> > >> > Thank you!
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> > >> > [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> > >> > [3] https://github.com/apache/ignite/pull/8189
> > >>
> > >>
> > >
> >
> >
> > --
> >
> > Best regards,
> > Ivan Pavlukhin
> >
>


-- 
Vladislav Pyatkov
