Re: [DISCUSSION] Maintenance Mode feature

Ivan Pavlukhin Tue, 01 Sep 2020 04:07:21 -0700

Sergey,

Actually, I missed the point that the discussed mode affects a single
node rather than the whole cluster. Perhaps I mixed up the terms "mode"
and "state".


My next thoughts about maintenance routines are about special
utilities. As far as I remember, MySQL provides a bunch of scripts for
various maintenance purposes. What user interface is assumed for
executing maintenance tasks? And what do we mean by "starting" a node
in maintenance mode? Can we run some routines without "starting" it
(e.g. try to recover the PDS or clean up)?
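
To make the question about "starting" concrete, here is a minimal sketch of
the marker-driven startup decision from the proposal quoted below. It is only
an illustration under assumptions: the file name, the function names and the
single-file record layout are hypothetical and are not taken from the Ignite
code base or from the draft PR.

```python
import tempfile
from enum import Enum
from pathlib import Path

# Hypothetical marker file name; the real record format and location are
# defined by the implementation, not by this sketch.
RECORD_FILE = "maintenance_record"


class StartupMode(Enum):
    NORMAL = "normal"
    MAINTENANCE = "maintenance"


def register_record(work_dir: Path, reason: str) -> None:
    """Persist a Maintenance Record; it only takes effect on the next start."""
    (work_dir / RECORD_FILE).write_text(reason)


def remove_record(work_dir: Path) -> None:
    """Delete the record once maintenance is done (no-op if it is absent)."""
    (work_dir / RECORD_FILE).unlink(missing_ok=True)


def startup_mode(work_dir: Path) -> StartupMode:
    """Decide the mode at startup: a record on disk means Maintenance Mode."""
    if (work_dir / RECORD_FILE).exists():
        return StartupMode.MAINTENANCE
    return StartupMode.NORMAL


def demo() -> list:
    """Replay the corrupted-PDS example: detect, restart, clean up, restart."""
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        modes = [startup_mode(work_dir)]            # no record yet
        register_record(work_dir, "corrupted PDS")  # problem detected, node stops
        modes.append(startup_mode(work_dir))        # next start sees the record
        remove_record(work_dir)                     # user finished the cleanup
        modes.append(startup_mode(work_dir))        # next start is a normal one
        return modes
```

In this sketch the mode is decided only once, at startup, which is why each
transition into or out of maintenance needs a restart.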

2020-08-31 23:41 GMT+03:00, Vladislav Pyatkov <vldpyat...@gmail.com>:
> Hi Sergey.
>
> As I understand, any switching to or from MM is possible only through a
> manual restart of a node.
> But in your example these look like technical actions that are only possible
> in that case.
> Do you plan to provide a way for the user to make a
> decision without manual intervention?
>
> For example: start the node, manually agree to an option, and then have it
> automatically resolve the conflict and return to the topology as a stable node.
>
> On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov <sergey.chugu...@gmail.com>
> wrote:
>
>> Hello Ivan,
>>
>> Thank you for raising the good question, I didn't think of Maintenance
>> Mode
>> from that perspective.
>>
>> In short, Maintenance Mode isn't related to the Cluster States concept.
>> According to the javadoc of the ClusterState enum [1], it is solely
>> about cache operations and to some extent doesn't affect other components
>> of an Ignite node.
>> From the API perspective, putting the methods that manage Cluster State into
>> the IgniteCluster interface doesn't look ideal to me, but it is what it is.
>>
>> On the other hand, Maintenance Mode as I see it will be managed through
>> different APIs than ClusterState, and this difference will definitely be
>> reflected in the documentation of the feature.
>>
>> An Ignite node is a complex system of many components interacting with each
>> other; they may have different lifecycles and states, and the states of
>> different components cannot be reduced to the lowest common denominator.
>>
>> However, if you have an idea for a better name for the feature, one that lets
>> the user distinguish it more easily from other similar features, please share
>> it with us. Personally, I very much welcome any suggestions that make the
>> design more intuitive and easy to use.
>>
>> Thanks!
>>
>> [1]
>>
>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java
>>
>> On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin <vololo...@gmail.com>
>> wrote:
>>
>> > Hi Sergey,
>> >
>> > Thank you for bringing attention to that important subject!
>> >
>> > My note here is about one more cluster mode. As far as I know, we
>> > currently already have 3 modes (inactive, read-only, read-write)
>> > and the subject is about one more. At first glance it could be
>> > hard for a user to understand and use all the modes properly. Do we really
>> > need the whole spectrum? Could we simplify things somehow?
>> >
>> > 2020-08-27 15:59 GMT+03:00, Sergey Chugunov
>> > <sergey.chugu...@gmail.com>:
>> > > Hello Nikolay,
>> > >
>> > > Created one; it is available at link [1].
>> > >
>> > > Initially there was an intention to develop it under IEP-47 [2], and there
>> > > is even a separate section for Maintenance Mode there.
>> > > But it looks like this feature is useful in more cases and deserves its own
>> > > IEP.
>> > >
>> > > [1]
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
>> > > [2]
>> > >
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>> > >
>> > > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov
>> > > <nizhi...@apache.org>
>> > > wrote:
>> > >
>> > >> Hello, Sergey!
>> > >>
>> > >> Thanks for the proposal.
>> > >> Let’s have an IEP for this feature.
>> > >>
>> > >> > On 27 Aug 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
>> > >> >
>> > >> > Hello Igniters,
>> > >> >
>> > >> > I want to start a discussion about a new supporting feature that could be
>> > >> > very useful in many scenarios where persistent storage is involved:
>> > >> > Maintenance Mode.
>> > >> >
>> > >> > *Summary*
>> > >> > Maintenance Mode (MM for short) is a special state of an Ignite node in
>> > >> > which the node neither serves user requests nor joins the cluster, but
>> > >> > waits for user commands or performs automatic actions for maintenance
>> > >> > purposes.
>> > >> >
>> > >> > *Motivation*
>> > >> > There are situations when a node cannot participate in regular operations
>> > >> > but at the same time should not be shut down.
>> > >> >
>> > >> > One example is ticket [1], where I developed the first draft of
>> > >> > Maintenance Mode.
>> > >> > Here we get into a situation where a node has a potentially corrupted PDS
>> > >> > and thus cannot proceed with the restore routine and join the cluster as
>> > >> > usual.
>> > >> > At the same time, the node should neither fail nor be stopped for manual
>> > >> > cleanup.
>> > >> > Manual cleanup is not always an option (e.g. restricted access to the
>> > >> > file system); in managed environments a failed node will be restarted
>> > >> > automatically, so the user won't have time to perform the necessary
>> > >> > operations.
>> > >> > Thus the node needs to function in a special mode that allows the user to
>> > >> > connect to it and perform the necessary actions.
>> > >> >
>> > >> > Another example is described in IEP-47 [2], where defragmentation is
>> > >> > being developed. A node defragmenting its PDS should not join the cluster
>> > >> > until the process is finished, so it needs to enter Maintenance Mode as
>> > >> > well.
>> > >> >
>> > >> > *Suggested design*
>> > >> > I suggest that MM work as follows:
>> > >> > 1. A node enters MM if special markers are found on disk. These markers,
>> > >> > called Maintenance Records, could be created automatically (e.g. when the
>> > >> > storage component detects corrupted storage) or by user request (when the
>> > >> > user requests defragmentation of some caches). So entering MM requires a
>> > >> > node restart.
>> > >> > 2. A node started in MM doesn't join the cluster but finishes the startup
>> > >> > routine, so it is able to receive commands and provide metrics to the user.
>> > >> > 3. When all necessary maintenance operations are finished, the Maintenance
>> > >> > Records for these operations are deleted from disk and the node is
>> > >> > restarted again to enter normal service.
>> > >> >
>> > >> > *Example*
>> > >> > To put it into context, let's consider an example of how I see the MM
>> > >> > workflow in case of PDS corruption.
>> > >> >
>> > >> >   1. A node has failed in the middle of a checkpoint while WAL is disabled
>> > >> >   for a particular cache -> data files of the cache are potentially
>> > >> >   corrupted.
>> > >> >   2. On the next startup the node detects this situation, creates a
>> > >> >   Maintenance Record on disk and shuts down.
>> > >> >   3. On the next startup the node sees the Maintenance Record, enters
>> > >> >   Maintenance Mode and waits for the user to take specific actions: clean
>> > >> >   the potentially corrupted PDS.
>> > >> >   4. When the user has done the necessary actions, he/she removes the
>> > >> >   Maintenance Record using the Maintenance Mode API exposed via the
>> > >> >   control.{sh|bat} script or JMX.
>> > >> >   5. On the next startup the node goes back to normal operations, as the
>> > >> >   maintenance reason is fixed.
>> > >> >
>> > >> >
>> > >> > I prepared a PR [3] for ticket [1] with a draft implementation. It is not
>> > >> > ready to be merged into the master branch, but it is already fully
>> > >> > functional and can be reviewed.
>> > >> >
>> > >> > I hope you'll share your feedback on the feature and/or any thoughts on
>> > >> > the implementation.
>> > >> >
>> > >> > Thank you!
>> > >> >
>> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
>> > >> > [2]
>> > >> >
>> > >>
>> >
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>> > >> > [3] https://github.com/apache/ignite/pull/8189
>> > >>
>> > >>
>> > >
>> >
>> >
>> > --
>> >
>> > Best regards,
>> > Ivan Pavlukhin
>> >
>>
>
>
> --
> Vladislav Pyatkov
>


-- 

Best regards,
Ivan Pavlukhin
