Hello, Sergey!

Thanks for the proposal.
Let’s have an IEP for this feature.

> On 27 Aug 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com>
> wrote:
> 
> Hello Igniters,
> 
> I want to start a discussion about a new supporting feature that could be
> very useful in many scenarios where persistent storage is involved:
> Maintenance Mode.
> 
> *Summary*
> Maintenance Mode (MM for short) is a special state of an Ignite node in which
> the node neither serves user requests nor joins the cluster, but waits for
> user commands or performs automatic actions for maintenance purposes.
> 
> *Motivation*
> There are situations when a node cannot participate in regular operations
> but at the same time should not be shut down.
> 
> One example is ticket [1], where I developed the first draft of
> Maintenance Mode.
> Here we get into a situation where a node has a potentially corrupted PDS and
> thus cannot proceed with the restore routine and join the cluster as usual.
> At the same time, the node should neither fail nor be stopped for manual
> cleanup. Manual cleanup is not always an option (e.g. restricted access to
> the file system); in managed environments a failed node will be restarted
> automatically, so the user won't have time to perform the necessary
> operations. Thus the node needs to function in a special mode that allows the
> user to connect to it and perform the necessary actions.
> 
> Another example is described in IEP-47 [2], where defragmentation is being
> developed. A node defragmenting its PDS should not join the cluster until the
> process is finished, so it needs to enter Maintenance Mode as well.
> 
> *Suggested design*
> I suggest that MM work as follows:
> 1. A node enters MM if special markers are found on disk. These markers,
> called Maintenance Records, could be created automatically (e.g. when a
> storage component detects corrupted storage) or by user request (when the
> user requests defragmentation of some caches). So entering MM requires a
> node restart.
> 2. A node started in MM doesn't join the cluster but finishes the startup
> routine, so it is able to receive commands and provide metrics to the user.
> 3. When all necessary maintenance operations are finished, the Maintenance
> Records for these operations are deleted from disk and the node is restarted
> again to enter normal service.
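
The three steps above can be sketched in a few lines of Java. This is only an illustration of the idea of on-disk markers, not the code from the PR: the class name, the method names and the ".mntc" file suffix are all hypothetical, and the real on-disk format of Maintenance Records is not described in this thread.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the Ignite code base. */
class MaintenanceRecordsSketch {
    /** Assumed suffix for marker files (an illustration, not the real format). */
    static final String SUFFIX = ".mntc";

    /** Creates a marker for a maintenance reason, e.g. "corrupted-pds". */
    static Path register(Path workDir, String reason) throws IOException {
        Files.createDirectories(workDir);
        Path rec = workDir.resolve(reason + SUFFIX);
        if (!Files.exists(rec))
            Files.createFile(rec);
        return rec;
    }

    /** Lists the maintenance reasons currently recorded on disk. */
    static List<String> pending(Path workDir) throws IOException {
        List<String> reasons = new ArrayList<>();
        if (!Files.isDirectory(workDir))
            return reasons;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(workDir, "*" + SUFFIX)) {
            for (Path f : files) {
                String name = f.getFileName().toString();
                reasons.add(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        return reasons;
    }

    /** Deletes the marker once the maintenance operation has finished. */
    static void clear(Path workDir, String reason) throws IOException {
        Files.deleteIfExists(workDir.resolve(reason + SUFFIX));
    }

    /** Startup decision: any pending record means the node starts in MM. */
    static boolean startInMaintenanceMode(Path workDir) throws IOException {
        return !pending(workDir).isEmpty();
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mm-sketch");
        register(dir, "corrupted-pds");
        System.out.println("start in MM: " + startInMaintenanceMode(dir));
        clear(dir, "corrupted-pds");
        System.out.println("start in MM: " + startInMaintenanceMode(dir));
    }
}
```

Because the decision is made only from what is on disk at startup, both entering and leaving MM require a restart, which matches steps 1 and 3.
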
> 
> *Example*
> To put it into context, let's consider an example of how I see the MM
> workflow in the case of PDS corruption.
> 
>   1. A node has failed in the middle of a checkpoint while WAL is disabled
>   for a particular cache -> data files of the cache are potentially
>   corrupted.
>   2. On the next startup the node detects this situation, creates a
>   Maintenance Record on disk and shuts down.
>   3. On the next startup the node sees the Maintenance Record, enters
>   Maintenance Mode and waits for the user to take specific actions: clean
>   the potentially corrupted PDS.
>   4. When the user has done the necessary actions, he/she removes the
>   Maintenance Record using the Maintenance Mode API exposed via the
>   control.{sh|bat} script or JMX.
>   5. On the next startup the node goes back to normal operations, as the
>   maintenance reason is fixed.
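
To check that I read the restart sequence correctly, here is a small in-memory model of the five steps. It is a sketch under my own assumptions: the `Disk` class only stands in for state that survives restarts, and every name in it is hypothetical rather than taken from the PR or from the control script.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the corrupted-PDS workflow; not Ignite code. */
class CorruptedPdsWorkflowSketch {
    enum Outcome { SHUT_DOWN, MAINTENANCE_MODE, NORMAL_OPERATION }

    /** Assumed name of the maintenance reason. */
    static final String CORRUPTED_PDS = "corrupted-pds";

    /** Stands in for state that survives restarts (files on disk in a real node). */
    static class Disk {
        boolean crashedDuringCheckpointWithWalDisabled;
        final Set<String> maintenanceRecords = new HashSet<>();
    }

    /** One node startup: decides what the node does based on persisted state only. */
    static Outcome start(Disk disk) {
        if (!disk.maintenanceRecords.isEmpty())
            return Outcome.MAINTENANCE_MODE;            // step 3: record found
        if (disk.crashedDuringCheckpointWithWalDisabled) {
            disk.maintenanceRecords.add(CORRUPTED_PDS); // step 2: create record
            return Outcome.SHUT_DOWN;
        }
        return Outcome.NORMAL_OPERATION;                // step 5: nothing to fix
    }

    /** Step 4: the user cleans the data files and removes the record. */
    static void cleanAndUnregister(Disk disk) {
        disk.crashedDuringCheckpointWithWalDisabled = false;
        disk.maintenanceRecords.remove(CORRUPTED_PDS);
    }

    public static void main(String[] args) {
        Disk disk = new Disk();
        disk.crashedDuringCheckpointWithWalDisabled = true; // step 1
        System.out.println(start(disk)); // prints SHUT_DOWN
        System.out.println(start(disk)); // prints MAINTENANCE_MODE
        cleanAndUnregister(disk);
        System.out.println(start(disk)); // prints NORMAL_OPERATION
    }
}
```

The model makes one property of the workflow explicit: a node with a pending record stays in Maintenance Mode across any number of restarts until the user removes the record.
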
> 
> 
> I prepared a PR [3] for ticket [1] with a draft implementation. It is not
> ready to be merged to the master branch but is already fully functional and
> can be reviewed.
> 
> I hope you'll share your feedback on the feature and/or any thoughts on
> the implementation.
> 
> Thank you!
> 
> [1] https://issues.apache.org/jira/browse/IGNITE-13366
> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> [3] https://github.com/apache/ignite/pull/8189
