Hello, Sergey! Thanks for the proposal. Let's create an IEP for this feature.
> On Aug 27, 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com>
> wrote:
>
> Hello Igniters,
>
> I want to start a discussion about a new supporting feature that could be
> very useful in many scenarios where persistent storage is involved:
> Maintenance Mode.
>
> *Summary*
> Maintenance Mode (MM for short) is a special state of an Ignite node in
> which the node neither serves user requests nor joins the cluster, but
> waits for user commands or performs automatic actions for maintenance
> purposes.
>
> *Motivation*
> There are situations when a node cannot participate in regular operations
> but at the same time should not be shut down.
>
> One example is ticket [1], where I developed the first draft of
> Maintenance Mode.
> There a node has a potentially corrupted PDS and thus cannot proceed with
> the restore routine and join the cluster as usual.
> At the same time the node should neither fail nor be stopped for manual
> cleanup.
> Manual cleanup is not always an option (e.g. restricted access to the file
> system); in managed environments a failed node is restarted automatically,
> so the user won't have time to perform the necessary operations.
> Thus the node needs to function in a special mode that allows the user to
> connect to it and perform the necessary actions.
>
> Another example is described in IEP-47 [2], where defragmentation is being
> developed. A node defragmenting its PDS should not join the cluster until
> the process is finished, so it needs to enter Maintenance Mode as well.
>
> *Suggested design*
> I suggest MM to work as follows:
> 1. A node enters MM if special markers are found on disk. These markers,
> called Maintenance Records, could be created automatically (e.g. when a
> storage component detects corrupted storage) or by user request (when the
> user requests defragmentation of some caches). So entering MM requires a
> node restart.
> 2. A node started in MM doesn't join the cluster but finishes its startup
> routine, so it is able to receive commands and provide metrics to the
> user.
> 3. When all necessary maintenance operations are finished, the Maintenance
> Records for these operations are deleted from disk and the node is
> restarted again to enter normal service.
>
> *Example*
> To put it into context, let's consider an example of how I see the MM
> workflow in the case of PDS corruption.
>
> 1. A node has failed in the middle of a checkpoint while WAL is disabled
> for a particular cache -> data files of the cache are potentially
> corrupted.
> 2. On the next startup the node detects this situation, creates a
> Maintenance Record on disk and shuts down.
> 3. On the next startup the node sees the Maintenance Record, enters
> Maintenance Mode and waits for the user to take specific actions: clean
> the potentially corrupted PDS.
> 4. When the user has taken the necessary actions, the user removes the
> Maintenance Record using the Maintenance Mode API exposed via the
> control.{sh|bat} script or JMX.
> 5. On the next startup the node returns to normal operations, as the
> reason for maintenance is fixed.
>
> I prepared a PR [3] for ticket [1] with a draft implementation. It is not
> ready to be merged to the master branch but is already fully functional
> and can be reviewed.
>
> I hope you'll share your feedback on the feature and/or any thoughts on
> the implementation.
>
> Thank you!
>
> [1] https://issues.apache.org/jira/browse/IGNITE-13366
> [2] https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> [3] https://github.com/apache/ignite/pull/8189
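
To make the marker-based workflow from the quoted design (steps 1-3) easier to discuss, here is a minimal Java sketch of how I read it. Everything in it is an assumption for illustration only: the class name, the marker file name and the method names are hypothetical and are not taken from the PR or from any existing Ignite API. It only shows the idea that a record on disk decides the startup mode, and that removing the record returns the node to normal startup on the next restart.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of the Maintenance Record workflow; not the code from the PR. */
public class MaintenanceModeSketch {
    /** Hypothetical name of the marker file (the real name is not given in the proposal). */
    public static final String RECORD_FILE = "maintenance_record.txt";

    /** Mode a node would choose on startup. */
    public enum StartupMode { NORMAL, MAINTENANCE }

    /** Creates a Maintenance Record in the node's work directory (automatically or on user request). */
    public static void registerRecord(Path workDir, String reason) throws IOException {
        Files.createDirectories(workDir);
        Files.write(workDir.resolve(RECORD_FILE), reason.getBytes(StandardCharsets.UTF_8));
    }

    /** Step 1: the node enters MM only if a record is found on disk at startup. */
    public static StartupMode resolveStartupMode(Path workDir) {
        return Files.exists(workDir.resolve(RECORD_FILE))
            ? StartupMode.MAINTENANCE
            : StartupMode.NORMAL;
    }

    /** Step 3: once maintenance is done, the record is deleted; returns true if a record existed. */
    public static boolean clearRecord(Path workDir) throws IOException {
        return Files.deleteIfExists(workDir.resolve(RECORD_FILE));
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("mm-sketch");

        // Clean start: no record on disk.
        System.out.println(resolveStartupMode(workDir));

        // Storage detects a problem and leaves a record before shutting down.
        registerRecord(workDir, "corrupted-pds");
        System.out.println(resolveStartupMode(workDir));

        // The user cleans up and removes the record; the next start is normal again.
        clearRecord(workDir);
        System.out.println(resolveStartupMode(workDir));
    }
}
```

In this sketch the mode is decided once, from what is on disk at startup, which matches the point in the proposal that both entering and leaving MM require a node restart.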