Hi Sergey.

As I understand it, switching a node into or out of MM is possible only through a manual restart of the node. But in your example these look like technical actions that are possible only in that particular case. Do you plan to give the client a way to make the decision without manual intervention?
For example: start the node, manually agree to an option, and then let the node resolve the conflict automatically and return to the topology as a stable node.

On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov <sergey.chugu...@gmail.com> wrote:

> Hello Ivan,
>
> Thank you for raising the good question, I didn't think of Maintenance Mode
> from that perspective.
>
> In short, Maintenance Mode isn't related to Cluster States concept.
> According to javadoc documentation of ClusterState enum [1] it is solely
> about cache operations and to some extent doesn't affect other components
> of Ignite node.
> From APIs perspective putting the methods to manage Cluster State to
> IgniteCluster interface doesn't look ideal to me but it is as it is.
>
> On the other hand Maintenance Mode as I see it will be managed through
> different APIs than a ClusterState and this difference definitely will be
> reflected in the documentation of the feature.
>
> Ignite node is a complex piece of many components interacting with each
> other, they may have different lifecycles and states; states of different
> components cannot be reduced to the lowest common denominator.
>
> However if you have an idea of how to call the feature better to let the
> user easier distinguish it from other similar features please share it with
> us. Personally I'm very welcome to any suggestions that make design more
> intuitive and easy-to-use.
>
> Thanks!
>
> [1]
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java
>
> On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin <vololo...@gmail.com>
> wrote:
>
> > Hi Sergey,
> >
> > Thank you for bringing attention to that important subject!
> >
> > My note here is about one more cluster mode. As far as I know
> > currently we already have 3 modes (inactive, read-only, read-write)
> > and the subject is about one more. From the first glance it could be
> > hard for a user to understand and use all modes properly. Do we really
> > need all spectrum? Could we simplify things somehow?
> >
> > 2020-08-27 15:59 GMT+03:00, Sergey Chugunov <sergey.chugu...@gmail.com>:
> > > Hello Nikolay,
> > >
> > > Created one, available by link [1]
> > >
> > > Initially there was an intention to develop it under IEP-47 [2] and
> > > there is even a separate section for Maintenance Mode there.
> > > But it looks like this feature is useful in more cases and deserves
> > > its own IEP.
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
> > > [2]
> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> > >
> > > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov <nizhi...@apache.org>
> > > wrote:
> > >
> > >> Hello, Sergey!
> > >>
> > >> Thanks for the proposal.
> > >> Let’s have IEP for this feature.
> > >>
> > >> > On Aug 27, 2020, at 10:25, Sergey Chugunov
> > >> > <sergey.chugu...@gmail.com> wrote:
> > >> >
> > >> > Hello Igniters,
> > >> >
> > >> > I want to start a discussion about new supporting feature that
> > >> > could be very useful in many scenarios where persistent storage is
> > >> > involved: Maintenance Mode.
> > >> >
> > >> > *Summary*
> > >> > Maintenance Mode (MM for short) is a special state of Ignite node
> > >> > when node doesn't serve user requests nor joins the cluster but
> > >> > waits for user commands or performs automatic actions for
> > >> > maintenance purposes.
> > >> >
> > >> > *Motivation*
> > >> > There are situations when node cannot participate in regular
> > >> > operations but at the same time should not be shut down.
> > >> >
> > >> > One example is a ticket [1] where I developed the first draft of
> > >> > Maintenance Mode.
> > >> > Here we get into a situation when node has potentially corrupted
> > >> > PDS thus cannot proceed with restore routine and join the cluster
> > >> > as usual.
> > >> > At the same time node should not fail nor be stopped for manual
> > >> > cleanup.
> > >> > Manual cleanup is not always an option (e.g. restricted access to
> > >> > file system); in managed environments failed node will be restarted
> > >> > automatically so user won't have time for performing necessary
> > >> > operations.
> > >> > Thus node needs to function in a special mode allowing user to
> > >> > connect to it and perform necessary actions.
> > >> >
> > >> > Another example is described in IEP-47 [2] where defragmentation is
> > >> > being developed. Node defragmenting its PDS should not join the
> > >> > cluster until the process is finished so it needs to enter
> > >> > Maintenance Mode as well.
> > >> >
> > >> > *Suggested design*
> > >> > I suggest MM to work as follows:
> > >> > 1. Node enters MM if special markers are found on disk. These
> > >> > markers called Maintenance Records could be created automatically
> > >> > (e.g. when storage component detects corrupted storage) or by user
> > >> > request (when user requests defragmentation of some caches). So
> > >> > entering MM requires node restart.
> > >> > 2. Started in MM node doesn't join the cluster but finishes startup
> > >> > routine so it is able to receive commands and provide metrics to
> > >> > the user.
> > >> > 3. When all necessary maintenance operations are finished,
> > >> > Maintenance Records for these operations are deleted from disk and
> > >> > node restarted again to enter normal service.
> > >> >
> > >> > *Example*
> > >> > To put it into a context let's consider an example of how I see the
> > >> > MM workflow in case of PDS corruption.
> > >> >
> > >> >    1. Node has failed in the middle of checkpoint when WAL is
> > >> >    disabled for a particular cache -> data files of the cache are
> > >> >    potentially corrupted.
> > >> >    2. On next startup node detects this situation, creates
> > >> >    Maintenance Record on disk and shuts down.
> > >> >    3. On next startup node sees Maintenance Record, enters
> > >> >    Maintenance Mode and waits for user to do specific actions:
> > >> >    clean potentially corrupted PDS.
> > >> >    4. When user has done necessary actions he/she removes
> > >> >    Maintenance Record using Maintenance Mode API exposed via
> > >> >    control.{sh|bat} script or JMX.
> > >> >    5. On next startup node goes to normal operations as maintenance
> > >> >    reason is fixed.
> > >> >
> > >> > I prepared a PR [3] for ticket [1] with draft implementation. It is
> > >> > not ready to be merged to master branch but is already fully
> > >> > functional and can be reviewed.
> > >> >
> > >> > Hope you'll share your feedback on the feature and/or any thoughts
> > >> > on implementation.
> > >> >
> > >> > Thank you!
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
> > >> > [2]
> > >> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
> > >> > [3] https://github.com/apache/ignite/pull/8189
> >
> > --
> > Best regards,
> > Ivan Pavlukhin

--
Vladislav Pyatkov