Sergey,

Actually, I missed the point that the mode under discussion affects a single node rather than the whole cluster. Perhaps I mixed up the terms "mode" and "state".
My next thoughts on maintenance routines concern special utilities. As far as I remember, MySQL provides a bunch of scripts for various maintenance purposes. What user interface is assumed for executing maintenance tasks? And what do we mean by "starting" a node in maintenance mode? Can we run some routines without "starting" it (e.g. trying to recover the PDS, or cleanup)?

2020-08-31 23:41 GMT+03:00, Vladislav Pyatkov <vldpyat...@gmail.com>:
> Hi Sergey.
>
> As I understand any switching from/to MM possible only through manual
> restart a node.
> But in your example that look like a technical actions, that only possible
> in the case.
> Do you plan to provide a possibility for client where he can make a
> decision without a manual intervention?
>
> For example: Start node and manually agree with an option and after
> automatically resolve conflict and back to topology as a stable node.
>
> On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov <sergey.chugu...@gmail.com>
> wrote:
>
>> Hello Ivan,
>>
>> Thank you for raising the good question, I didn't think of Maintenance
>> Mode from that perspective.
>>
>> In short, Maintenance Mode isn't related to Cluster States concept.
>> According to javadoc documentation of ClusterState enum [1] it is solely
>> about cache operations and to some extent doesn't affect other components
>> of Ignite node.
>> From APIs perspective putting the methods to manage Cluster State to
>> IgniteCluster interface doesn't look ideal to me but it is as it is.
>>
>> On the other hand Maintenance Mode as I see it will be managed through
>> different APIs than a ClusterState and this difference definitely will be
>> reflected in the documentation of the feature.
>>
>> Ignite node is a complex piece of many components interacting with each
>> other, they may have different lifecycles and states; states of different
>> components cannot be reduced to the lowest common denominator.
>>
>> However if you have an idea of how to call the feature better to let the
>> user easier distinguish it from other similar features please share it
>> with us. Personally I'm very welcome to any suggestions that make design
>> more intuitive and easy-to-use.
>>
>> Thanks!
>>
>> [1]
>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java
>>
>> On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin <vololo...@gmail.com>
>> wrote:
>>
>> > Hi Sergey,
>> >
>> > Thank you for bringing attention to that important subject!
>> >
>> > My note here is about one more cluster mode. As far as I know
>> > currently we already have 3 modes (inactive, read-only, read-write)
>> > and the subject is about one more. From the first glance it could be
>> > hard for a user to understand and use all modes properly. Do we really
>> > need all spectrum? Could we simplify things somehow?
>> >
>> > 2020-08-27 15:59 GMT+03:00, Sergey Chugunov
>> > <sergey.chugu...@gmail.com>:
>> > > Hello Nikolay,
>> > >
>> > > Created one, available by link [1]
>> > >
>> > > Initially there was an intention to develop it under IEP-47 [2] and
>> > > there is even a separate section for Maintenance Mode there.
>> > > But it looks like this feature is useful in more cases and deserves
>> > > its own IEP.
>> > >
>> > > [1]
>> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
>> > > [2]
>> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>> > >
>> > > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov
>> > > <nizhi...@apache.org> wrote:
>> > >
>> > >> Hello, Sergey!
>> > >>
>> > >> Thanks for the proposal.
>> > >> Let’s have IEP for this feature.
>> > >>
>> > >> > On 27 Aug 2020, at 10:25, Sergey Chugunov
>> > >> > <sergey.chugu...@gmail.com> wrote:
>> > >> >
>> > >> > Hello Igniters,
>> > >> >
>> > >> > I want to start a discussion about new supporting feature that
>> > >> > could be very useful in many scenarios where persistent storage
>> > >> > is involved: Maintenance Mode.
>> > >> >
>> > >> > *Summary*
>> > >> > Maintenance Mode (MM for short) is a special state of Ignite node
>> > >> > when node doesn't serve user requests nor joins the cluster but
>> > >> > waits for user commands or performs automatic actions for
>> > >> > maintenance purposes.
>> > >> >
>> > >> > *Motivation*
>> > >> > There are situations when node cannot participate in regular
>> > >> > operations but at the same time should not be shut down.
>> > >> >
>> > >> > One example is a ticket [1] where I developed the first draft of
>> > >> > Maintenance Mode.
>> > >> > Here we get into a situation when node has potentially corrupted
>> > >> > PDS thus cannot proceed with restore routine and join the cluster
>> > >> > as usual.
>> > >> > At the same time node should not fail nor be stopped for manual
>> > >> > cleanup.
>> > >> > Manual cleanup is not always an option (e.g. restricted access to
>> > >> > file system); in managed environments failed node will be
>> > >> > restarted automatically so user won't have time for performing
>> > >> > necessary operations.
>> > >> > Thus node needs to function in a special mode allowing user to
>> > >> > connect to it and perform necessary actions.
>> > >> >
>> > >> > Another example is described in IEP-47 [2] where defragmentation
>> > >> > is being developed. Node defragmenting its PDS should not join the
>> > >> > cluster until the process is finished so it needs to enter
>> > >> > Maintenance Mode as well.
>> > >> >
>> > >> > *Suggested design*
>> > >> > I suggest MM to work as follows:
>> > >> > 1. Node enters MM if special markers are found on disk. These
>> > >> >    markers called Maintenance Records could be created
>> > >> >    automatically (e.g. when storage component detects corrupted
>> > >> >    storage) or by user request (when user requests defragmentation
>> > >> >    of some caches). So entering MM requires node restart.
>> > >> > 2. Started in MM node doesn't join the cluster but finishes
>> > >> >    startup routine so it is able to receive commands and provide
>> > >> >    metrics to the user.
>> > >> > 3. When all necessary maintenance operations are finished,
>> > >> >    Maintenance Records for these operations are deleted from disk
>> > >> >    and node restarted again to enter normal service.
>> > >> >
>> > >> > *Example*
>> > >> > To put it into a context let's consider an example of how I see
>> > >> > the MM workflow in case of PDS corruption.
>> > >> >
>> > >> > 1. Node has failed in the middle of checkpoint when WAL is
>> > >> >    disabled for a particular cache -> data files of the cache are
>> > >> >    potentially corrupted.
>> > >> > 2. On next startup node detects this situation, creates
>> > >> >    Maintenance Record on disk and shuts down.
>> > >> > 3. On next startup node sees Maintenance Record, enters
>> > >> >    Maintenance Mode and waits for user to do specific actions:
>> > >> >    clean potentially corrupted PDS.
>> > >> > 4. When user has done necessary actions he/she removes
>> > >> >    Maintenance Record using Maintenance Mode API exposed via
>> > >> >    control.{sh|bat} script or JMX.
>> > >> > 5. On next startup node goes to normal operations as maintenance
>> > >> >    reason is fixed.
>> > >> >
>> > >> >
>> > >> > I prepared a PR [3] for ticket [1] with draft implementation. It
>> > >> > is not ready to be merged to master branch but is already fully
>> > >> > functional and can be reviewed.
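
To make the record-driven lifecycle in the quoted proposal concrete, here is a rough sketch of how "markers on disk decide the startup mode" could look. This is only an illustration under my own assumptions: the class `MaintenanceSketch`, the `maintenance_records` directory, the record id `corrupted-pds` and all method names are hypothetical and are not taken from the Ignite code base or from the PR.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical sketch of marker-file-driven maintenance mode; not Ignite API. */
public class MaintenanceSketch {
    /** Mode a node would choose at startup. */
    public enum StartupMode { NORMAL, MAINTENANCE }

    /** Assumed location of the records inside the node work directory. */
    private static final String RECORDS_DIR = "maintenance_records";

    /** Creates (or overwrites) a record; the node would then have to restart. */
    public static void registerRecord(Path workDir, String id, String description) throws IOException {
        Path dir = Files.createDirectories(workDir.resolve(RECORDS_DIR));
        Files.write(dir.resolve(id), description.getBytes(StandardCharsets.UTF_8));
    }

    /** Returns ids of all records currently present on disk, sorted by name. */
    public static List<String> activeRecords(Path workDir) throws IOException {
        Path dir = workDir.resolve(RECORDS_DIR);
        List<String> ids = new ArrayList<>();
        if (!Files.isDirectory(dir))
            return ids;
        try (Stream<Path> files = Files.list(dir)) {
            files.forEach(f -> ids.add(f.getFileName().toString()));
        }
        Collections.sort(ids);
        return ids;
    }

    /** Startup check: any record on disk means the node must not join the cluster. */
    public static StartupMode decideStartupMode(Path workDir) throws IOException {
        return activeRecords(workDir).isEmpty() ? StartupMode.NORMAL : StartupMode.MAINTENANCE;
    }

    /** Removes a record once its maintenance reason is fixed; true if it existed. */
    public static boolean clearRecord(Path workDir, String id) throws IOException {
        return Files.deleteIfExists(workDir.resolve(RECORDS_DIR).resolve(id));
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("mm-sketch");

        // Storage detects a potentially corrupted PDS and leaves a record behind.
        registerRecord(workDir, "corrupted-pds", "cache data files may be corrupted");

        // Next startup: the record is found, so the node stays out of the cluster.
        System.out.println(decideStartupMode(workDir) + " " + activeRecords(workDir));

        // The user cleans the PDS and removes the record (e.g. via a script or JMX).
        clearRecord(workDir, "corrupted-pds");

        // Following startup: no records left, the node returns to normal service.
        System.out.println(decideStartupMode(workDir));
    }
}
```

The point of keeping the decision in a plain on-disk check is that it survives automatic restarts in managed environments, which is the case the proposal is concerned about. How records are actually stored and exposed to the user is up to the real implementation.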
>> > >> >
>> > >> > Hope you'll share your feedback on the feature and/or any thoughts
>> > >> > on implementation.
>> > >> >
>> > >> > Thank you!
>> > >> >
>> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
>> > >> > [2]
>> > >> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>> > >> > [3] https://github.com/apache/ignite/pull/8189
>> > >>
>> > >>
>> > >
>> >
>> >
>> > --
>> >
>> > Best regards,
>> > Ivan Pavlukhin
>> >
>>
>
>
> --
> Vladislav Pyatkov
>

--
Best regards,
Ivan Pavlukhin