Sergey,

Thank you for your answer!
Maybe I am looking at the subject from a different angle.

> I think of a node in MM as an almost normal one

I cannot think of such a node as a normal one, because it apparently does not
perform the usual cluster node functions. It is not part of a cluster, cache
data is not available, and Discovery and Communication are not needed.

I fear that with the "node started in a special mode" approach we will get an
additional flag in the code, making the code more complex and fragile.
Shouldn't I worry about that?

2020-09-02 10:45 GMT+03:00, Sergey Chugunov <sergey.chugu...@gmail.com>:
> Vladislav, Ivan,
>
> Thank you for your questions and suggestions. Let me answer them.
>
> Vladislav,
>
> If I understood you correctly, you're talking about a node performing some
> automatic actions to fix the problem and then joining the cluster as usual.
>
> However, the original ticket [1] where we faced the need for Maintenance
> Mode is about exactly the opposite: avoid automatic actions and give the
> user the ability to decide what to do.
>
> Also, the idea of Maintenance Mode is that the node is able to accept
> commands, expose metrics and so on, thus we need all components to be
> initialized (some of them may be only partially initialized due to their
> own maintenance).
> To achieve that we need to go through the full cycle of node
> initialization, including discovery initialization. Once discovery is
> initialized (in a special isolated mode), I don't think it is easy to
> switch back to normal operations without a restart.
>
> Ivan,
>
> I think of a node in MM as an almost normal one (maybe with some components
> having skipped some steps of their initialization). Commands are accepted,
> appropriate metrics are exposed (e.g. through the JMX API), and so on.
>
> So as I see it, we'll have special commands for the control.{sh|bat} CLI
> allowing the user to see the reasons why the node switched to maintenance
> mode and/or to trigger actions to fix the problem (I'm still thinking about
> the proper design of these actions, though).
>
> Of course, the user should also be able to fix the problem manually, e.g.
> by deleting corrupted PDS files while the node is down. Ideally,
> Maintenance Mode should be smart enough to figure that out and switch to
> normal operations without a restart, but I'm not sure that is possible
> without invasive changes to our components' lifecycle.
> So I believe this model (a node truly started in Maintenance Mode plus new
> commands in control.{sh|bat}) is a good fit for our current APIs and ways
> of interacting with the node.
>
> Does it sound reasonable to you?
>
> Thank you!
>
> [1] https://issues.apache.org/jira/browse/IGNITE-13366
>
> On Tue, Sep 1, 2020 at 2:07 PM Ivan Pavlukhin <vololo...@gmail.com> wrote:
>
>> Sergey,
>>
>> Actually, I missed the point that the discussed mode affects a single
>> node and not the whole cluster. Perhaps I mixed up the terms "mode" and
>> "state".
>>
>> My next thoughts about maintenance routines are about special
>> utilities. As far as I remember, MySQL provides a bunch of scripts for
>> various maintenance purposes. What user interface for executing
>> maintenance tasks is assumed? And what do we mean by "starting" a node
>> in maintenance mode? Can we do some routines without "starting" the
>> node (e.g. try to recover the PDS or clean it up)?
>>
>> 2020-08-31 23:41 GMT+03:00, Vladislav Pyatkov <vldpyat...@gmail.com>:
>> > Hi Sergey.
>> >
>> > As I understand it, any switching from/to MM is possible only through
>> > a manual restart of the node.
>> > But in your example those look like technical actions that are only
>> > possible in that particular case.
>> > Do you plan to provide the client with a way to make such a decision
>> > without manual intervention?
>> >
>> > For example: start the node, manually agree with an option, and after
>> > that the conflict is resolved automatically and the node comes back
>> > to the topology as a stable node.
>> >
>> > On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov
>> > <sergey.chugu...@gmail.com> wrote:
>> >
>> >> Hello Ivan,
>> >>
>> >> Thank you for raising a good question, I didn't think of Maintenance
>> >> Mode from that perspective.
>> >>
>> >> In short, Maintenance Mode isn't related to the Cluster States
>> >> concept. According to the javadoc of the ClusterState enum [1] it is
>> >> solely about cache operations and to some extent doesn't affect other
>> >> components of an Ignite node.
>> >> From an API perspective, putting the methods that manage Cluster
>> >> State into the IgniteCluster interface doesn't look ideal to me, but
>> >> it is what it is.
>> >>
>> >> On the other hand, Maintenance Mode as I see it will be managed
>> >> through different APIs than ClusterState, and this difference will
>> >> definitely be reflected in the documentation of the feature.
>> >>
>> >> An Ignite node is a complex composition of many components
>> >> interacting with each other; they may have different lifecycles and
>> >> states, and the states of different components cannot be reduced to
>> >> the lowest common denominator.
>> >>
>> >> However, if you have an idea of how to name the feature better, so
>> >> the user can distinguish it more easily from other similar features,
>> >> please share it with us. Personally, I very much welcome any
>> >> suggestions that make the design more intuitive and easier to use.
>> >>
>> >> Thanks!
>> >>
>> >> [1]
>> >> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java
>> >>
>> >> On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin <vololo...@gmail.com>
>> >> wrote:
>> >>
>> >> > Hi Sergey,
>> >> >
>> >> > Thank you for bringing attention to this important subject!
>> >> >
>> >> > My note here is about adding one more cluster mode. As far as I
>> >> > know, we already have 3 modes (inactive, read-only, read-write),
>> >> > and the subject is about one more. At first glance it could be
>> >> > hard for a user to understand and use all the modes properly. Do
>> >> > we really need the whole spectrum? Could we simplify things
>> >> > somehow?
>> >> >
>> >> > 2020-08-27 15:59 GMT+03:00, Sergey Chugunov
>> >> > <sergey.chugu...@gmail.com>:
>> >> > > Hello Nikolay,
>> >> > >
>> >> > > Created one, available at link [1].
>> >> > >
>> >> > > Initially there was an intention to develop it under IEP-47 [2],
>> >> > > and there is even a separate section for Maintenance Mode there.
>> >> > > But it looks like this feature is useful in more cases and
>> >> > > deserves its own IEP.
>> >> > >
>> >> > > [1]
>> >> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
>> >> > > [2]
>> >> > > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>> >> > >
>> >> > > On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov
>> >> > > <nizhi...@apache.org> wrote:
>> >> > >
>> >> > >> Hello, Sergey!
>> >> > >>
>> >> > >> Thanks for the proposal.
>> >> > >> Let's have an IEP for this feature.
>> >> > >>
>> >> > >> > On 27 Aug 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
>> >> > >> >
>> >> > >> > Hello Igniters,
>> >> > >> >
>> >> > >> > I want to start a discussion about a new supporting feature
>> >> > >> > that could be very useful in many scenarios where persistent
>> >> > >> > storage is involved: Maintenance Mode.
>> >> > >> >
>> >> > >> > *Summary*
>> >> > >> > Maintenance Mode (MM for short) is a special state of an
>> >> > >> > Ignite node in which the node doesn't serve user requests nor
>> >> > >> > join the cluster, but waits for user commands or performs
>> >> > >> > automatic actions for maintenance purposes.
>> >> > >> >
>> >> > >> > *Motivation*
>> >> > >> > There are situations when a node cannot participate in regular
>> >> > >> > operations but at the same time should not be shut down.
>> >> > >> >
>> >> > >> > One example is the ticket [1] where I developed the first
>> >> > >> > draft of Maintenance Mode.
>> >> > >> > There we get into a situation where a node has a potentially
>> >> > >> > corrupted PDS, thus it cannot proceed with the restore routine
>> >> > >> > and join the cluster as usual.
>> >> > >> > At the same time the node should not fail nor be stopped for
>> >> > >> > manual cleanup.
>> >> > >> > Manual cleanup is not always an option (e.g. restricted access
>> >> > >> > to the file system); in managed environments a failed node
>> >> > >> > will be restarted automatically, so the user won't have time
>> >> > >> > to perform the necessary operations.
>> >> > >> > Thus the node needs to function in a special mode allowing the
>> >> > >> > user to connect to it and perform the necessary actions.
>> >> > >> >
>> >> > >> > Another example is described in IEP-47 [2] where
>> >> > >> > defragmentation is being developed. A node defragmenting its
>> >> > >> > PDS should not join the cluster until the process is finished,
>> >> > >> > so it needs to enter Maintenance Mode as well.
>> >> > >> >
>> >> > >> > *Suggested design*
>> >> > >> > I suggest MM work as follows:
>> >> > >> > 1. A node enters MM if special markers are found on disk.
>> >> > >> > These markers, called Maintenance Records, could be created
>> >> > >> > automatically (e.g. when the storage component detects
>> >> > >> > corrupted storage) or by user request (when the user requests
>> >> > >> > defragmentation of some caches). So entering MM requires a
>> >> > >> > node restart.
>> >> > >> > 2. A node started in MM doesn't join the cluster but finishes
>> >> > >> > the startup routine, so it is able to receive commands and
>> >> > >> > provide metrics to the user.
>> >> > >> > 3. When all necessary maintenance operations are finished, the
>> >> > >> > Maintenance Records for these operations are deleted from disk
>> >> > >> > and the node is restarted again to enter normal service.
>> >> > >> >
>> >> > >> > *Example*
>> >> > >> > To put it into context, let's consider an example of how I see
>> >> > >> > the MM workflow in the case of PDS corruption.
>> >> > >> >
>> >> > >> > 1. A node fails in the middle of a checkpoint while WAL is
>> >> > >> > disabled for a particular cache -> data files of that cache
>> >> > >> > are potentially corrupted.
>> >> > >> > 2. On the next startup the node detects this situation,
>> >> > >> > creates a Maintenance Record on disk and shuts down.
>> >> > >> > 3. On the next startup the node sees the Maintenance Record,
>> >> > >> > enters Maintenance Mode and waits for the user to perform the
>> >> > >> > specific action: clean the potentially corrupted PDS.
>> >> > >> > 4. When the user has done the necessary actions, he/she
>> >> > >> > removes the Maintenance Record using the Maintenance Mode API
>> >> > >> > exposed via the control.{sh|bat} script or JMX.
>> >> > >> > 5. On the next startup the node goes back to normal operations
>> >> > >> > as the maintenance reason is fixed.
>> >> > >> >
>> >> > >> > I prepared a PR [3] for ticket [1] with a draft
>> >> > >> > implementation. It is not ready to be merged to the master
>> >> > >> > branch but is already fully functional and can be reviewed.
>> >> > >> >
>> >> > >> > Hope you'll share your feedback on the feature and/or any
>> >> > >> > thoughts on the implementation.
>> >> > >> >
>> >> > >> > Thank you!
>> >> > >> >
>> >> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-13366
>> >> > >> > [2]
>> >> > >> > https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>> >> > >> > [3] https://github.com/apache/ignite/pull/8189
>> >> > >>
>> >> > >
>> >> >
>> >> > --
>> >> > Best regards,
>> >> > Ivan Pavlukhin
>> >>
>> >
>> > --
>> > Vladislav Pyatkov
>>
>> --
>> Best regards,
>> Ivan Pavlukhin
>

--
Best regards,
Ivan Pavlukhin
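
To make the marker-driven flow from the *Suggested design* section above more concrete, here is a minimal, self-contained Java sketch of the startup-time check it describes. All names in it (MaintenanceStartupSketch, MaintenanceRecord, readRecords, the "maintenance" directory) are hypothetical illustrations and are not taken from PR [3] or the Ignite codebase; the sketch only shows the general idea: scan the node work directory for maintenance markers and, if any are found, stay out of the topology and serve only local commands instead of performing a normal join.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/**
 * Illustrative sketch of the startup flow from the proposal: the node scans
 * its work directory for maintenance markers and, if any are found, starts
 * in Maintenance Mode instead of joining the cluster. Names are hypothetical.
 */
public class MaintenanceStartupSketch {
    /** Hypothetical subdirectory of the node work directory holding marker files. */
    private static final String MAINTENANCE_DIR = "maintenance";

    /** A single maintenance marker: the reason the node must stay out of the cluster. */
    public static final class MaintenanceRecord {
        private final String name;    // e.g. "corrupted-pds" or "defragmentation"
        private final String details; // free-form description stored in the marker file

        public MaintenanceRecord(String name, String details) {
            this.name = name;
            this.details = details;
        }

        public String name() { return name; }
        public String details() { return details; }
    }

    /** Scans the work directory for marker files left by a previous run. */
    public static List<MaintenanceRecord> readRecords(Path workDir) throws IOException {
        Path dir = workDir.resolve(MAINTENANCE_DIR);
        List<MaintenanceRecord> records = new ArrayList<>();

        if (!Files.isDirectory(dir))
            return records; // no markers -> normal startup

        try (Stream<Path> files = Files.list(dir)) {
            for (Path marker : (Iterable<Path>) files::iterator)
                records.add(new MaintenanceRecord(
                    marker.getFileName().toString(),
                    new String(Files.readAllBytes(marker), StandardCharsets.UTF_8)));
        }

        return records;
    }

    /** Decides between normal startup and Maintenance Mode, as in the proposal. */
    public static void startNode(Path workDir) throws IOException {
        List<MaintenanceRecord> records = readRecords(workDir);

        if (records.isEmpty()) {
            System.out.println("No maintenance records found, joining the cluster as usual.");
            // normal discovery initialization and join would happen here
        }
        else {
            System.out.println("Entering Maintenance Mode, reasons:");
            for (MaintenanceRecord r : records)
                System.out.println("  " + r.name() + ": " + r.details());
            // the node stays out of the topology; only local commands (CLI/JMX)
            // are served until the user resolves the issue and removes the record
        }
    }

    public static void main(String[] args) throws IOException {
        startNode(Paths.get(args.length > 0 ? args[0] : "work"));
    }
}

In the real implementation this check would presumably be woven into the node startup and discovery lifecycle rather than living in a standalone class, and removing a record would go through the control.{sh|bat} or JMX API mentioned in the thread; the sketch is only meant to illustrate steps 1-3 of the suggested design under those assumptions.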