+1 to making the new master key name an explicit parameter.
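
For illustration only, here is a minimal sketch of how that could look, assuming a maintenance record carries a set of string parameters. Every class, method and parameter name below is hypothetical and is not taken from the Ignite code base or from the IEP; the point is just that the key name travels with the record instead of a JVM-wide system property.

```java
import java.util.Map;
import java.util.Optional;

public class MasterKeyMaintenanceSketch {
    /** Hypothetical stand-in for a maintenance record: an id plus free-form string parameters. */
    static final class MaintenanceRecord {
        private final String id;
        private final Map<String, String> params;

        MaintenanceRecord(String id, Map<String, String> params) {
            this.id = id;
            this.params = Map.copyOf(params); // defensive, immutable copy
        }

        String id() {
            return id;
        }

        Optional<String> param(String name) {
            return Optional.ofNullable(params.get(name));
        }
    }

    /** Hypothetical record id and parameter name, chosen only for this sketch. */
    static final String MASTER_KEY_CHANGE_ID = "master-key-change";
    static final String NEW_KEY_NAME_PARAM = "newMasterKeyName";

    /**
     * What an encryption manager could do on startup in maintenance mode:
     * read the new key name from the record rather than from a system property.
     */
    static String resolveNewMasterKeyName(MaintenanceRecord rec) {
        if (!MASTER_KEY_CHANGE_ID.equals(rec.id()))
            throw new IllegalArgumentException("Unexpected maintenance record: " + rec.id());

        return rec.param(NEW_KEY_NAME_PARAM)
            .filter(name -> !name.trim().isEmpty())
            .orElseThrow(() -> new IllegalStateException("New master key name is not set"));
    }

    public static void main(String[] args) {
        MaintenanceRecord rec = new MaintenanceRecord(
            MASTER_KEY_CHANGE_ID, Map.of(NEW_KEY_NAME_PARAM, "masterKey2"));

        System.out.println(resolveNewMasterKeyName(rec));
    }
}
```

With this shape a missing or blank key name fails fast at startup with a clear message, and the parameter is scoped to the one maintenance task that needs it.
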
> On Sep 29, 2020, at 16:35, Sergey Chugunov <sergey.chugu...@gmail.com>
> wrote:
>
> Hello Nikolay,
>
>> AFAIKU there is a third use case for this mode.
>
> Sorry for the late reply.
>
> I took a look at the code, and maintenance mode indeed looks like a good match
> for the master key change situation.
>
> I want to clarify only one thing. In the current implementation we pass the new
> master key name via a system property. Are you thinking of getting rid of this
> property and passing the new master key name to the encryption manager with the
> maintenance parameters? In terms of the original IEP, these are the parameters
> passed with a MaintenanceRecord.
>
> --
> Thanks!
>
> On Mon, Sep 21, 2020 at 3:20 PM Nikolay Izhikov <nizhi...@apache.org> wrote:
>
>> Hello, Sergey.
>>
>>> At the moment I'm aware of two use cases for this feature: corrupted
>>> PDS cleanup and defragmentation.
>>
>> AFAIKU there is a third use case for this mode.
>>
>> Changing the encryption master key when a node was down during a cluster
>> master key change.
>> In this case the node can't join the cluster, because its master key
>> differs from the cluster's.
>> To recover the node, Ignite should change the master key locally before the join.
>>
>> Please take a look at the source code [1]
>>
>> [1]
>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/managers/encryption/GridEncryptionManager.java#L710
>>
>>> On Sep 21, 2020, at 14:37, Sergey Chugunov <sergey.chugu...@gmail.com>
>>> wrote:
>>>
>>> Ivan,
>>>
>>> Sorry for the confusion, MM is indeed not a normal mode. What I was trying
>>> to say is that a node in MM still starts and allows the user to perform
>>> actions on it, such as sending commands via the control utility/JMX APIs or
>>> reading metrics.
>>>
>>> This is the key point: although the node is not in the cluster, it is
>>> still alive, can be monitored, and supports management operations for
>>> maintenance.
>>>
>>> From the code complexity perspective I'm trying to design the feature in
>>> such a way that all maintenance code is as encapsulated as possible and
>>> avoids massive interventions into the main workflows of components.
>>> At the moment I'm aware of two use cases for this feature: corrupted PDS
>>> cleanup and defragmentation. As far as I know it won't bring too much
>>> complexity in either case.
>>>
>>> I cannot speak for other components, but I believe it will be possible to
>>> integrate the MM feature into their workflows as well with a reasonable
>>> amount of refactoring.
>>>
>>> Does it make sense to you?
>>>
>>> On Sun, Sep 6, 2020 at 8:08 AM Ivan Pavlukhin <vololo...@gmail.com> wrote:
>>>
>>>> Sergey,
>>>>
>>>> Thank you for your answer!
>>>>
>>>> Maybe I am looking at the subject from a different angle.
>>>>
>>>>> I think of a node in MM as an almost normal one
>>>> I cannot think of such a mode as a normal one, because it apparently
>>>> does not perform the usual cluster node functions. It is not part of a
>>>> cluster, cache data is not available, and Discovery and Communication are
>>>> not needed.
>>>>
>>>> I fear that with the "node started in a special mode" approach we will get
>>>> an additional flag in the code, making the code more complex and
>>>> fragile. Shouldn't I worry about it?
>>>>
>>>> 2020-09-02 10:45 GMT+03:00, Sergey Chugunov <sergey.chugu...@gmail.com>:
>>>>> Vladislav, Ivan,
>>>>>
>>>>> Thank you for your questions and suggestions. Let me answer them.
>>>>>
>>>>> Vladislav,
>>>>>
>>>>> If I understood you correctly, you're talking about a node performing some
>>>>> automatic actions to fix the problem and then joining the cluster as usual.
>>>>>
>>>>> However, the original ticket [1] where we faced the need for Maintenance
>>>>> Mode is about exactly the opposite: avoiding automatic actions and giving
>>>>> the user the ability to decide what to do.
>>>>>
>>>>> Also, the idea of Maintenance Mode is that the node is able to accept
>>>>> commands, expose metrics and so on, thus we need all components to be
>>>>> initialized (some of them may be partially initialized due to their own
>>>>> maintenance).
>>>>> To achieve that we need to go through a full cycle of node initialization,
>>>>> including discovery initialization. Once discovery is initialized (in a
>>>>> special isolated mode), I don't think it is easy to switch back to normal
>>>>> operations without a restart.
>>>>>
>>>>> Ivan,
>>>>>
>>>>> I think of a node in MM as an almost normal one (maybe with some components
>>>>> having skipped some steps of their initialization). Commands are accepted,
>>>>> appropriate metrics are exposed, e.g. through the JMX API, and so on.
>>>>>
>>>>> So as I see it, we'll have special commands for the control.{sh|bat} CLI
>>>>> allowing the user to see the reasons why the node switched to maintenance
>>>>> mode and/or trigger actions to fix the problem (I'm still thinking about
>>>>> the proper design of these actions though).
>>>>>
>>>>> Of course the user should also be able to fix the problem manually, e.g. by
>>>>> manually deleting corrupted PDS files when the node is down. Ideally
>>>>> Maintenance Mode should be smart enough to figure that out and switch to
>>>>> normal operations without a restart, but I'm not sure it is possible
>>>>> without invasive changes to our components' lifecycle.
>>>>> So I believe this model (node truly started in Maintenance Mode and new
>>>>> commands in control.{sh|bat}) is a good fit for our current APIs and ways
>>>>> to interact with the node.
>>>>>
>>>>> Does it sound reasonable to you?
>>>>>
>>>>> Thank you!
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-13366
>>>>>
>>>>> On Tue, Sep 1, 2020 at 2:07 PM Ivan Pavlukhin <vololo...@gmail.com> wrote:
>>>>>
>>>>>> Sergey,
>>>>>>
>>>>>> Actually, I missed the point that the discussed mode affects a single
>>>>>> node and not the whole cluster. Perhaps I mixed up the terms "mode" and
>>>>>> "state".
>>>>>>
>>>>>> My next thoughts about maintenance routines are about special
>>>>>> utilities. As far as I remember, MySQL provides a bunch of scripts for
>>>>>> various maintenance purposes. What user interface is assumed for
>>>>>> executing maintenance tasks? And what do we mean by "starting" a node
>>>>>> in maintenance mode? Can we do some routines without "starting"
>>>>>> (e.g. try to recover PDS or clean up)?
>>>>>>
>>>>>> 2020-08-31 23:41 GMT+03:00, Vladislav Pyatkov <vldpyat...@gmail.com>:
>>>>>>> Hi Sergey.
>>>>>>>
>>>>>>> As I understand it, switching to or from MM is possible only through a
>>>>>>> manual node restart.
>>>>>>> But in your example these look like technical actions that are possible
>>>>>>> only in that case.
>>>>>>> Do you plan to give the user a way to make a decision without
>>>>>>> manual intervention?
>>>>>>>
>>>>>>> For example: start the node, manually agree to an option, and after that
>>>>>>> the node automatically resolves the conflict and returns to the topology
>>>>>>> as a stable node.
>>>>>>>
>>>>>>> On Mon, Aug 31, 2020 at 5:41 PM Sergey Chugunov <sergey.chugu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Ivan,
>>>>>>>>
>>>>>>>> Thank you for raising a good question, I didn't think of Maintenance Mode
>>>>>>>> from that perspective.
>>>>>>>>
>>>>>>>> In short, Maintenance Mode isn't related to the Cluster States concept.
>>>>>>>> According to the javadoc of the ClusterState enum [1], it is solely
>>>>>>>> about cache operations and to some extent doesn't affect other components
>>>>>>>> of an Ignite node.
>>>>>>>> From the API perspective, putting the methods to manage Cluster State into
>>>>>>>> the IgniteCluster interface doesn't look ideal to me, but it is what it is.
>>>>>>>>
>>>>>>>> On the other hand, Maintenance Mode as I see it will be managed through
>>>>>>>> different APIs than ClusterState, and this difference will definitely be
>>>>>>>> reflected in the documentation of the feature.
>>>>>>>>
>>>>>>>> An Ignite node is a complex assembly of many components interacting with
>>>>>>>> each other; they may have different lifecycles and states, and the states
>>>>>>>> of different components cannot be reduced to the lowest common denominator.
>>>>>>>>
>>>>>>>> However, if you have an idea of a better name for the feature that lets
>>>>>>>> the user distinguish it more easily from other similar features, please
>>>>>>>> share it with us. Personally, I very much welcome any suggestions that
>>>>>>>> make the design more intuitive and easy to use.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/cluster/ClusterState.java
>>>>>>>>
>>>>>>>> On Mon, Aug 31, 2020 at 12:32 PM Ivan Pavlukhin <vololo...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Sergey,
>>>>>>>>>
>>>>>>>>> Thank you for bringing attention to that important subject!
>>>>>>>>>
>>>>>>>>> My note here is about one more cluster mode. As far as I know,
>>>>>>>>> we currently already have 3 modes (inactive, read-only, read-write)
>>>>>>>>> and the subject is about one more. At first glance it could be
>>>>>>>>> hard for a user to understand and use all the modes properly. Do we
>>>>>>>>> really need the whole spectrum? Could we simplify things somehow?
>>>>>>>>>
>>>>>>>>> 2020-08-27 15:59 GMT+03:00, Sergey Chugunov
>>>>>>>>> <sergey.chugu...@gmail.com>:
>>>>>>>>>> Hello Nikolay,
>>>>>>>>>>
>>>>>>>>>> Created one, available at link [1].
>>>>>>>>>>
>>>>>>>>>> Initially there was an intention to develop it under IEP-47 [2], and
>>>>>>>>>> there is even a separate section for Maintenance Mode there.
>>>>>>>>>> But it looks like this feature is useful in more cases and deserves
>>>>>>>>>> its own IEP.
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-53%3A+Maintenance+Mode
>>>>>>>>>> [2]
>>>>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 27, 2020 at 11:01 AM Nikolay Izhikov
>>>>>>>>>> <nizhi...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello, Sergey!
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the proposal.
>>>>>>>>>>> Let's have an IEP for this feature.
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 27, 2020, at 10:25, Sergey Chugunov <sergey.chugu...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hello Igniters,
>>>>>>>>>>>>
>>>>>>>>>>>> I want to start a discussion about a new supporting feature that could
>>>>>>>>>>>> be very useful in many scenarios where persistent storage is involved:
>>>>>>>>>>>> Maintenance Mode.
>>>>>>>>>>>>
>>>>>>>>>>>> *Summary*
>>>>>>>>>>>> Maintenance Mode (MM for short) is a special state of an Ignite node
>>>>>>>>>>>> in which the node neither serves user requests nor joins the cluster
>>>>>>>>>>>> but waits for user commands or performs automatic actions for
>>>>>>>>>>>> maintenance purposes.
>>>>>>>>>>>>
>>>>>>>>>>>> *Motivation*
>>>>>>>>>>>> There are situations when a node cannot participate in regular
>>>>>>>>>>>> operations but at the same time should not be shut down.
>>>>>>>>>>>>
>>>>>>>>>>>> One example is ticket [1], where I developed the first draft of
>>>>>>>>>>>> Maintenance Mode.
>>>>>>>>>>>> Here we get into a situation where a node has a potentially corrupted
>>>>>>>>>>>> PDS and thus cannot proceed with the restore routine and join the
>>>>>>>>>>>> cluster as usual.
>>>>>>>>>>>> At the same time the node should neither fail nor be stopped for
>>>>>>>>>>>> manual cleanup.
>>>>>>>>>>>> Manual cleanup is not always an option (e.g. restricted access to the
>>>>>>>>>>>> file system); in managed environments a failed node will be restarted
>>>>>>>>>>>> automatically, so the user won't have time to perform the necessary
>>>>>>>>>>>> operations.
>>>>>>>>>>>> Thus the node needs to function in a special mode allowing the user to
>>>>>>>>>>>> connect to it and perform the necessary actions.
>>>>>>>>>>>>
>>>>>>>>>>>> Another example is described in IEP-47 [2], where defragmentation is
>>>>>>>>>>>> being developed. A node defragmenting its PDS should not join the
>>>>>>>>>>>> cluster until the process is finished, so it needs to enter
>>>>>>>>>>>> Maintenance Mode as well.
>>>>>>>>>>>>
>>>>>>>>>>>> *Suggested design*
>>>>>>>>>>>> I suggest MM work as follows:
>>>>>>>>>>>> 1. A node enters MM if special markers are found on disk. These
>>>>>>>>>>>> markers, called Maintenance Records, could be created automatically
>>>>>>>>>>>> (e.g. when the storage component detects corrupted storage) or by user
>>>>>>>>>>>> request (when the user requests defragmentation of some caches). So
>>>>>>>>>>>> entering MM requires a node restart.
>>>>>>>>>>>> 2. A node started in MM doesn't join the cluster but finishes the
>>>>>>>>>>>> startup routine so that it is able to receive commands and provide
>>>>>>>>>>>> metrics to the user.
>>>>>>>>>>>> 3. When all necessary maintenance operations are finished, the
>>>>>>>>>>>> Maintenance Records for these operations are deleted from disk and the
>>>>>>>>>>>> node is restarted again to enter normal service.
>>>>>>>>>>>>
>>>>>>>>>>>> *Example*
>>>>>>>>>>>> To put it into context, let's consider an example of how I see the MM
>>>>>>>>>>>> workflow in case of PDS corruption.
>>>>>>>>>>>>
>>>>>>>>>>>>   1. A node has failed in the middle of a checkpoint when WAL is
>>>>>>>>>>>>   disabled for a particular cache -> data files of the cache are
>>>>>>>>>>>>   potentially corrupted.
>>>>>>>>>>>>   2. On the next startup the node detects this situation, creates a
>>>>>>>>>>>>   Maintenance Record on disk and shuts down.
>>>>>>>>>>>>   3. On the next startup the node sees the Maintenance Record, enters
>>>>>>>>>>>>   Maintenance Mode and waits for the user to do specific actions: clean
>>>>>>>>>>>>   the potentially corrupted PDS.
>>>>>>>>>>>>   4. When the user has done the necessary actions, he/she removes the
>>>>>>>>>>>>   Maintenance Record using the Maintenance Mode API exposed via the
>>>>>>>>>>>>   control.{sh|bat} script or JMX.
>>>>>>>>>>>>   5. On the next startup the node goes to normal operations as the
>>>>>>>>>>>>   maintenance reason is fixed.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I prepared a PR [3] for ticket [1] with a draft implementation. It is
>>>>>>>>>>>> not ready to be merged to the master branch but is already fully
>>>>>>>>>>>> functional and can be reviewed.
>>>>>>>>>>>>
>>>>>>>>>>>> I hope you'll share your feedback on the feature and/or any thoughts
>>>>>>>>>>>> on the implementation.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you!
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://issues.apache.org/jira/browse/IGNITE-13366
>>>>>>>>>>>> [2]
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-47:+Native+persistence+defragmentation
>>>>>>>>>>>> [3] https://github.com/apache/ignite/pull/8189
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Ivan Pavlukhin
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Vladislav Pyatkov
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Best regards,
>>>>>> Ivan Pavlukhin
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Best regards,
>>>> Ivan Pavlukhin
>>>>
>>
>>