Thank you, Stan, for the detailed feedback. I totally agree with the "dimensions" stuff, I think that's what I tried to describe in the original proposal. I still think that we can get away with a single enum, because it at least covers the current use cases. In any case, it will be safe to add more values to it if needed. So, if everyone agrees with the proposed enum (with three values), I'll prepare a ticket and will start working on it some time in the future.
It would still be nice to get more feedback, though, because this improvement may affect some users whose use cases I'm not aware of. On Sun, Oct 30, 2022 at 9:46 PM Stanislav Lukyanov <stanlukya...@gmail.com> wrote: > Hi Aleksandr, > > Thanks for driving this. Someone should put a stop to the shutdown > behavior nonsense :) > > But let me complicate things a bit for you. > > I see multiple dimensions that are in play when node shutdown happens: > > - Running operations (Compute, SQL, Cache): interrupt or wait for > completion? If we wait r the for how long? > - Persistence checkpoint: just shutdown? wait for completion if running? > trigger and wait for completion? If we wait then for how long? > - Partition loss: just shutdown? wait for more copies to avoid partition > loss? is behavior the same for in-mem and persistence? is behavior the same > for backups=0? > - (anything else?) > > Ideally, we should first analyse each of these three dimensions and > understand what are the per-dimensions behaviours we can come up with - > what is reasonable. > Then, we can try combine the dimensions: introduce an enum of policies > that picks one behavior for each dimension. > I guess, it should be a table like > > Policy | Operations | Checkpoint | > Partition Loss > IMMEDIATE | Cancel | Cancel | Ignore > GRACEFUL | Wait forever | Trigger and wait forever | Wait forever > > This table should be a part of the Javadoc and the website docs. > > On the other hand, you might want to separate the policies and control the > running operations, the checkpoint, and the rebalance/partition loss > separately. But then it won't solve the initial mess probably? > > Thoughts? > > Thanks, > Stan > > > On 25 Oct 2022, at 17:17, Aleksandr Polovtsev <alexpolovt...@gmail.com> > wrote: > > > > Guys, any more feedback? > > > > On Tue, Oct 18, 2022 at 12:42 PM Aleksandr Polovtsev < > > alexpolovt...@gmail.com> wrote: > > > >> My guess is that it reduces the node shutdown time. We may want to keep > it > >> for compatibility purposes, because I'm afraid that changing the default > >> behaviour is not that easy. > >> > >> On Tue, Oct 18, 2022 at 12:32 PM Николай Ижиков <nizhi...@apache.org> > >> wrote: > >> > >>> So why it’s default behavior? > >>> And why we want keep it? > >>> > >>>> 18 окт. 2022 г., в 12:27, Aleksandr Polovtsev < > alexpolovt...@gmail.com> > >>> написал(а): > >>>> > >>>> But this is the default behaviour right now, do you mean we should > >>> change > >>>> it? I'm ok with that if it will not break something > >>>> > >>>> On Tue, Oct 18, 2022 at 12:13 PM Николай Ижиков <nizhi...@apache.org> > >>> wrote: > >>>> > >>>>>> I don't know what most people use in the real world, maybe more > >>>>> experienced guys can clarify that. > >>>>> > >>>>> My question is different. > >>>>> > >>>>> Why we may want to provide cancel=true? > >>>>> Is there any reason to support this in Ignite? > >>>>> > >>>>>> 18 окт. 2022 г., в 11:47, Aleksandr Polovtsev < > >>> alexpolovt...@gmail.com> > >>>>> написал(а): > >>>>>> > >>>>>> Thank you for the response, Nikolay. As I can see, this is the > default > >>>>>> behaviour now (ShutdownPolicy = IMMEDIATE, cancel = true). I don't > >>> know > >>>>>> what most people use in the real world, maybe more experienced guys > >>> can > >>>>>> clarify that. > >>>>>> > >>>>>> On Tue, Oct 18, 2022 at 11:15 AM Николай Ижиков < > nizhi...@apache.org> > >>>>> wrote: > >>>>>> > >>>>>>> Hello, > >>>>>>> > >>>>>>> Why do we need «immediate»? > >>>>>>> Is there real-world scenarios for it? > >>>>>>> > >>>>>>> Can we do graceful or «save all possible data» shutdown always? > >>>>>>> > >>>>>>>> 18 окт. 2022 г., в 10:59, Aleksandr Polovtsev < > >>> alexpolovt...@gmail.com > >>>>>> > >>>>>>> написал(а): > >>>>>>>> > >>>>>>>> Hello, dear Igniters! > >>>>>>>> > >>>>>>>> I'd like to propose a refactoring of the current Shutdown Policy > >>>>>>> mechanism. > >>>>>>>> Currently, node shutdown is controlled by a bunch of flags and > >>> enums (I > >>>>>>> may > >>>>>>>> be missing some of them): > >>>>>>>> > >>>>>>>> 1. ShutdownPolicy enum has only two entries and is basically a > flag > >>>>> that > >>>>>>>> dictates if we need to wait for the data to be backed up, so that > we > >>>>> will > >>>>>>>> not lose any data when the node goes offline. > >>>>>>>> 2. "cancel" flag is passed to the stop method and indicates that > all > >>>>> long > >>>>>>>> running jobs should be interrupted. For example, it prevents the > >>> node > >>>>>>>> performing a checkpoint on shutdown. I'm pretty sure this flag is > >>>>> nearly > >>>>>>>> always set to "true", because it is used by the Shutdown Hook. > >>>>>>>> 3. There are also a bunch of system properties that affect the > >>>>> shutdown: > >>>>>>>> * IGNITE_PDS_SKIP_CHECKPOINT_ON_NODE_STOP - disables checkpoints > on > >>>>>>>> node stop, which heavily correlates with the "cancel" flag. > >>>>>>>> * IGNITE_WAIT_FOR_BACKUPS_ON_SHUTDOWN - deprecated and duplicates > >>> the > >>>>>>>> ShutdownPolicy enum > >>>>>>>> * IGNITE_NO_SHUTDOWN_HOOK - disables the shutdown hook. I don't > >>> know > >>>>>>> if > >>>>>>>> it is really useful. > >>>>>>>> > >>>>>>>> I think that this approach is counter intuitive and would like to > >>>>>>>> incorporate all aforementioned flags into the ShutdownPolicy enum. > >>> For > >>>>>>>> example, it can look as follows: > >>>>>>>> > >>>>>>>> public enum ShutdownPolicy { > >>>>>>>> IMMEDIATE // Stops the node and cancels all running jobs > >>>>>>>> GRACEFUL // Stops the node, does not cancel running jobs and waits > >>>>> for > >>>>>>>> backups > >>>>>>>> GRACEFUL_NO_BACKUPS // Stops the node, does not cancel running > >>> jobs, > >>>>>>>> does not wait for backups > >>>>>>>> } > >>>>>>>> > >>>>>>>> After that, the "cancel" flag and all corresponding properties can > >>> be > >>>>>>>> removed (apart from the IGNITE_NO_SHUTDOWN_HOOK, if it is still > >>>>> needed). > >>>>>>>> > >>>>>>>> Does this make sense? What do you think? > >>>>>>>> > >>>>>>>> -- > >>>>>>>> With regards, > >>>>>>>> Aleksandr Polovtsev > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> -- > >>>>>> With regards, > >>>>>> Aleksandr Polovtsev > >>>>> > >>>>> > >>>> > >>>> -- > >>>> With regards, > >>>> Aleksandr Polovtsev > >>> > >>> > >> > >> -- > >> With regards, > >> Aleksandr Polovtsev > >> > > > > > > -- > > With regards, > > Aleksandr Polovtsev > > -- With regards, Aleksandr Polovtsev