Thanks for the feedback! Using the `execution.checkpointing.incremental.enabled`, and enabling it by default sounds good to me.
Best, Rui On Wed, Jan 3, 2024 at 11:10 AM Zakelly Lan <zakelly....@gmail.com> wrote: > Hi Rui, > > Thanks for your comments! > > IMO, given that the state backend can be plugably loaded (as you can > specify a state backend factory), I prefer not providing state backend > specified options in the framework. > > Secondly, the incremental checkpoint is actually a sharing file strategy > across checkpoints, which means the state backend *could* reuse files from > previous cp but not *must* do so. When the state backend could not reuse > the files, it is reasonable to fallback to a full checkpoint. > > Thus, I suggest we make it `execution.checkpointing.incremental` and enable > it by default. For those state backends not supporting this, they perform > full checkpoints and print a warning to inform users. Users do not need to > pay special attention to different options to control this across different > state backends. This is more user-friendly in my opinion. WDYT? > > On Tue, Jan 2, 2024 at 10:49 AM Rui Fan <1996fan...@gmail.com> wrote: > > > Hi Zakelly, > > > > I'm not sure whether we could add the state backend type in the > > new key name of state.backend.incremental. It means we use > > `execution.checkpointing.rocksdb-incremental` or > > `execution.checkpointing.rocksdb-incremental.enabled`. > > > > So far, state.backend.incremental only works for rocksdb state backend. > > And this feature or optimization is very valuable and huge for large > > state flink jobs. I believe it's enabled for most production flink jobs > > with large rocksdb state. > > > > If this option isn't generic for all state backend types, I guess we > > can enable `execution.checkpointing.rocksdb-incremental.enabled` > > by default in Flink 2.0. > > > > But if it works for all state backends, it's hard to enable it by > default. > > Enabling great and valuable features or improvements are useful > > for users, especially a lot of new flink users. Out-of-the-box options > > are good for users. > > > > WDYT? > > > > Best, > > Rui > > > > On Fri, Dec 29, 2023 at 1:45 PM Zakelly Lan <zakelly....@gmail.com> > wrote: > > > > > Hi everyone, > > > > > > Thanks all for your comments! > > > > > > As many of you have questions about the names for boolean options, I > > > suggest we make a naming rule for them. For now I could think of three > > > options: > > > > > > Option 1: Use enumeration options if possible. But this may cause some > > name > > > collisions or confusion as we discussed and we should unify the > statement > > > everywhere. > > > Option 2: Use boolean options and add 'enabled' as the suffix. > > > Option 3: Use boolean options and ONLY add 'enabled' when there are > more > > > detailed configurations under the same prefix, to prevent one name from > > > serving as a prefix to another. > > > > > > I am slightly inclined to Option 3, since it is more in line with > current > > > practice and friendly for existing users. Also It reduces the length of > > > configuration names as much as possible. I really want to hear your > > > opinions. > > > > > > > > > @Xuannan > > > > > > I agree with your comments 1 and 3. > > > > > > For 2, If we decide to change the name, maybe > > > `execution.checkpointing.parallel-cleaner` is better? And as for > whether > > to > > > add 'enabled' I suggest we discuss the rule above. WDYT? > > > Thanks! > > > > > > > > > Best, > > > Zakelly > > > > > > On Fri, Dec 29, 2023 at 12:02 PM Xuannan Su <suxuanna...@gmail.com> > > wrote: > > > > > > > Hi Zakelly, > > > > > > > > Thanks for driving this! The organization of the configuration option > > > > in the FLIP looks much cleaner and easier to understand. +1 to the > > > > FLIP. > > > > > > > > Just some questions from me. > > > > > > > > 1. I think the change to the ConfigOptions should be put in the > > > > `Public Interface` section, instead of `Proposed Changed`, as those > > > > configuration options are public interface. > > > > > > > > 2. The key `state.checkpoint.cleaner.parallel-mode` seems confusing. > > > > It feels like it is used to choose different modes. In fact, it is a > > > > boolean flag to indicate whether to enable parallel clean. How about > > > > making it `state.checkpoint.cleaner.parallel-mode.enabled`? > > > > > > > > 3. The `execution.checkpointing.write-buffer` may better be > > > > `execution.checkpointing.write-buffer-size` so that we know it is > > > > configuring the size of the buffer. > > > > > > > > Best, > > > > Xuannan > > > > > > > > > > > > On Wed, Dec 27, 2023 at 7:17 PM Yanfei Lei <fredia...@gmail.com> > > wrote: > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > Considering the name occupation, how about naming it as > > > > `execution.checkpointing.type`? > > > > > > > > > > `Checkpoint Type`[1,2] is used to describe aligned/unaligned > > > > > checkpoint, I am inclined to make a choice between > > > > > `execution.checkpointing.incremental` and > > > > > `execution.checkpointing.incremental.enabled`. > > > > > > > > > > > > > > > [1] > > > > > > > > > > https://nightlies.apache.org/flink/flink-docs-release-1.18/docs/ops/monitoring/checkpoint_monitoring/ > > > > > [2] > > > > > > > > > > https://github.com/apache/flink/blob/master/flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/detail/job-checkpoints-detail.component.html#L27 > > > > > > > > > > -- > > > > > Best, > > > > > Yanfei > > > > > > > > > > Zakelly Lan <zakelly....@gmail.com> 于2023年12月27日周三 14:41写道: > > > > > > > > > > > > Hi Lijie, > > > > > > > > > > > > Thanks for the reminder! I missed this. > > > > > > > > > > > > Considering the name occupation, how about naming it as > > > > > > `execution.checkpointing.type`? > > > > > > > > > > > > Actually I think the current `execution.checkpointing.mode` is > > > > confusing in > > > > > > some ways, maybe `execution.checkpointing.data-consistency` is > > > better. > > > > > > > > > > > > > > > > > > Best, > > > > > > Zakelly > > > > > > > > > > > > > > > > > > On Wed, Dec 27, 2023 at 12:59 PM Lijie Wang < > > > wangdachui9...@gmail.com> > > > > > > wrote: > > > > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > > > > >> I'm wondering if `execution.checkpointing.savepoint-dir` > would > > > be > > > > > > > better. > > > > > > > > > > > > > > `execution.checkpointing.dir` and > > > > `execution.checkpointing.savepoint-dir` > > > > > > > are also fine for me. > > > > > > > > > > > > > > >> So I think an enumeration option > > `execution.checkpointing.mode` > > > > which > > > > > > > can be 'full' (default) or 'incremental' would be better > > > > > > > > > > > > > > I agree with using an enumeration option. But currently there > is > > > > already a > > > > > > > configuration option called `execution.checkpointing.mode`, > which > > > is > > > > used > > > > > > > to choose EXACTLY_ONCE or AT_LEAST_ONCE. Maybe we need to use > > > > another name > > > > > > > or merge these two options. > > > > > > > > > > > > > > Best, > > > > > > > Lijie > > > > > > > > > > > > > > Zakelly Lan <zakelly....@gmail.com> 于2023年12月27日周三 11:43写道: > > > > > > > > > > > > > > > Hi everyone, > > > > > > > > > > > > > > > > Thanks all for your comments! > > > > > > > > > > > > > > > > @Yanfei > > > > > > > > > > > > > > > > > 1. For some state backends that do not support incremental > > > > checkpoint, > > > > > > > > > how does the execution.checkpointing.incrementaloption take > > > > effect? Or > > > > > > > > > is it better to put incremental under > > > > state.backend.xxx.incremental? > > > > > > > > > > > > > > > > > I'd rather not put the option for incremental checkpoint > under > > > the > > > > > > > > 'state.backend', since it is more about the checkpointing > > instead > > > > of > > > > > > > state > > > > > > > > accessing. Of course, the state backend may not necessarily > do > > > > > > > incremental > > > > > > > > checkpoint as requested. If the state backend is not capable > of > > > > taking > > > > > > > > incremental cp, it is better to fallback to the full cp. > > > > > > > > > > > > > > > > 2. I'm a little worried that putting all configurations into > > > > > > > > > `ExecutionCheckpointingOptions` will introduce some > > dependency > > > > > > > > > problems. Some options would be used by flink-runtime > module, > > > but > > > > > > > > > flink-runtime should not depend on flink-streaming-java. > e.g. > > > > > > > > > FLINK-28286[1]. > > > > > > > > > So, I prefer to move configurations to > > `CheckpointingOptions`, > > > > WDYT? > > > > > > > > > > > > > > > > > > > > > > > > > Yes, that's a very good point. Moving to > > > > > > > > `CheckpointingOptions`(flink-core) makes sense. > > > > > > > > > > > > > > > > @Lijie > > > > > > > > > > > > > > > > How about > > > > > > > > > state.savepoints.dir -> > execution.checkpointing.savepoint.dir > > > > > > > > > state.checkpoints.dir -> > > execution.checkpointing.checkpoint.dir > > > > > > > > > > > > > > > > > > > > > > > > Actually, I think the `checkpointing.checkpoint` may cause > some > > > > > > > confusion. > > > > > > > > But I'm ok if others agree. > > > > > > > > I'm wondering if `execution.checkpointing.savepoint-dir` > would > > be > > > > better. > > > > > > > > WDYT? > > > > > > > > > > > > > > > > 2. We changed the execution.checkpointing.local-copy' to > > > > > > > > > 'execution.checkpointing.local-copy.enabled'. Should we > also > > > add > > > > > > > > "enabled" > > > > > > > > > suffix for other boolean type configuration options ? For > > > > example, > > > > > > > > > execution.checkpointing.incremental -> > > > > > > > > > execution.checkpointing.incremental.enabled > > > > > > > > > > > > > > > > > > > > > > > > > Actually, the incremental cp is something like choosing a > mode > > > for > > > > doing > > > > > > > > checkpoint instead of enabling a function. So I think an > > > > enumeration > > > > > > > option > > > > > > > > `execution.checkpointing.mode` which can be 'full' (default) > or > > > > > > > > 'incremental' would be better, WDYT? > > > > > > > > And @Rui Fan @Yanfei What do you think about this? > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Dec 26, 2023 at 5:15 PM Lijie Wang < > > > > wangdachui9...@gmail.com> > > > > > > > > wrote: > > > > > > > > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > > > > > > > > Thanks for driving the discussion. > > > > > > > > > > > > > > > > > > 1. > > > > > > > > > >> But I'm not so sure since there is only one > > > savepoint-related > > > > > > > option. > > > > > > > > > Maybe someone else could share some thoughts here. > > > > > > > > > > > > > > > > > > How about > > > > > > > > > state.savepoints.dir -> > execution.checkpointing.savepoint.dir > > > > > > > > > state.checkpoints.dir -> > > execution.checkpointing.checkpoint.dir > > > > > > > > > > > > > > > > > > 2. We changed the execution.checkpointing.local-copy' to > > > > > > > > > 'execution.checkpointing.local-copy.enabled'. Should we > also > > > add > > > > > > > > "enabled" > > > > > > > > > suffix for other boolean type configuration options ? For > > > > example, > > > > > > > > > execution.checkpointing.incremental -> > > > > > > > > > execution.checkpointing.incremental.enabled > > > > > > > > > > > > > > > > > > In this way, the naming style of configuration options is > > > > unified, and > > > > > > > it > > > > > > > > > can avoid potential similar problems (for example, we may > > need > > > > to add > > > > > > > > more > > > > > > > > > options for incremental checkpoint in the future). > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Lijie > > > > > > > > > > > > > > > > > > Yanfei Lei <fredia...@gmail.com> 于2023年12月26日周二 12:05写道: > > > > > > > > > > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > > > > > > > > > > Thank you for creating the FLIP and starting the > > discussion. > > > > > > > > > > > > > > > > > > > > The current arrangement of these options is indeed > somewhat > > > > > > > haphazard, > > > > > > > > > > and the new arrangement looks much better. I have some > > > > questions > > > > > > > about > > > > > > > > > > the arrangement of some new configuration options: > > > > > > > > > > > > > > > > > > > > 1. For some state backends that do not support > incremental > > > > > > > checkpoint, > > > > > > > > > > how does the execution.checkpointing.incrementaloption > take > > > > effect? > > > > > > > Or > > > > > > > > > > is it better to put incremental under > > > > state.backend.xxx.incremental? > > > > > > > > > > > > > > > > > > > > 2. I'm a little worried that putting all configurations > > into > > > > > > > > > > `ExecutionCheckpointingOptions` will introduce some > > > dependency > > > > > > > > > > problems. Some options would be used by flink-runtime > > module, > > > > but > > > > > > > > > > flink-runtime should not depend on flink-streaming-java. > > e.g. > > > > > > > > > > FLINK-28286[1]. > > > > > > > > > > So, I prefer to move configurations to > > > `CheckpointingOptions`, > > > > WDYT? > > > > > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/FLINK-28286 > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Best, > > > > > > > > > > Yanfei > > > > > > > > > > > > > > > > > > > > Zakelly Lan <zakelly....@gmail.com> 于2023年12月25日周一 > > 21:14写道: > > > > > > > > > > > > > > > > > > > > > > Hi Rui Fan and Junrui, > > > > > > > > > > > > > > > > > > > > > > Thanks for the reminder! I agree to change the > > > > > > > > > > > 'execution.checkpointing.local-copy' to > > > > > > > > > > > 'execution.checkpointing.local-copy.enabled'. > > > > > > > > > > > > > > > > > > > > > > And for other suggestions Rui proposed: > > > > > > > > > > > > > > > > > > > > > > 1. How about execution.checkpointing.storage.type > instead > > > > > > > > > > > > of execution.checkpointing.storage? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ah, I missed something here. Actually I suggest we > could > > > > merge the > > > > > > > > > > current > > > > > > > > > > > 'state.checkpoints.dir' and 'state.checkpoint-storage' > > into > > > > one URI > > > > > > > > > > > configuration named 'execution.checkpointing.dir'. > WDYT? > > > > > > > > > > > > > > > > > > > > > > 3. execution.checkpointing.savepoint.dir is a little > > weird. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Yes, I think it is better to make 'savepoint' and > > > > 'checkpoint' the > > > > > > > > same > > > > > > > > > > > level. But I'm not so sure since there is only one > > > > > > > savepoint-related > > > > > > > > > > > option. Maybe someone else could share some thoughts > > here. > > > > > > > > > > > > > > > > > > > > > > 4. How about execution.recovery.claim-mode instead of > > > > > > > > > > > > execution.recovery.mode? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Agreed. That's more accurate. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Many thanks for your suggestions! > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > Zakelly > > > > > > > > > > > > > > > > > > > > > > On Mon, Dec 25, 2023 at 8:18 PM Junrui Lee < > > > > jrlee....@gmail.com> > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > > > > > > > > > > > > > > Thanks for driving this. I agree that the proposed > > > > restructuring > > > > > > > of > > > > > > > > > the > > > > > > > > > > > > configuration options is largely positive. It will > make > > > > > > > > understanding > > > > > > > > > > and > > > > > > > > > > > > working with Flink configurations more intuitive. > > > > > > > > > > > > > > > > > > > > > > > > Most of the proposed changes look great. Just a > > heads-up, > > > > as Rui > > > > > > > > Fan > > > > > > > > > > > > mentioned, Flink currently requires that no > > > configOption's > > > > key be > > > > > > > > the > > > > > > > > > > > > prefix of another to avoid issues when we eventually > > > adopt > > > > a > > > > > > > > standard > > > > > > > > > > YAML > > > > > > > > > > > > parser, as detailed in FLINK-29372 ( > > > > > > > > > > > > https://issues.apache.org/jira/browse/FLINK-29372). > > > > Therefore, > > > > > > > > it's > > > > > > > > > > better > > > > > > > > > > > > to change the key > 'execution.checkpointing.local-copy' > > > > because it > > > > > > > > > > serves as > > > > > > > > > > > > a prefix to the key > > > > 'execution.checkpointing.local-copy.dir'. > > > > > > > > > > > > > > > > > > > > > > > > Best regards, > > > > > > > > > > > > Junrui > > > > > > > > > > > > > > > > > > > > > > > > Rui Fan <1996fan...@gmail.com> 于2023年12月25日周一 > 19:11写道: > > > > > > > > > > > > > > > > > > > > > > > > > Hi Zakelly, > > > > > > > > > > > > > > > > > > > > > > > > > > Thank you for driving this proposal! > > > > > > > > > > > > > > > > > > > > > > > > > > Overall good for me. I have some questions about > > these > > > > names. > > > > > > > > > > > > > > > > > > > > > > > > > > 1. How about execution.checkpointing.storage.type > > > > instead of > > > > > > > > > > > > > execution.checkpointing.storage? > > > > > > > > > > > > > > > > > > > > > > > > > > It's similar to state.backend.type. > > > > > > > > > > > > > > > > > > > > > > > > > > 2. How about > > execution.checkpointing.local-copy.enabled > > > > instead > > > > > > > > of > > > > > > > > > > > > > execution.checkpointing.local-copy? > > > > > > > > > > > > > > > > > > > > > > > > > > You added a new option: > > > > execution.checkpointing.local-copy.dir. > > > > > > > > > > > > > IIUC, one option name shouldn't be the prefix of > > other > > > > options. > > > > > > > > > > > > > If you add a new option > > > > execution.checkpointing.local-copy, > > > > > > > > > > > > > flink CI will fail directly. > > > > > > > > > > > > > > > > > > > > > > > > > > 3. execution.checkpointing.savepoint.dir is a > little > > > > weird. > > > > > > > > > > > > > > > > > > > > > > > > > > For old options: state.savepoints.dir and > > > > > > > state.checkpoints.dir, > > > > > > > > > > > > > the savepoint and checkpoint are the same level. It > > > means > > > > > > > > > > > > > it's a checkpoint or savepoint. > > > > > > > > > > > > > > > > > > > > > > > > > > The new option execution.checkpointing.dir is fine > > for > > > > me. > > > > > > > > > > > > > However, execution.checkpointing.savepoint.dir is a > > > > little > > > > > > > weird. > > > > > > > > > > > > > I don't know which name is better now. Let us think > > > > about it > > > > > > > > more. > > > > > > > > > > > > > > > > > > > > > > > > > > 4. How about execution.recovery.claim-mode instead > of > > > > > > > > > > > > > execution.recovery.mode? > > > > > > > > > > > > > > > > > > > > > > > > > > The meaning of mode is too broad. The claim-mode > may > > > > > > > > > > > > > be more accurate for users. > > > > > > > > > > > > > > > > > > > > > > > > > > WDYT? > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > Rui > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, Dec 25, 2023 at 5:14 PM Zakelly Lan < > > > > > > > > zakelly....@gmail.com > > > > > > > > > > > > > > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi devs, > > > > > > > > > > > > > > > > > > > > > > > > > > > > I'd like to start a discussion on FLIP-406: > > > Reorganize > > > > State > > > > > > > & > > > > > > > > > > > > > > Checkpointing & Recovery Configuration[1]. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Currently, the configuration options pertaining > to > > > > > > > > checkpointing, > > > > > > > > > > > > > recovery, > > > > > > > > > > > > > > and state management are primarily grouped under > > the > > > > > > > following > > > > > > > > > > > > prefixes: > > > > > > > > > > > > > > > > > > > > > > > > > > > > - state.backend.* : configurations related to > > > state > > > > > > > > accessing > > > > > > > > > > and > > > > > > > > > > > > > > checkpointing, as well as specific options for > > > > individual > > > > > > > > > state > > > > > > > > > > > > > backends > > > > > > > > > > > > > > - execution.checkpointing.* : configurations > > > > associated > > > > > > > with > > > > > > > > > > > > > checkpoint > > > > > > > > > > > > > > execution and recovery > > > > > > > > > > > > > > - execution.savepoint.*: configurations for > > > > recovery from > > > > > > > > > > savepoint > > > > > > > > > > > > > > > > > > > > > > > > > > > > In addition, there are several individual options > > > such > > > > as ' > > > > > > > > > > > > > > *state.checkpoint-storage*' and > > > > '*state.checkpoints.dir*' > > > > > > > that > > > > > > > > > fall > > > > > > > > > > > > > outside > > > > > > > > > > > > > > of these prefixes. The current arrangement of > these > > > > options, > > > > > > > > > which > > > > > > > > > > span > > > > > > > > > > > > > > multiple modules, is somewhat haphazard and > lacks a > > > > > > > systematic > > > > > > > > > > > > structure. > > > > > > > > > > > > > > For example, the options under the > > > > '*CheckpointingOptions*' > > > > > > > > and ' > > > > > > > > > > > > > > *ExecutionCheckpointingOptions*' are related and > > have > > > > no > > > > > > > clear > > > > > > > > > > > > boundaries > > > > > > > > > > > > > > from the user's perspective, but there is no > > unified > > > > prefix > > > > > > > for > > > > > > > > > > them. > > > > > > > > > > > > > With > > > > > > > > > > > > > > the upcoming release of Flink 2.0, we have an > > > excellent > > > > > > > > > > opportunity to > > > > > > > > > > > > > > overhaul and restructure the configurations > related > > > to > > > > > > > > > > checkpointing, > > > > > > > > > > > > > > recovery, and state management. This FLIP > proposes > > to > > > > > > > > reorganize > > > > > > > > > > these > > > > > > > > > > > > > > settings, making it more coherent by module, > which > > > > would > > > > > > > > > > significantly > > > > > > > > > > > > > > lower the barriers for understanding and reduce > the > > > > > > > development > > > > > > > > > > costs > > > > > > > > > > > > > > moving forward. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Looking forward to hearing from you! > > > > > > > > > > > > > > > > > > > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=284789560 > > > > > > > > > > > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > > > > > > Zakelly > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >