Thanks, Matthias, for driving this proposal! It can reduce the amount of data that is reprocessed after rescaling, so it makes sense to me.

I have some questions:

1. The public change only includes the "New Configuration Parameters" part, right?

2. jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count is obviously a config option for users. But I'm not sure whether jobmanager.adaptive-scheduler.max-delay-for-rescale-trigger is a config option or internal logic. I saw that it's derived from rescale-on-failed-checkpoints-count.

3. I'm not sure whether the default value of rescale-on-failed-checkpoints-count should be 1, or whether a value greater than 1 would be better. With 1 as the default, when a checkpoint fails occasionally and a rescale happens, the Flink job will still reprocess a series of data. With 2 as the default, when a checkpoint fails occasionally and the next checkpoint succeeds, the Flink job won't reprocess data.

4. The description of rescale-on-failed-checkpoints-count is "The number of subsequent failed checkpoints that will initiate rescaling." IIUC, "consecutive" is more accurate than "subsequent" here. WDYT?

5. The Proposed Changes part describes a specific implementation. I'm not sure whether all internal interfaces are the best fit for the current version, so I cannot give any suggestions or feedback for now. But I'm happy to review them when your PR is ready, if I have time. Feel free to cc me (I'm interested in the Adaptive Scheduler).

6. This proposal aims to improve one piece of logic inside the Adaptive Scheduler. Would you mind mentioning Adaptive Scheduler in the FLIP title? It would help users understand which component this proposal belongs to.

Also, I don't understand why this proposal needs to care whether the checkpoint is aligned or unaligned.

Please correct me if anything is wrong, thanks.

Best,
Rui

On Wed, Jun 5, 2024 at 3:01 PM Matthias Pohl <map...@apache.org> wrote:

> Hi ConradJam,
> thanks for your response.
>
> The CheckpointStatsTracker gets notified about the checkpoint completion
> after the checkpoint is finalized, i.e. all its data is persisted and the
> metadata is written to the CompletedCheckpointStore. At this moment, the
> checkpoint is considered for restoring a job and, therefore, becomes
> available for restarts. This workflow also applies to unaligned
> checkpoints. But I see how this context might be helpful for understanding
> the change. I will add it to the FLIP. So far, I don't see a reason
> to disable the feature for unaligned checkpoints. Do you see other issues
> that might make it necessary to disable this feature for this type of
> checkpoints?
>
> Can you elaborate a bit more what you mean by "checkpoints that do not
> check it"? I do not fully understand what you are referring to with "it"
> here.
>
> Best,
> Matthias
>
> On Wed, Jun 5, 2024 at 4:46 AM ConradJam <jam.gz...@gmail.com> wrote:
>
> > I have a few questions:
> > Unaligned checkpoints Do we need to enable this feature? Whether this
> > feature should be disabled for checkpoints that do not check it
> >
> > Matthias Pohl <map...@apache.org> wrote on Tue, Jun 4, 2024 at 18:03:
> >
> > > Hi everyone,
> > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the synchronization of
> > > rescaling and the completion of checkpoints. The idea is to reduce the
> > > amount of data that needs to be processed after rescaling happened. A more
> > > detailed motivation can be found in FLIP-461.
> > >
> > > I'm looking forward to feedback and suggestions.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> >
> > --
> > Best
> >
> > ConradJam
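
To make question 3 above concrete, here is a minimal Python sketch of how the threshold might play out. It is only a simulation of my reading of the thread, not Flink code: the function name `first_rescale_trigger`, the reason strings, and the assumption that a rescale request is already pending before the first checkpoint are all hypothetical. It assumes a completed checkpoint triggers the pending rescale right away, and that otherwise the rescale fires once the number of consecutive failed checkpoints reaches the threshold.

```python
from typing import Optional, Sequence, Tuple


def first_rescale_trigger(
    outcomes: Sequence[bool], failed_checkpoints_threshold: int
) -> Optional[Tuple[int, str]]:
    """Return (checkpoint index, reason) for the first rescale, or None.

    outcomes[i] is True if checkpoint i completed and False if it failed.
    Assumption (hypothetical): a rescale request is already pending
    before checkpoint 0.
    """
    if failed_checkpoints_threshold < 1:
        raise ValueError("threshold must be at least 1")
    consecutive_failures = 0
    for index, completed in enumerate(outcomes):
        if completed:
            # Rescaling right after a completed checkpoint keeps the
            # amount of reprocessed data small.
            return index, "checkpoint-completed"
        consecutive_failures += 1
        if consecutive_failures >= failed_checkpoints_threshold:
            # Too many failures in a row: rescale anyway and accept that
            # data since the last completed checkpoint is reprocessed.
            return index, "failed-checkpoints-threshold"
    return None


# One occasional failure followed by a successful checkpoint.
occasional_failure = [False, True]
print(first_rescale_trigger(occasional_failure, 1))
print(first_rescale_trigger(occasional_failure, 2))
```

Under these assumptions, a threshold of 1 rescales on the single failed checkpoint (with reprocessing), while a threshold of 2 waits and rescales on the next completed checkpoint.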