Thanks for the FLIP, Matthias. I think it looks pretty solid! I also don't see a relation to unaligned checkpoints. From the AdaptiveScheduler (AS) perspective, the checkpoint time doesn't matter.

> Is it possible a change event observed right after a complete checkpoint
> (or within a specific short time after a checkpoint) that triggers a
> rescale immediately? Sometimes the checkpoint interval is huge and it is
> better to rescale immediately.

I had considered this initially too, but it feels like a possible follow-up
optimization. The primary objective of the proposed solution is to enhance
overall predictability. With a longer checkpointing interval, the current
situation worsens as we might have to reprocess a substantial backlog.

I think in the future we might actually want to enhance this by triggering
some kind of specialized "rescaling" checkpoint that prepares the cluster
for rescaling (e.g. by replicating state to new slots / pre-splitting the
db, ...), to make things faster.

Best,
D.

On Wed, Jun 5, 2024 at 4:34 PM Matthias Pohl <map...@apache.org> wrote:

> Hi Zakelly,
> thanks for your reply. See my inlined responses below:
>
> On Wed, Jun 5, 2024 at 10:26 AM Zakelly Lan <zakelly....@gmail.com> wrote:
>
> > Hi Matthias,
> >
> > Thanks for your proposal! I have a few questions:
> >
> > 1. Is it possible a change event observed right after a complete
> > checkpoint (or within a specific short time after a checkpoint) that
> > triggers a rescale immediately? Sometimes the checkpoint interval is
> > huge and it is better to rescale immediately.
> >
>
> That's something that could be considered as another optimization. I would
> consider this as a possible follow-up. My concern here is that we'd make
> the rescaling configuration even more complicated by introducing yet
> another parameter.
>
> > 2. Should we introduce `CheckpointLifecycleListener` instead of reusing
> > `CheckpointListener`? Is `CheckpointListener` enough for this scenario?
> >
>
> Good point, they are serving similar purposes.
> But I'm hesitant to use CheckpointListener (which is a public interface)
> for this internal quite narrowly scoped runtime-specific use case of
> FLIP-461.
>
> It might be worth renaming the internal interface into something that
> indicates its internal usage to avoid confusion.
>
> > Best,
> > Zakelly
> >
> > On Wed, Jun 5, 2024 at 3:02 PM Matthias Pohl <map...@apache.org> wrote:
> >
> > > Hi ConradJam,
> > > thanks for your response.
> > >
> > > The CheckpointStatsTracker gets notified about the checkpoint
> > > completion after the checkpoint is finalized, i.e. all its data is
> > > persisted and the metadata is written to the CompletedCheckpointStore.
> > > At this moment, the checkpoint is considered for restoring a job and,
> > > therefore, becomes available for restarts. This workflow also applies
> > > to unaligned checkpoints. But I see how this context might be helpful
> > > for understanding the change. I will add it to the FLIP. So far, I
> > > don't see a reason to disable the feature for unaligned checkpoints.
> > > Do you see other issues that might make it necessary to disable this
> > > feature for this type of checkpoints?
> > >
> > > Can you elaborate a bit more what you mean by "checkpoints that do not
> > > check it"? I do not fully understand what you are referring to with
> > > "it" here.
> > >
> > > Best,
> > > Matthias
> > >
> > > On Wed, Jun 5, 2024 at 4:46 AM ConradJam <jam.gz...@gmail.com> wrote:
> > >
> > > > I have a few questions:
> > > > Unaligned checkpoints Do we need to enable this feature? Whether this
> > > > feature should be disabled for checkpoints that do not check it
> > > >
> > > > On Tue, Jun 4, 2024 at 6:03 PM Matthias Pohl <map...@apache.org> wrote:
> > > >
> > > > > Hi everyone,
> > > > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the
> > > > > synchronization of rescaling and the completion of checkpoints.
> > > > > The idea is to reduce the amount of data that needs to be processed
> > > > > after rescaling happened. A more detailed motivation can be found
> > > > > in FLIP-461.
> > > > >
> > > > > I'm looking forward to feedback and suggestions.
> > > > >
> > > > > Best,
> > > > > Matthias
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > > >
> > > > --
> > > > Best
> > > >
> > > > ConradJam
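
For readers following the thread, the core idea (hold a rescale request until the next checkpoint completes, so that little data has to be reprocessed afterwards) can be sketched roughly as below. This is only an illustration in Python, not Flink code: the class `CheckpointAlignedRescaler`, its methods, and the `rescale` callback are hypothetical names made up for this sketch, and it assumes that only the most recent requested parallelism matters.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckpointAlignedRescaler:
    """Hypothetical helper that defers a rescale until a checkpoint completes."""

    # Callback that actually performs the rescale (assumed to exist elsewhere).
    rescale: Callable[[int], None]
    # Latest requested parallelism that has not been applied yet, if any.
    pending_parallelism: Optional[int] = None

    def on_resource_change(self, desired_parallelism: int) -> None:
        # Do not rescale right away; only remember the most recent request.
        self.pending_parallelism = desired_parallelism

    def on_checkpoint_completed(self, checkpoint_id: int) -> None:
        # A completed checkpoint is the cheapest point to restart from,
        # so apply the pending request now (if there is one).
        if self.pending_parallelism is None:
            return
        target, self.pending_parallelism = self.pending_parallelism, None
        self.rescale(target)
```

With this shape, several change events arriving between two checkpoints collapse into a single rescale. The "rescale immediately if a checkpoint has just completed" optimization raised in the thread would need an extra time threshold, which is the additional configuration parameter the replies are wary of.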