Hi Matthias, Thanks for your reply!

> That's something that could be considered as another optimization. I would
> consider this as a possible follow-up. My concern here is that we'd make
> the rescaling configuration even more complicated by introducing yet
> another parameter.

I'd be fine with considering this as a follow-up.

> It might be worth renaming the internal interface into something that
> indicates its internal usage to avoid confusion.

Agree with this.

And another question: I noticed that the existing options under
'jobmanager.adaptive-scheduler' use the word 'scaling', e.g.
'jobmanager.adaptive-scheduler.scaling-interval.min', while this FLIP uses
'rescale'. Would you mind unifying them?

Best,
Zakelly

On Thu, Jun 6, 2024 at 10:57 PM David Morávek <david.mora...@gmail.com> wrote:

> Thanks for the FLIP Matthias, I think it looks pretty solid!
>
> I also don't see a relation to unaligned checkpoints. From the AS
> perspective, the checkpoint time doesn't matter.
>
> > Is it possible a change event observed right after a complete checkpoint
> > (or within a specific short time after a checkpoint) that triggers a
> > rescale immediately? Sometimes the checkpoint interval is huge and it is
> > better to rescale immediately.
> >
>
> I had considered this initially too, but it feels like a possible follow-up
> optimization.
>
> The primary objective of the proposed solution is to enhance overall
> predictability. With a longer checkpointing interval, the current situation
> worsens as we might have to reprocess a substantial backlog.
>
> I think in the future we might actually want to enhance this by triggering
> some kind of specialized "rescaling" checkpoint that prepares the cluster
> for rescaling (eg. by replicating state to new slots / pre-splitting the
> db, ...), to make things faster.
>
> Best,
> D.
>
> On Wed, Jun 5, 2024 at 4:34 PM Matthias Pohl <map...@apache.org> wrote:
>
> > Hi Zakelly,
> > thanks for your reply.
> > See my inlined responses below:
> >
> > On Wed, Jun 5, 2024 at 10:26 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> >
> > > Hi Matthias,
> > >
> > > Thanks for your proposal! I have a few questions:
> > >
> > > 1. Is it possible a change event observed right after a complete
> > > checkpoint (or within a specific short time after a checkpoint) that
> > > triggers a rescale immediately? Sometimes the checkpoint interval is
> > > huge and it is better to rescale immediately.
> > >
> >
> > That's something that could be considered as another optimization. I
> > would consider this as a possible follow-up. My concern here is that
> > we'd make the rescaling configuration even more complicated by
> > introducing yet another parameter.
> >
> >
> > > 2. Should we introduce `CheckpointLifecycleListener` instead of reusing
> > > `CheckpointListener`? Is `CheckpointListener` enough for this scenario?
> > >
> >
> > Good point, they are serving similar purposes. But I'm hesitant to use
> > CheckpointListener (which is a public interface) for this internal quite
> > narrowly scoped runtime-specific use case of FLIP-461.
> >
> > It might be worth renaming the internal interface into something that
> > indicates its internal usage to avoid confusion.
> >
> >
> > > Best,
> > > Zakelly
> > >
> > > On Wed, Jun 5, 2024 at 3:02 PM Matthias Pohl <map...@apache.org> wrote:
> > >
> > > > Hi ConradJam,
> > > > thanks for your response.
> > > >
> > > > The CheckpointStatsTracker gets notified about the checkpoint
> > > > completion after the checkpoint is finalized, i.e. all its data is
> > > > persisted and the metadata is written to the
> > > > CompletedCheckpointStore. At this moment, the checkpoint is
> > > > considered for restoring a job and, therefore, becomes available for
> > > > restarts. This workflow also applies to unaligned checkpoints. But I
> > > > see how this context might be helpful for understanding the change.
> > > > I will add it to the FLIP.
> > > > So far, I don't see a reason to disable the feature for unaligned
> > > > checkpoints. Do you see other issues that might make it necessary to
> > > > disable this feature for this type of checkpoints?
> > > >
> > > > Can you elaborate a bit more what you mean by "checkpoints that do
> > > > not check it"? I do not fully understand what you are referring to
> > > > with "it" here.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > On Wed, Jun 5, 2024 at 4:46 AM ConradJam <jam.gz...@gmail.com> wrote:
> > > >
> > > > > I have a few questions:
> > > > > Unaligned checkpoints Do we need to enable this feature? Whether
> > > > > this feature should be disabled for checkpoints that do not check it
> > > > >
> > > > > On Tue, Jun 4, 2024 at 6:03 PM Matthias Pohl <map...@apache.org> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the
> > > > > > synchronization of rescaling and the completion of checkpoints.
> > > > > > The idea is to reduce the amount of data that needs to be
> > > > > > processed after rescaling happened. A more detailed motivation
> > > > > > can be found in FLIP-461.
> > > > > >
> > > > > > I'm looking forward to feedback and suggestions.
> > > > > >
> > > > > > Best,
> > > > > > Matthias
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > > > > >
> > > > >
> > > > > --
> > > > > Best
> > > > >
> > > > > ConradJam