Hi Zakelly, good point. I updated the FLIP to use "scale-on-failed-checkpoints-count" and "max-delay-for-scale-trigger".
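
For anyone skimming the thread, here is a rough sketch of how the two options could interact. It is plain Python for illustration, not Flink code, and the class and method names are made up. It assumes that "scale-on-failed-checkpoints-count" is the number of consecutive failed checkpoints after which a pending rescale goes ahead anyway, and that "max-delay-for-scale-trigger" is an upper bound on how long a pending rescale waits for a checkpoint. The FLIP remains the reference for the actual semantics.

```python
from typing import Optional


class RescaleTriggerSketch:
    """Hypothetical helper: decides when a pending rescale may go ahead."""

    def __init__(self, failed_checkpoints_count: int, max_delay_s: float) -> None:
        # Assumed meaning of "scale-on-failed-checkpoints-count".
        self.failed_checkpoints_count = failed_checkpoints_count
        # Assumed meaning of "max-delay-for-scale-trigger", in seconds.
        self.max_delay_s = max_delay_s
        self._pending_since: Optional[float] = None
        self._failed_in_a_row = 0

    def on_change_event(self, now: float) -> None:
        """A resource or parallelism change was observed; remember it."""
        if self._pending_since is None:
            self._pending_since = now
            self._failed_in_a_row = 0

    def on_checkpoint_completed(self, now: float) -> bool:
        """A completed checkpoint is the preferred moment to rescale."""
        self._failed_in_a_row = 0
        return self._fire_if_pending()

    def on_checkpoint_failed(self, now: float) -> bool:
        """Give up waiting after enough consecutive failures."""
        if self._pending_since is None:
            return False
        self._failed_in_a_row += 1
        if self._failed_in_a_row >= self.failed_checkpoints_count:
            return self._fire_if_pending()
        return False

    def on_timer(self, now: float) -> bool:
        """Give up waiting once the maximum delay has elapsed."""
        if self._pending_since is None:
            return False
        if now - self._pending_since >= self.max_delay_s:
            return self._fire_if_pending()
        return False

    def _fire_if_pending(self) -> bool:
        # Returns True exactly once per pending change.
        if self._pending_since is None:
            return False
        self._pending_since = None
        self._failed_in_a_row = 0
        return True
```

In this sketch a change event never rescales on its own. It only marks a rescale as pending, and one of the three callbacks later returns True to signal that the rescale should happen.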

On Fri, Jun 7, 2024 at 12:18 PM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Matthias,
>
> Thanks for your reply!
>
> > That's something that could be considered as another optimization. I
> > would consider this as a possible follow-up. My concern here is that
> > we'd make the rescaling configuration even more complicated by
> > introducing yet another parameter.
>
> I'd be fine with considering this as a follow-up.
>
> > It might be worth renaming the internal interface into something that
> > indicates its internal usage to avoid confusion.
>
> Agree with this.
>
> And another question:
> I noticed the existing options under 'jobmanager.adaptive-scheduler' are
> using the word 'scaling', e.g.
> 'jobmanager.adaptive-scheduler.scaling-interval.min'. While in this FLIP
> you choose 'rescale'. Would you mind unifying them?
>
> Best,
> Zakelly
>
> On Thu, Jun 6, 2024 at 10:57 PM David Morávek <david.mora...@gmail.com>
> wrote:
>
> > Thanks for the FLIP Matthias, I think it looks pretty solid!
> >
> > I also don't see a relation to unaligned checkpoints. From the AS
> > perspective, the checkpoint time doesn't matter.
> >
> > > Is it possible a change event observed right after a complete
> > > checkpoint (or within a specific short time after a checkpoint) that
> > > triggers a rescale immediately? Sometimes the checkpoint interval is
> > > huge and it is better to rescale immediately.
> >
> > I had considered this initially too, but it feels like a possible
> > follow-up optimization.
> >
> > The primary objective of the proposed solution is to enhance overall
> > predictability. With a longer checkpointing interval, the current
> > situation worsens as we might have to reprocess a substantial backlog.
> >
> > I think in the future we might actually want to enhance this by
> > triggering some kind of specialized "rescaling" checkpoint that prepares
> > the cluster for rescaling (eg. by replicating state to new slots /
> > pre-splitting the db, ...), to make things faster.
> >
> > Best,
> > D.
> >
> > On Wed, Jun 5, 2024 at 4:34 PM Matthias Pohl <map...@apache.org> wrote:
> >
> > > Hi Zakelly,
> > > thanks for your reply. See my inlined responses below:
> > >
> > > On Wed, Jun 5, 2024 at 10:26 AM Zakelly Lan <zakelly....@gmail.com>
> > > wrote:
> > >
> > > > Hi Matthias,
> > > >
> > > > Thanks for your proposal! I have a few questions:
> > > >
> > > > 1. Is it possible a change event observed right after a complete
> > > > checkpoint (or within a specific short time after a checkpoint)
> > > > that triggers a rescale immediately? Sometimes the checkpoint
> > > > interval is huge and it is better to rescale immediately.
> > >
> > > That's something that could be considered as another optimization. I
> > > would consider this as a possible follow-up. My concern here is that
> > > we'd make the rescaling configuration even more complicated by
> > > introducing yet another parameter.
> > >
> > > > 2. Should we introduce `CheckpointLifecycleListener` instead of
> > > > reusing `CheckpointListener`? Is `CheckpointListener` enough for
> > > > this scenario?
> > >
> > > Good point, they are serving similar purposes. But I'm hesitant to use
> > > CheckpointListener (which is a public interface) for this internal
> > > quite narrowly scoped runtime-specific use case of FLIP-461.
> > >
> > > It might be worth renaming the internal interface into something that
> > > indicates its internal usage to avoid confusion.
> > >
> > > > Best,
> > > > Zakelly
> > > >
> > > > On Wed, Jun 5, 2024 at 3:02 PM Matthias Pohl <map...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi ConradJam,
> > > > > thanks for your response.
> > > > >
> > > > > The CheckpointStatsTracker gets notified about the checkpoint
> > > > > completion after the checkpoint is finalized, i.e. all its data
> > > > > is persisted and the metadata is written to the
> > > > > CompletedCheckpointStore. At this moment, the checkpoint is
> > > > > considered for restoring a job and, therefore, becomes available
> > > > > for restarts. This workflow also applies to unaligned
> > > > > checkpoints. But I see how this context might be helpful for
> > > > > understanding the change. I will add it to the FLIP. So far, I
> > > > > don't see a reason to disable the feature for unaligned
> > > > > checkpoints. Do you see other issues that might make it
> > > > > necessary to disable this feature for this type of checkpoints?
> > > > >
> > > > > Can you elaborate a bit more what you mean by "checkpoints that
> > > > > do not check it"? I do not fully understand what you are
> > > > > referring to with "it" here.
> > > > >
> > > > > Best,
> > > > > Matthias
> > > > >
> > > > > On Wed, Jun 5, 2024 at 4:46 AM ConradJam <jam.gz...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I have a few questions:
> > > > > > Unaligned checkpoints Do we need to enable this feature? Whether
> > > > > > this feature should be disabled for checkpoints that do not
> > > > > > check it
> > > > > >
> > > > > > Matthias Pohl <map...@apache.org> wrote on Tue, Jun 4, 2024 at
> > > > > > 18:03:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the
> > > > > > > synchronization of rescaling and the completion of
> > > > > > > checkpoints. The idea is to reduce the amount of data that
> > > > > > > needs to be processed after rescaling happened. A more
> > > > > > > detailed motivation can be found in FLIP-461.
> > > > > > >
> > > > > > > I'm looking forward to feedback and suggestions.
> > > > > > >
> > > > > > > Best,
> > > > > > > Matthias
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > > > > >
> > > > > > --
> > > > > > Best
> > > > > >
> > > > > > ConradJam