Re: [DISCUSS] FLIP-461: FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

Matthias Pohl Fri, 07 Jun 2024 09:42:29 -0700

Hi Zakelly,
good point. I updated the FLIP to use "scale-on-failed-checkpoints-count"
and "max-delay-for-scale-trigger".


On Fri, Jun 7, 2024 at 12:18 PM Zakelly Lan <zakelly....@gmail.com> wrote:

> Hi Matthias,
>
> Thanks for your reply!
>
> > That's something that could be considered as another optimization. I would
> > consider this as a possible follow-up. My concern here is that we'd make
> > the rescaling configuration even more complicated by introducing yet
> > another parameter.
>
>
> I'd be fine with considering this as a follow-up.
>
> > It might be worth renaming the internal interface into something that
> > indicates its internal usage to avoid confusion.
> >
>
> Agree with this.
>
> And another question:
> I noticed the existing options under 'jobmanager.adaptive-scheduler' are
> using the word 'scaling', e.g.
> 'jobmanager.adaptive-scheduler.scaling-interval.min', while this FLIP
> uses 'rescale'. Would you mind unifying them?
>
>
> Best,
> Zakelly
>
>
> On Thu, Jun 6, 2024 at 10:57 PM David Morávek <david.mora...@gmail.com> wrote:
>
> > Thanks for the FLIP Matthias, I think it looks pretty solid!
> >
> > I also don't see a relation to unaligned checkpoints. From the AS
> > perspective, the checkpoint time doesn't matter.
> >
> > > Is it possible for a change event observed right after a completed
> > > checkpoint (or within a specific short time after a checkpoint) to
> > > trigger a rescale immediately? Sometimes the checkpoint interval is
> > > huge and it is better to rescale immediately.
> > >
> >
> > I had considered this initially too, but it feels like a possible
> > follow-up optimization.
> >
> > The primary objective of the proposed solution is to enhance overall
> > predictability. With a longer checkpointing interval, the current
> > situation worsens as we might have to reprocess a substantial backlog.
> >
> > I think in the future we might actually want to enhance this by
> > triggering some kind of specialized "rescaling" checkpoint that prepares
> > the cluster for rescaling (e.g. by replicating state to new slots /
> > pre-splitting the db, ...), to make things faster.
> >
> > Best,
> > D.
> >
> > On Wed, Jun 5, 2024 at 4:34 PM Matthias Pohl <map...@apache.org> wrote:
> >
> > > Hi Zakelly,
> > > thanks for your reply. See my inlined responses below:
> > >
> > > > On Wed, Jun 5, 2024 at 10:26 AM Zakelly Lan <zakelly....@gmail.com> wrote:
> > >
> > > > Hi Matthias,
> > > >
> > > > Thanks for your proposal! I have a few questions:
> > > >
> > > > > 1. Is it possible for a change event observed right after a completed
> > > > > checkpoint (or within a specific short time after a checkpoint) to
> > > > > trigger a rescale immediately? Sometimes the checkpoint interval is
> > > > > huge and it is better to rescale immediately.
> > > >
> > >
> > > > That's something that could be considered as another optimization. I
> > > > would consider this as a possible follow-up. My concern here is that
> > > > we'd make the rescaling configuration even more complicated by
> > > > introducing yet another parameter.
> > >
> > >
> > > > > 2. Should we introduce `CheckpointLifecycleListener` instead of
> > > > > reusing `CheckpointListener`? Is `CheckpointListener` enough for this
> > > > > scenario?
> > > >
> > >
> > > > Good point, they are serving similar purposes. But I'm hesitant to use
> > > > CheckpointListener (which is a public interface) for this internal,
> > > > quite narrowly scoped, runtime-specific use case of FLIP-461.
> > >
> > > It might be worth renaming the internal interface into something that
> > > indicates its internal usage to avoid confusion.
> > >
> > >
> > > > Best,
> > > > Zakelly
> > > >
> > > > > On Wed, Jun 5, 2024 at 3:02 PM Matthias Pohl <map...@apache.org> wrote:
> > > >
> > > > > Hi ConradJam,
> > > > > thanks for your response.
> > > > >
> > > > > > The CheckpointStatsTracker gets notified about the checkpoint
> > > > > > completion after the checkpoint is finalized, i.e. all its data is
> > > > > > persisted and the metadata is written to the CompletedCheckpointStore.
> > > > > > At this moment, the checkpoint is considered for restoring a job and,
> > > > > > therefore, becomes available for restarts. This workflow also applies
> > > > > > to unaligned checkpoints. But I see how this context might be helpful
> > > > > > for understanding the change. I will add it to the FLIP. So far, I
> > > > > > don't see a reason to disable the feature for unaligned checkpoints.
> > > > > > Do you see other issues that might make it necessary to disable this
> > > > > > feature for this type of checkpoint?
> > > > >
> > > > > > Can you elaborate a bit more on what you mean by "checkpoints that do
> > > > > > not check it"? I do not fully understand what you are referring to
> > > > > > with "it" here.
> > > > >
> > > > > Best,
> > > > > Matthias
> > > > >
> > > > > > On Wed, Jun 5, 2024 at 4:46 AM ConradJam <jam.gz...@gmail.com> wrote:
> > > > >
> > > > > > I have a few questions:
> > > > > > > For unaligned checkpoints, do we need to enable this feature? Should
> > > > > > > this feature be disabled for checkpoints that do not check it?
> > > > > >
> > > > > > > On Tue, Jun 4, 2024 at 6:03 PM Matthias Pohl <map...@apache.org> wrote:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the
> > > > > > > > synchronization of rescaling and the completion of checkpoints. The
> > > > > > > > idea is to reduce the amount of data that needs to be processed
> > > > > > > > after rescaling happened. A more detailed motivation can be found in
> > > > > > > > FLIP-461.
> > > > > > >
> > > > > > > I'm looking forward to feedback and suggestions.
> > > > > > >
> > > > > > > Best,
> > > > > > > Matthias
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best
> > > > > >
> > > > > > ConradJam
> > > > > >
> > > > >
> > > >
> > >
> >
>
