Re: [DISCUSS] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

Zakelly Lan Fri, 07 Jun 2024 03:18:12 -0700

Hi Matthias,

Thanks for your reply!


> That's something that could be considered as another optimization. I would
> consider this as a possible follow-up. My concern here is that we'd make
> the rescaling configuration even more complicated by introducing yet
> another parameter.


I'd be fine with considering this as a follow-up.

> It might be worth renaming the internal interface into something that
> indicates its internal usage to avoid confusion.
>

Agree with this.

And another question:
I noticed that the existing options under 'jobmanager.adaptive-scheduler'
use the word 'scaling', e.g.
'jobmanager.adaptive-scheduler.scaling-interval.min', whereas this FLIP
uses 'rescale'. Would you mind unifying them?


Best,
Zakelly


On Thu, Jun 6, 2024 at 10:57 PM David Morávek <[email protected]> wrote:

> Thanks for the FLIP Matthias, I think it looks pretty solid!
>
> I also don't see a relation to unaligned checkpoints. From the AS
> perspective, the checkpoint time doesn't matter.
>
> > Is it possible for a change event observed right after a completed
> > checkpoint (or within a short time after a checkpoint) to trigger a
> > rescale immediately? Sometimes the checkpoint interval is huge and it is
> > better to rescale immediately.
> >
>
> I had considered this initially too, but it feels like a possible follow-up
> optimization.
>
> The primary objective of the proposed solution is to enhance overall
> predictability. With a longer checkpointing interval, the current situation
> worsens as we might have to reprocess a substantial backlog.
>
> I think in the future we might actually want to enhance this by triggering
> some kind of specialized "rescaling" checkpoint that prepares the cluster
> for rescaling (e.g. by replicating state to new slots / pre-splitting the
> db, ...), to make things faster.
>
> Best,
> D.
>
> On Wed, Jun 5, 2024 at 4:34 PM Matthias Pohl <[email protected]> wrote:
>
> > Hi Zakelly,
> > thanks for your reply. See my inlined responses below:
> >
> > On Wed, Jun 5, 2024 at 10:26 AM Zakelly Lan <[email protected]> wrote:
> >
> > > Hi Matthias,
> > >
> > > Thanks for your proposal! I have a few questions:
> > >
> > > 1. Is it possible for a change event observed right after a completed
> > > checkpoint (or within a short time after a checkpoint) to trigger a
> > > rescale immediately? Sometimes the checkpoint interval is huge and it
> > > is better to rescale immediately.
> > >
> >
> > That's something that could be considered as another optimization. I would
> > consider this as a possible follow-up. My concern here is that we'd make
> > the rescaling configuration even more complicated by introducing yet
> > another parameter.
> >
> >
> > > 2. Should we introduce `CheckpointLifecycleListener` instead of reusing
> > > `CheckpointListener`? Is `CheckpointListener` enough for this scenario?
> > >
> >
> > Good point, they serve similar purposes. But I'm hesitant to use
> > CheckpointListener (which is a public interface) for this internal, quite
> > narrowly scoped, runtime-specific use case of FLIP-461.
> >
> > It might be worth renaming the internal interface into something that
> > indicates its internal usage to avoid confusion.
> >
> >
> > > Best,
> > > Zakelly
> > >
> > > On Wed, Jun 5, 2024 at 3:02 PM Matthias Pohl <[email protected]> wrote:
> > >
> > > > Hi ConradJam,
> > > > thanks for your response.
> > > >
> > > > The CheckpointStatsTracker gets notified about the checkpoint completion
> > > > after the checkpoint is finalized, i.e. all its data is persisted and the
> > > > metadata is written to the CompletedCheckpointStore. At this moment, the
> > > > checkpoint is considered for restoring a job and, therefore, becomes
> > > > available for restarts. This workflow also applies to unaligned
> > > > checkpoints. But I see how this context might be helpful for understanding
> > > > the change. I will add it to the FLIP. So far, I don't see a reason
> > > > to disable the feature for unaligned checkpoints. Do you see other issues
> > > > that might make it necessary to disable this feature for this type of
> > > > checkpoint?
> > > >
> > > > Can you elaborate a bit more on what you mean by "checkpoints that do not
> > > > check it"? I do not fully understand what you are referring to with "it"
> > > > here.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > On Wed, Jun 5, 2024 at 4:46 AM ConradJam <[email protected]> wrote:
> > > >
> > > > > I have a few questions:
> > > > > Unaligned checkpoints: do we need to enable this feature for them? Should
> > > > > this feature be disabled for checkpoints that do not check it?
> > > > >
> > > > > On Tue, Jun 4, 2024 at 6:03 PM Matthias Pohl <[email protected]> wrote:
> > > > >
> > > > > > Hi everyone,
> > > > > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the synchronization
> > > > > > of rescaling and the completion of checkpoints. The idea is to reduce the
> > > > > > amount of data that needs to be processed after rescaling happened. A more
> > > > > > detailed motivation can be found in FLIP-461.
> > > > > >
> > > > > > I'm looking forward to feedback and suggestions.
> > > > > >
> > > > > > Best,
> > > > > > Matthias
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > > > >
> > > > >
> > > > > --
> > > > > Best
> > > > >
> > > > > ConradJam
> > > > >
> > > >
> > >
> >
>
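
The core idea discussed in this thread (a change event does not trigger a rescale right away; the rescale is deferred until the next checkpoint has been finalized) can be sketched roughly as follows. This is only an illustration under stated assumptions: `InternalCheckpointCompletionListener`, `RescaleOnCheckpointGate`, and their methods are hypothetical names made up for this sketch. They are not the interfaces proposed in FLIP-461 and not part of Flink's API, and the sketch ignores the case where checkpoints fail or never complete.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical internal callback; NOT the interface name used in FLIP-461 or Flink. */
interface InternalCheckpointCompletionListener {
    void onCompletedCheckpoint(long checkpointId);
}

/** Hypothetical helper that defers a requested rescale until a checkpoint has completed. */
class RescaleOnCheckpointGate implements InternalCheckpointCompletionListener {
    private final Runnable rescaleAction;
    private boolean changeEventPending = false;

    RescaleOnCheckpointGate(Runnable rescaleAction) {
        this.rescaleAction = rescaleAction;
    }

    /** Called when available resources or requirements changed; only records the request. */
    void onChangeEvent() {
        changeEventPending = true;
    }

    /** Called after a checkpoint is finalized; rescales only if a change is pending. */
    @Override
    public void onCompletedCheckpoint(long checkpointId) {
        if (changeEventPending) {
            changeEventPending = false;
            rescaleAction.run();
        }
    }
}

public class RescaleGateDemo {
    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        RescaleOnCheckpointGate gate = new RescaleOnCheckpointGate(() -> log.add("rescale"));

        gate.onCompletedCheckpoint(1L); // no pending change: nothing happens
        gate.onChangeEvent();           // change observed: rescale is deferred
        gate.onCompletedCheckpoint(2L); // checkpoint finalized: rescale fires once
        gate.onCompletedCheckpoint(3L); // request already handled: nothing happens

        System.out.println(log);        // prints [rescale]
    }
}
```

Because the rescale only fires in the completion callback, the restarted job can restore from the checkpoint that has just become available, so little data should need to be reprocessed. The follow-up idea raised in the thread (rescaling immediately when a change arrives shortly after a checkpoint) would need an extra condition and, as noted above, probably another configuration parameter.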
