Re: [DISCUSS] FLIP-461: Synchronize rescaling with checkpoint creation to minimize reprocessing

Matthias Pohl Wed, 05 Jun 2024 07:25:32 -0700

Thanks Rui for your reply. Find my answers inlined below:

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout


On Wed, Jun 5, 2024 at 10:16 AM Rui Fan <1996fan...@gmail.com> wrote:

> Thanks Matthias for driving this proposal!
>
> This proposal can reduce the amount of data that is processed repeatedly
> after rescaling, so this proposal makes sense to me.
>
> I have some questions:
> 1. The public change only includes the "New Configuration Parameters"
> part, right?
>

Correct. I updated the section to make this a bit clearer.


> 2. jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count is
> obviously a config option for users. But I'm not sure whether
> jobmanager.adaptive-scheduler.max-delay-for-rescale-trigger is a config
> option or an internal logic? I saw it's computed by
> rescale-on-failed-checkpoints-count.
>

That's a fair point. I wanted the user to be able to go back to the old
implementation even if checkpointing is enabled. One could argue that there
is no need for the parameter being expressed through a Duration. The only
motivation of delaying the rescaling might be waiting for consecutive
change events (which is similar to what we already have with
resource-stabilization-timeout that is utilized in the WaitingForResource
state [1]). Maybe, let's wait for other feedback here.


> 3. I'm not sure if the default value of rescale-on-failed-checkpoints-count
> should be 1 or is greater than 1 better?
>    If 1 as the default value, when the checkpoint fails occasionally, and
> rescale happens, flink job will process a series of repeated data as well.
>    If 2 as the default value, when the checkpoint fails occasionally, and
> the next checkpoint succeeds, the flink job won't process repeated data.
>

You're right. Using 2 as a default value sounds reasonable to work around
occasional "hiccups". My main motivation to set it to 1 was to be as close
as possible to the current (pre-FLIP-461) behavior where the rescale
happens immediately.

But I'm starting to lean towards following your proposal here. I won't update
the FLIP in this regard for now to see what others have to say.
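
To make the trade-off between a default of 1 and 2 concrete, here is a minimal
toy model. It is only a sketch under stated assumptions, not the FLIP's
implementation: the function name rescale_trigger and the boolean list of
checkpoint outcomes are hypothetical, and max_failed merely stands in for
rescale-on-failed-checkpoints-count. It assumes a rescale is already pending
and fires either on the next completed checkpoint or once the failure
threshold is reached.

```python
def rescale_trigger(checkpoint_results, max_failed):
    """Toy model of a pending rescale (hypothetical helper, not Flink API).

    checkpoint_results: iterable of booleans, True = checkpoint completed.
    max_failed: stands in for rescale-on-failed-checkpoints-count.
    Returns (index, reason) of the event that triggers the rescale, or None
    if the rescale is still pending after all observed checkpoints.
    """
    consecutive_failures = 0
    for index, succeeded in enumerate(checkpoint_results):
        if succeeded:
            # A fresh checkpoint exists, so rescaling now avoids reprocessing.
            return index, "checkpoint completed"
        consecutive_failures += 1
        if consecutive_failures >= max_failed:
            # Give up waiting and rescale from the last older checkpoint.
            return index, "failure threshold reached"
    return None


# One occasional failure followed by a successful checkpoint:
hiccup = [False, True]
print(rescale_trigger(hiccup, max_failed=1))
print(rescale_trigger(hiccup, max_failed=2))
```

In this model, a threshold of 1 rescales right at the failed checkpoint, while
a threshold of 2 rides out the single failure and rescales once the next
checkpoint completes.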

> 4. The description of rescale-on-failed-checkpoints-count is
>   "The number of subsequent failed checkpoints that will initiate
> rescaling."
>   IIUC, the "consecutive" is more accurate than subsequent here. WDYT?
>

Good idea. I will update the FLIP accordingly.


> 5. Proposed Changes part is specific implementation, I'm not sure whether
>    all internal interfaces are best for the current version. So I cannot
> give any suggestion or feedback for now. But I'm happy to review them when
> your PR is ready if I have time.
>   Feel free to cc me (I'm interested in Adaptive Scheduler)
>

Will do.


> 6. This proposal aims to improve one logic inside of Adaptive Scheduler.
>    Would you mind mentioning Adaptive Scheduler in the FLIP title? It will
>    be useful for users to understand which component this proposal belongs
> to.
>

Good point. I updated the FLIP's title.


>
> Also, I don't understand why this proposal needs to care about whether the
> checkpoint type is unaligned or aligned.
>
> Please correct me if anything is wrong, thanks.
>
> Best,
> Rui

> On Wed, Jun 5, 2024 at 3:01 PM Matthias Pohl <map...@apache.org> wrote:
>
> > Hi ConradJam,
> > thanks for your response.
> >
> > The CheckpointStatsTracker gets notified about the checkpoint completion
> > after the checkpoint is finalized, i.e. all its data is persisted and the
> > metadata is written to the CompletedCheckpointStore. At this moment, the
> > checkpoint is considered for restoring a job and, therefore, becomes
> > available for restarts. This workflow also applies to unaligned
> > checkpoints. But I see how this context might be helpful for understanding
> > the change. I will add it to the FLIP. So far, I don't see a reason
> > to disable the feature for unaligned checkpoints. Do you see other issues
> > that might make it necessary to disable this feature for this type of
> > checkpoints?
> >
> > Can you elaborate a bit more what you mean by "checkpoints that do not
> > check it"? I do not fully understand what you are referring to with "it"
> > here.
> >
> > Best,
> > Matthias
> >
> > On Wed, Jun 5, 2024 at 4:46 AM ConradJam <jam.gz...@gmail.com> wrote:
> >
> > > I have a few questions:
> > > Unaligned checkpoints Do we need to enable this feature? Whether this
> > > feature should be disabled for checkpoints that do not check it
> > >
> > > On Tue, Jun 4, 2024 at 18:03, Matthias Pohl <map...@apache.org> wrote:
> > >
> > > > Hi everyone,
> > > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the synchronization
> > > > of rescaling and the completion of checkpoints. The idea is to reduce the
> > > > amount of data that needs to be processed after rescaling happened. A more
> > > > detailed motivation can be found in FLIP-461.
> > > >
> > > > I'm looking forward to feedback and suggestions.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > > >
> > >
> > >
> > > --
> > > Best
> > >
> > > ConradJam
> > >
> >
>
