Thanks Matthias for driving this proposal!

This proposal reduces the amount of data that is reprocessed after
rescaling, so it makes sense to me.

I have some questions:
1. The public changes only include the "New Configuration Parameters" part,
   right?
2. jobmanager.adaptive-scheduler.rescale-on-failed-checkpoints-count is
   obviously a config option for users. But I'm not sure whether
   jobmanager.adaptive-scheduler.max-delay-for-rescale-trigger is a config
   option or internal logic. I saw that it's computed from
   rescale-on-failed-checkpoints-count.
3. I'm not sure whether the default value of
   rescale-on-failed-checkpoints-count should be 1, or whether a value
   greater than 1 would be better.
   With 1 as the default, when a checkpoint fails occasionally and a
   rescale happens, the Flink job will still reprocess a stretch of data.
   With 2 as the default, when a checkpoint fails occasionally and the next
   checkpoint succeeds, the Flink job won't reprocess data.
4. The description of rescale-on-failed-checkpoints-count is
   "The number of subsequent failed checkpoints that will initiate
   rescaling."
   IIUC, "consecutive" is more accurate than "subsequent" here. WDYT?
5. The "Proposed Changes" part describes a specific implementation. I'm not
   sure whether all of the internal interfaces are the best fit for the
   current version, so I cannot give any suggestions or feedback for now.
   But I'm happy to review them when your PR is ready, if I have time.
   Feel free to cc me (I'm interested in the Adaptive Scheduler).
6. This proposal aims to improve one piece of logic inside the Adaptive
   Scheduler. Would you mind mentioning the Adaptive Scheduler in the FLIP
   title? It would help users understand which component this proposal
   belongs to.
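
To make question 3 concrete, here is a rough sketch of how I picture the
trigger logic. It is only my mental model, not code from the FLIP or from
Flink: the class, enum, and method names are made up, and I'm assuming that a
pending rescale fires either on the first completed checkpoint or once the
number of consecutive failed checkpoints reaches
rescale-on-failed-checkpoints-count.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of the rescale trigger; not part of Flink or FLIP-461. */
public class RescaleTriggerSketch {

    /** What caused the pending rescale to fire (names are assumptions). */
    public enum Trigger { COMPLETED_CHECKPOINT, FAILED_CHECKPOINT_LIMIT, NOT_YET }

    /**
     * Walks through checkpoint outcomes that arrive while a rescale is pending.
     *
     * @param checkpointOutcomes true = checkpoint completed, false = checkpoint failed
     * @param failedCheckpointsCount assumed meaning of rescale-on-failed-checkpoints-count
     */
    public static Trigger firstTrigger(List<Boolean> checkpointOutcomes, int failedCheckpointsCount) {
        // Counts failures that are not interrupted by a completed checkpoint,
        // i.e. the "consecutive" reading from question 4.
        int consecutiveFailures = 0;
        for (boolean completed : checkpointOutcomes) {
            if (completed) {
                // Rescaling right after a completed checkpoint avoids reprocessing.
                return Trigger.COMPLETED_CHECKPOINT;
            }
            consecutiveFailures++;
            if (consecutiveFailures >= failedCheckpointsCount) {
                // Rescaling now restores from an older checkpoint, so data is reprocessed.
                return Trigger.FAILED_CHECKPOINT_LIMIT;
            }
        }
        return Trigger.NOT_YET;
    }

    public static void main(String[] args) {
        // One occasional failure followed by a successful checkpoint.
        List<Boolean> occasionalFailure = Arrays.asList(false, true);
        System.out.println("count=1: " + firstTrigger(occasionalFailure, 1));
        System.out.println("count=2: " + firstTrigger(occasionalFailure, 2));
    }
}
```

Under these assumptions, a single occasional failure already forces the
rescale when the count is 1, while a count of 2 lets the job wait for the next
successful checkpoint.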

Also, I don't understand why this proposal needs to care whether the
checkpoint is an unaligned or an aligned checkpoint.

Please correct me if anything is wrong, thanks.

Best,
Rui

On Wed, Jun 5, 2024 at 3:01 PM Matthias Pohl <map...@apache.org> wrote:

> Hi ConradJam,
> thanks for your response.
>
> The CheckpointStatsTracker gets notified about the checkpoint completion
> after the checkpoint is finalized, i.e. all its data is persisted and the
> metadata is written to the CompletedCheckpointStore. At this moment, the
> checkpoint is considered for restoring a job and, therefore, becomes
> available for restarts. This workflow also applies to unaligned
> checkpoints. But I see how this context might be helpful for understanding
> the change. I will add it to the FLIP. So far, I don't see a reason
> to disable the feature for unaligned checkpoints. Do you see other issues
> that might make it necessary to disable this feature for this type of
> checkpoints?
>
> Can you elaborate a bit more on what you mean by "checkpoints that do not
> check it"? I do not fully understand what you are referring to with "it"
> here.
>
> Best,
> Matthias
>
> On Wed, Jun 5, 2024 at 4:46 AM ConradJam <jam.gz...@gmail.com> wrote:
>
> > I have a few questions:
> > Unaligned checkpoints: do we need to enable this feature for them? Should
> > this feature be disabled for checkpoints that do not check it?
> >
> > Matthias Pohl <map...@apache.org> wrote on Tue, Jun 4, 2024 at 18:03:
> >
> > > Hi everyone,
> > > I'd like to discuss FLIP-461 [1]. The FLIP proposes the synchronization
> > > of rescaling and the completion of checkpoints. The idea is to reduce
> > > the amount of data that needs to be processed after rescaling happened.
> > > A more detailed motivation can be found in FLIP-461.
> > >
> > > I'm looking forward to feedback and suggestions.
> > >
> > > Best,
> > > Matthias
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing
> > >
> >
> >
> > --
> > Best
> >
> > ConradJam
> >
>
