Re: [DISCUSS] FLIP-XXX: Aligning timeout logic in the AdaptiveScheduler's WaitingForResources and Executing states

Zdenek Tison Mon, 05 Aug 2024 00:29:00 -0700

Thanks everyone for your feedback. Since there were no open questions, we
have started voting [1].


Best,
Zdenek

[1] https://lists.apache.org/thread/3yod1cf06qxf5jny0hrbcxp33tlp7vs6

On Mon, Aug 5, 2024 at 9:03 AM Matthias Pohl <[email protected]>
wrote:

> Ok, I went ahead and updated the FLIP accordingly [1].
>
> [1]
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=315494203&selectedPageVersions=7&selectedPageVersions=5
>
> On Mon, Aug 5, 2024 at 8:56 AM Zdenek Tison <[email protected]>
> wrote:
>
> > Hi Rui,
> >
> > Yes, it's right.
> > We will update the FLIP's parameters table to make it more explicit.
> >
> > Thanks
> >
> > Best,
> > Zdenek
> >
> > On Mon, Aug 5, 2024 at 5:33 AM Rui Fan <[email protected]> wrote:
> >
> > > > Thanks Zdenek Tison and Matthias for driving this proposal!
> > > It's indeed a great improvement for Adaptive Scheduler.
> > >
> > > Sorry for the late reply, overall LGTM, I have one minor comment:
> > >
> > > > These 2 configuration options were introduced in 2.0, which has not
> > > > been released to users yet.
> > > > So we can update them directly and don't need to consider them as
> > > > fallback options, right?
> > >
> > > - jobmanager.adaptive-scheduler.scale-on-failed-checkpoints-count
> > > - jobmanager.adaptive-scheduler.max-delay-for-scale-trigger
> > >
> > > Best,
> > > Rui
> > >
> > > > On Sat, Aug 3, 2024 at 12:20 AM Matthias Pohl <[email protected]>
> > > > wrote:
> > >
> > > > > Thanks Zdenek for addressing the comments. I copied the draft into the
> > > > > FLIP collection under FLIP-472 [1].
> > > > > Looks like there are no additional comments. Feel free to open a voting
> > > > > thread on this proposal.
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-472%3A+Aligning+timeout+logic+in+the+AdaptiveScheduler%27s+WaitingForResources+and+Executing+states
> > > >
> > > > > On Tue, Jul 30, 2024 at 10:48 AM Zdenek Tison <[email protected]>
> > > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > > If there are no further comments, I would propose starting a vote on
> > > > > > these changes. But first, I would like to ask a committer to migrate the
> > > > > > draft to a FLIP in the Flink Wiki.
> > > > >
> > > > > Thanks a lot.
> > > > >
> > > > > Kind Regards,
> > > > >
> > > > > Zdenek
> > > > >
> > > > > > On Tue, Jul 30, 2024 at 10:36 AM Zdenek Tison <[email protected]>
> > > > > > wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > > Based on the discussion, I added a new configuration:
> > > > > > > *jobmanager.adaptive-scheduler.executing.resource-stabilization-timeout*.
> > > > > > We considered the following options for the default value:
> > > > > >
> > > > > > >    1. Use a separate default value, e.g., 60s.
> > > > > > >    2. Fall back to
> > > > > > >    *jobmanager.adaptive-scheduler.resource-stabilization-timeout*.
> > > > > > >    3. Use the value from
> > > > > > >    *jobmanager.adaptive-scheduler.scaling-interval.max*.
> > > > > > >    4. Use a large number like Duration.ofMillis(Long.MAX_VALUE).
> > > > > >
> > > > > > > We decided against option 2 because, as discussed on the mailing
> > > > > > > list, the value can be too low. Option 3 was also ruled out since it
> > > > > > > can be too high or unset, and *scaling-interval.max* serves a
> > > > > > > different use case (it works well with *parallelism-increase*).
> > > > > > > Option 4 was not chosen because it would affect existing jobs after
> > > > > > > migration: after migrating to the new Flink version, rescaling would
> > > > > > > only happen once the desired resources were available, whereas before
> > > > > > > migration rescaling happened with every resource change.
> > > > > >
> > > > > > Therefore, I prefer a new default value: 60s.
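
A minimal sketch of how option 1 (separate 60s default) differs from option 2 (fallback to the existing key). The two key names come from the thread; the dict-based config, the function names, and the 10s value used as the existing key's default are assumptions made for illustration and are not Flink APIs.

```python
from datetime import timedelta

# Key names are taken from the thread; the plain dict is a stand-in for
# Flink's configuration and is used here only for illustration.
EXECUTING_KEY = "jobmanager.adaptive-scheduler.executing.resource-stabilization-timeout"
EXISTING_KEY = "jobmanager.adaptive-scheduler.resource-stabilization-timeout"

PROPOSED_DEFAULT = timedelta(seconds=60)          # option 1, as proposed above
ASSUMED_EXISTING_DEFAULT = timedelta(seconds=10)  # assumption: the 10s discussed in this thread


def option_1_separate_default(config: dict) -> timedelta:
    """Use the explicit Executing-state value, otherwise the fixed 60s default."""
    return config.get(EXECUTING_KEY, PROPOSED_DEFAULT)


def option_2_fallback(config: dict) -> timedelta:
    """Rejected variant: fall back to the existing stabilization timeout."""
    if EXECUTING_KEY in config:
        return config[EXECUTING_KEY]
    return config.get(EXISTING_KEY, ASSUMED_EXISTING_DEFAULT)


# A deployment tuned for fast startup sets only the existing key to 1 second.
fast_startup = {EXISTING_KEY: timedelta(seconds=1)}
print(option_1_separate_default(fast_startup))  # Executing state keeps 60s
print(option_2_fallback(fast_startup))          # Executing state inherits 1s
```

With the fallback, a value chosen for fast job startup silently becomes the rescale stabilization period as well, which is the "too low" concern raised above.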
> > > > > >
> > > > > >
> > > > > > > Additionally, we reviewed the current set of parameters and think
> > > > > > > there is a chance to align the parameters with the functionality in
> > > > > > > the 2.0 release. So, we propose to have these parameters:
> > > > > >
> > > >
> > *jobmanager.adaptive-scheduler.submission.resource-stabilization-timeout
> > > > > *
> > > > > > *jobmanager.adaptive-scheduler.submission.resource-wait-timeout*
> > > > > >
> > > > > >
> *jobmanager.adaptive-scheduler.executing.cooldown-after-rescaling*
> > > > > >
> > > >
> > *jobmanager.adaptive-scheduler.executing.resource-stabilization-timeout*
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> *jobmanager.adaptive-scheduler.executing.rescale-trigger.max-checkpoint-failures*
> > > > > >
> *jobmanager.adaptive-scheduler.executing.rescale-trigger.max-delay*
> > > > > >
> > > > > > > Link to the updated FLIP doc:
> > > > > > > https://docs.google.com/document/d/1YeYSs64LqgUr3xyBTCjiRE-CT5VEyHjGjqxnxKPIQhM/edit
> > > > > >
> > > > > >
> > > > > > Thanks a lot.
> > > > > >
> > > > > > Regards,
> > > > > > Zdenek
> > > > > >
> > > > > > > On Wed, Jul 24, 2024 at 2:22 PM Zdenek Tison <[email protected]>
> > > > > > > wrote:
> > > > > >
> > > > > >> Hi Gyula,
> > > > > >>
> > > > > >> Thank you for reviewing the document and providing feedback.
> > > > > >>
> > > > > > >>    1. I agree that we need two separate parameters for stabilization
> > > > > > >>    intervals in different states. I will update the FLIP document
> > > > > > >>    accordingly.
> > > > > > >>    2. That's correct. We reached the same conclusion while prototyping
> > > > > > >>    the implementation. I will add a new bullet point to the FLIP
> > > > > > >>    document.
> > > > > >>
> > > > > >> Thanks a lot.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Zdenek
> > > > > >>
> > > > > >>
> > > > > > >> On Tue, Jul 23, 2024 at 3:02 PM Gyula Fóra <[email protected]>
> > > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi All!
> > > > > >>>
> > > > > > >>> Thank you for the proposal, I think it will be great to simplify the
> > > > > > >>> current rescaling flow to make it more digestible :)
> > > > > >>>
> > > > > >>> I have 2 comments:
> > > > > >>>
> > > > > > >>> 1. Related to what Matthias already pointed out, I think in production
> > > > > > >>> scenarios it may be a typical requirement to have a fairly short
> > > > > > >>> stabilization interval for job startup (to reduce downtime) but overall a
> > > > > > >>> longer stabilization period for Executing jobs before rescaling, to avoid
> > > > > > >>> fluctuations and therefore reduce downtime. I think it would be very
> > > > > > >>> important to have 2 configs for that; one could fall back to the other
> > > > > > >>> of course if undefined.
> > > > > > >>>
> > > > > > >>> 2. The document mentions that the stabilization period for executing jobs
> > > > > > >>> is measured from the first resource event. I feel that if after the
> > > > > > >>> stabilization period we don't have sufficient resources we should
> > > > > > >>> completely reset this timer and start the timeout from 0 when the next
> > > > > > >>> event arrives. This will be more in line with the concept of
> > > > > > >>> stabilization; otherwise, if you receive a batch of new resources you may
> > > > > > >>> not utilize it, because as soon as you have sufficient resources we
> > > > > > >>> rescale immediately.
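
The timer behavior suggested in point 2 can be sketched roughly as follows. This is an illustration only: the class and method names are hypothetical, time is passed in explicitly as seconds, and the sketch leaves out the immediate rescale on desired resources as well as the checkpoint synchronization from FLIP-461.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StabilizationTimer:
    """Hypothetical model of the Executing-state stabilization window."""

    timeout_s: float
    started_at: Optional[float] = None  # None means no window is currently open

    def on_resource_event(self, now: float) -> None:
        # The window opens at the first resource event; later events that
        # arrive while it is open do not extend it.
        if self.started_at is None:
            self.started_at = now

    def on_tick(self, now: float, has_sufficient_resources: bool) -> bool:
        """Return True if a rescale should be triggered at time `now`."""
        if self.started_at is None or now - self.started_at < self.timeout_s:
            return False
        # The window has expired. Close it in both cases, so that without
        # sufficient resources the next event starts a fresh window from 0.
        self.started_at = None
        return has_sufficient_resources


timer = StabilizationTimer(timeout_s=60)
timer.on_resource_event(now=0)
timer.on_tick(now=60, has_sufficient_resources=False)  # expires unused, timer resets
timer.on_resource_event(now=100)                       # next event opens a new window
print(timer.on_tick(now=160, has_sufficient_resources=True))
```

Without the reset, the window opened at time 0 would already count as expired when the event at time 100 arrives, so the first moment of sufficiency would trigger a rescale before the remaining resources of that batch show up.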
> > > > > >>>
> > > > > >>> Cheers,
> > > > > >>> Gyula
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > > >>> On Thu, Jul 18, 2024 at 9:58 AM Zdenek Tison <[email protected]>
> > > > > > >>> wrote:
> > > > > >>>
> > > > > > >>> > Thanks, Matthias, for your opinions.
> > > > > > >>> >
> > > > > > >>> > I see two scenarios where different values for starting and rescaling
> > > > > > >>> > would be appropriate:
> > > > > > >>> >
> > > > > > >>> > 1) Flink serverless providers may prefer the fastest possible job
> > > > > > >>> > startup time, which can also be achieved by setting a smaller value for
> > > > > > >>> > the stabilization timeout, such as 1 second, in the WaitingForResources
> > > > > > >>> > state. Conversely, to ensure maximum job uptime, it would be prudent to
> > > > > > >>> > increase the stabilization period for rescaling to a higher value, such
> > > > > > >>> > as 1 minute, to handle server/node maintenance effectively.
> > > > > > >>> >
> > > > > > >>> > 2) In Reactive mode, the stabilization period is set to 0 by default.
> > > > > > >>> > Setting a different default value for the rescale state could enhance
> > > > > > >>> > job stability during node maintenance, especially since the parameter
> > > > > > >>> > min-parallelism-increase is no longer applicable.
> > > > > >>> >
> > > > > >>> > Regards,
> > > > > >>> >
> > > > > >>> > Zdenek
> > > > > >>> >
> > > > > > >>> > On Tue, Jul 16, 2024 at 5:49 PM Matthias Pohl <[email protected]>
> > > > > > >>> > wrote:
> > > > > >>> >
> > > > > > >>> > > Thanks Zdenek for your proposal on aligning the resource control logic
> > > > > > >>> > > within the AdaptiveScheduler and cleaning up the rescaling code.
> > > > > > >>> > >
> > > > > > >>> > > Consolidating the parameters and the code as part of the 2.0 release
> > > > > > >>> > > makes sense in my opinion: The proposed change adds consistent behavior
> > > > > > >>> > > to the WaitingForResources and Executing states of the AdaptiveScheduler
> > > > > > >>> > > and irons out some flaws of the current implementation. This should help
> > > > > > >>> > > users get a clearer picture of the resource control logic. Removing the
> > > > > > >>> > > obsolete rescale waiting time if only sufficient resources are available
> > > > > > >>> > > is also a nice improvement.
> > > > > > >>> > >
> > > > > > >>> > > The j.a.min-parallelism-increase [1] parameter became kind of obsolete
> > > > > > >>> > > with the introduction of the rescale REST endpoint in FLIP-291 [2] as
> > > > > > >>> > > you pointed out in the FLIP. So, deprecating it sounds reasonable.
> > > > > > >>> > >
> > > > > > >>> > > On the topic of replacing the j.a.scaling-interval.max parameter [3]
> > > > > > >>> > > with the j.a.resource-stabilization-timeout [4]: I'm in favor of
> > > > > > >>> > > reducing the complexity of the Flink configuration. Therefore, using one
> > > > > > >>> > > parameter for both (WaitingForResources and Executing state) to
> > > > > > >>> > > stabilize the resources sounds like a good idea.
> > > > > > >>> > >
> > > > > > >>> > > I'm wondering whether there are scenarios where we would want to have
> > > > > > >>> > > different stabilization timeouts for starting (WaitingForResources) and
> > > > > > >>> > > rescaling (Executing) a job. In that case, having two resource
> > > > > > >>> > > stabilization parameters (one for job starts and one for rescales) with
> > > > > > >>> > > one being the fallback for the other is a straightforward solution.
> > > > > > >>> > >
> > > > > > >>> > > Just as a side note because it came up: Keep in mind that FLIP-461 still
> > > > > > >>> > > allows for immediate rescaling on a change event if checkpointing is
> > > > > > >>> > > disabled or j.a.max-delay-for-scale-trigger [5] is configured
> > > > > > >>> > > accordingly.
> > > > > >>> > >
> > > > > >>> > > Best,
> > > > > >>> > > Matthias
> > > > > >>> > >
> > > > > >>> > > [1]
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-min-parallelism-increase
> > > > > >>> > > [2]
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > >>> > > [3]
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max
> > > > > >>> > > [4]
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout
> > > > > >>> > > [5]
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > > >
> > > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-max-delay-for-scale-trigger
> > > > > >>> > >
> > > > > >>> > >
> > > > > >>> > >
> > > > > > >>> > > On Tue, Jul 16, 2024 at 3:05 PM Zdenek Tison <[email protected]>
> > > > > > >>> > > wrote:
> > > > > >>> > >
> > > > > > >>> > > > Hi, I'd like to move a discussion from Google Docs to the mailing list
> > > > > > >>> > > > so that it's visible to everyone.
> > > > > > >>> > > >
> > > > > > >>> > > > *Yuanfeng Hu* brought up two concerns:
> > > > > > >>> > > >
> > > > > > >>> > > > 1) Related to the resource-stabilization-timeout, he thinks 10s may be
> > > > > > >>> > > > too short. In a container environment, if the number of tm added by rest
> > > > > > >>> > > > requests is greater than 1, the tm initialization time may be much
> > > > > > >>> > > > longer than 10s.
> > > > > > >>> > > >
> > > > > > >>> > > > and
> > > > > > >>> > > >
> > > > > > >>> > > > 2) He proposed a little scenario:
> > > > > > >>> > > > There is 1 slot in the entire cluster. At this time, my task is running
> > > > > > >>> > > > at parallelism 1 (the required slot count is also 1). Then I add a tm
> > > > > > >>> > > > (1 slot), which will obviously trigger a change event, and it will
> > > > > > >>> > > > become stable after 10 seconds. If I change the required resources to 3
> > > > > > >>> > > > through rest at this time, a rescale will be triggered immediately and
> > > > > > >>> > > > the job runs at a parallelism of 2. Is this the expected result, or do
> > > > > > >>> > > > we expect that the rescale will be triggered after adding another tm,
> > > > > > >>> > > > because this exactly matches the required resources?
> > > > > >>> > > >
> > > > > > >>> > > > Thank you, *Yuanfeng Hu*, for opening the discussion.
> > > > > > >>> > > >
> > > > > > >>> > > > ---------------------------------------------------------------------------------------
> > > > > >>> > > >
> > > > > > >>> > > > 1) Regarding the stabilization period:
> > > > > > >>> > > >
> > > > > > >>> > > > I am unsure what you mean by the part, 'if the number of tm added by
> > > > > > >>> > > > rest requests is greater than 1.' However, I understand that it can take
> > > > > > >>> > > > some time to spawn additional containers/pods in a containerized
> > > > > > >>> > > > environment. On the other hand, if a user adds more TMs, for instance,
> > > > > > >>> > > > by increasing the number of replicas in a Kubernetes deployment, these
> > > > > > >>> > > > replicas should appear with some delay but at a similar time, correct?
> > > > > > >>> > > >
> > > > > > >>> > > > It's worth mentioning that since FLIP-461
> > > > > > >>> > > > <https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler>,
> > > > > > >>> > > > the rescale operation is synchronized with checkpoint events, so the
> > > > > > >>> > > > rescale doesn't happen right after this timeout expires.
> > > > > > >>> > > >
> > > > > > >>> > > > If we believe it is necessary to have different values for the
> > > > > > >>> > > > stabilization period in the Executing and WaitingForResources states,
> > > > > > >>> > > > even though this increases configuration complexity slightly, we could
> > > > > > >>> > > > have separate parameters for these two states:
> > > > > > >>> > > > jobmanager.adaptive-scheduler.resource-stabilization-timeout
> > > > > > >>> > > > <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-resource-stabilization-timeout>
> > > > > > >>> > > > and *jobmanager.adaptive-scheduler.scaling-stabilization-timeout*
> > > > > > >>> > > > (replacing the jobmanager.adaptive-scheduler.scaling-interval.max
> > > > > > >>> > > > <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#jobmanager-adaptive-scheduler-scaling-interval-max>).
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > > >>> > > > *2)* Regarding the proposed scenario:
> > > > > > >>> > > >
> > > > > > >>> > > > The same behavior occurs in the current Flink version when
> > > > > > >>> > > > `min-parallelism-increase` is set to its default value of 1. In this
> > > > > > >>> > > > case, the rescale operation is triggered immediately or aligned with
> > > > > > >>> > > > the checkpoint event (as specified in FLIP-461).
> > > > > > >>> > > > So, I would say the behavior is expected.
> > > > > > >>> > > > Additionally, users can configure the rescaling behavior. For example,
> > > > > > >>> > > > if a user sets the lower bound parallelism to 2 and the upper bound to
> > > > > > >>> > > > 3, the system will rescale after 10 seconds. Alternatively, if the user
> > > > > > >>> > > > sets the same value for the lower and upper bounds, the rescale
> > > > > > >>> > > > operation will wait until all slots are available.
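
The lower/upper bound example can be sketched as a small decision function. This is one reading of the behavior described above, not Flink code: the function name and parameters are hypothetical, "sufficient" is taken to mean the lower bound is met and "desired" the upper bound, and checkpoint alignment (FLIP-461) is left out.

```python
def rescale_decision(
    available_slots: int,
    lower_bound: int,
    upper_bound: int,
    stabilization_expired: bool,
) -> str:
    """Return "rescale" or "wait" for a job in the Executing state."""
    if available_slots >= upper_bound:
        # Desired resources are available: no reason to wait any longer.
        return "rescale"
    if available_slots >= lower_bound and stabilization_expired:
        # Only sufficient resources: rescale once the stabilization
        # timeout (10 seconds in the example) has passed.
        return "rescale"
    return "wait"


# Scenario from the thread: 2 slots in the cluster, job requirements changed via REST.
print(rescale_decision(2, lower_bound=2, upper_bound=3, stabilization_expired=False))
print(rescale_decision(2, lower_bound=2, upper_bound=3, stabilization_expired=True))
print(rescale_decision(2, lower_bound=3, upper_bound=3, stabilization_expired=True))
```

With bounds 2 and 3 the job rescales to parallelism 2 once the timeout has passed; with both bounds set to 3 it keeps waiting until a third slot appears.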
> > > > > >>> > > >
> > > > > > >>> > > > Best Regards,
> > > > > >>> > > > Zdenek Tison
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > >>> > > >
> > > > > > >>> > > > On Thu, Jul 11, 2024 at 2:38 PM Zdenek Tison <[email protected]>
> > > > > > >>> > > > wrote:
> > > > > >>> > > >
> > > > > >>> > > > > Hello,
> > > > > >>> > > > >
> > > > > > >>> > > > > Our team has been working on several improvements for
> > > > > > >>> > > > > AdaptiveScheduler, specifically focusing on aligning logic and
> > > > > > >>> > > > > timeouts in the WaitingForResources and Executing states. We believe
> > > > > > >>> > > > > these enhancements will improve the adaptive scheduler's robustness
> > > > > > >>> > > > > and maintainability.
> > > > > > >>> > > > >
> > > > > > >>> > > > > For more detailed information, please refer to the FLIP document.
> > > > > > >>> > > > >
> > > > > > >>> > > > > https://docs.google.com/document/d/1YeYSs64LqgUr3xyBTCjiRE-CT5VEyHjGjqxnxKPIQhM/edit?usp=sharing
> > > > > >>> > > > >
> > > > > >>> > > > > Thanks,
> > > > > >>> > > > > Zdenek Tison
> > > > > >>> > > > >
> > > > > >>> > > >
> > > > > >>> > >
> > > > > >>> >
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
>
