Re: [DISCUSS] FLIP-487: Show history of rescales in Web UI for AdaptiveScheduler

Yuepeng Pan Mon, 09 Dec 2024 18:38:59 -0800

Thanks Peter for the comments.

> Beside this, should we create new metrics for internal rescale to
> distinguish the metrics for internal restart?
> Created a jira for this https://issues.apache.org/jira/browse/FLINK-36871.


This will make sense to me. 
I'm just wondering, how should the relationship between internal restart and 
rescale be defined? Are they inclusive? Or are they completely opposite? I 
prefer that the restart triggered by rescale is a subset of internal restart to 
other description.  WDYTA?

Best,
Yuepeng.

On 2024/12/09 22:12:36 Peter Huang wrote:
> Thanks for the proposal. It is a very useful feature for the observability
> of autoscaling. . Echo to Eric. We need to add some content about where to
> store
> the rescale events. We may consider using the same mechanism of how
> exception history is stored.
> 
> Beside this, should we create new metrics for internal rescale to
> distinguish the metrics for internal restart?
> Created a jira for this https://issues.apache.org/jira/browse/FLINK-36871.
> Looking forward to the community's opinion?
> 
> 
> 
> 
> 
> 
> 
> 
> On Sat, Dec 7, 2024 at 7:08 AM Yuepeng Pan <panyuep...@apache.org> wrote:
> 
> > Thanks Eric for the comments !
> >
> > Sorry for some confusion in your reading. As you see, there are some
> > things that are still in discussion, so I haven't updated some of the
> > current documentation or wiki.
> >
> >
> > > How/Where would we store the history of these scaling events?
> >
> > Please let me try to clarify the questions first in the email thread.
> >
> > - For how:
> > We plan to introduce a parameter
> > 'web.adaptive-scheduler.rescale-history-max'
> > to control the number of recent rescale events to be stored in memory,
> > and when a job fails to submit or reaches a terminal state,
> > to store the most recent events in an archive json to a location
> > where the history server can retrieve them.
> >
> > - For where:
> > Tend to to put it in the AdaptiveScheduler.
> >
> >
> > And I'd happy to hear more ideas about its.
> >
> >
> > Thank you very much for your comments,
> > I'll add the two items after confirming how to split the FLIP.
> >
> >
> > Best,
> > Yuepeng.
> >
> >
> >
> > On 2024/12/06 16:32:07 Eric Xiao wrote:
> > > Hi Yuepeng Pan,
> > >
> > > This would be a very useful feature to have in OSS. One question I have
> > > that wasn't outlined in the FLIP or I may have missed it was -
> > >
> > > How/Where would we store the history of these scaling events? Is there an
> > > existing pattern for storing such information in other parts of Flink
> > that
> > > we will be following (i.e. exceptions, savepoint/checkpoint history). I
> > can
> > > see a couple of options (in-memory, a caching service, DFS, etc.) for
> > > storing the history, each having their own set of tradeoffs.
> > >
> > > Best,
> > >
> > > Eric
> > >
> > > On Fri, Dec 6, 2024 at 5:23 AM Yuepeng Pan <panyuep...@apache.org>
> > wrote:
> > >
> > > > Thanks Rui for the comments!
> > > >
> > > >
> > > >
> > > > > I have a minor question for this: could we discuss 2 FLIPs together?
> > > > > I'm afraid the rest endpoints doesn't work as expected when we
> > discuss
> > > > > WebUI change in the future.
> > > > > Clarification: Organize related designs in different FLIP wikis and
> > > > > discuss them in different discussions. Together just means that
> > > > > two FLIPs can be carried out at the same time.
> > > > > After the design and discussion of both FLIPs are completed,
> > > > > we could start voting on each one.
> > > >
> > > >
> > > > It makes sense to me as there are some impacts between the two FLIPs
> > > > and it could help avoid missing some details.
> > > >
> > > > Let’s assume we proceed in this way, What do you think about using
> > > > FLIP-487 and its corresponding email thread to focus on discussing
> > > > the UI design to reduce ambiguity? Meanwhile, we can create a new
> > > > FLIP and email to discuss the features of REST and AdaptiveScheduler,
> > > > for example, with a title like: "Support AdaptiveScheduler record and
> > > > query the rescale history,"
> > > > or another more appropriate title.
> > > >
> > > > BWT, whatever the final conclusion is, I will adjust the content on the
> > > > wiki
> > > > instead of google doc after a reasonable conclusion,
> > > > and may subsequent discussions and updates be aligned with the FLIP on
> > the
> > > > wiki.
> > > >
> > > > Looking forward to more ideas about it.
> > > >
> > > > Best,
> > > > Yuepeng Pan
> > > >
> > > >
> > > > On 2024/12/06 03:44:24 Rui Fan wrote:
> > > > > Thanks Yuepeng for driving this proposal!
> > > > >
> > > > > It's definitely a great addition for Adaptive Scheduler.
> > > > > I see FLIP-487 has 2 docs, one is official wiki[1], one is google
> > doc[2].
> > > > > It's better to only use one doc to ensure the consistency, I prefer
> > > > > to only use FLIP-487 official wiki due to you have the wiki
> > permission.
> > > > >
> > > > > > I'm also wondering whether it would make sense to focus on the REST
> > > > > > endpoints in this FLIP and put the UI work in a separate FLIP.
> > WDYT?
> > > > > > Decreasing the scope would probably help handling the required
> > changes
> > > > >
> > > > > Thanks Matthias for the good suggestions, split it to 2 FLIPs makes
> > sense
> > > > > to me.
> > > > > I have a minor question for this: could we discuss 2 FLIPs together?
> > > > > I'm afraid the rest endpoints doesn't work as expected when we
> > discuss
> > > > > WebUI change in the future.
> > > > >
> > > > > Also, I found the current FLIP wiki[1] and doc[2] have deleted all
> > > > > information
> > > > > related to WebUI change. If so, the FLIP title should be changed as
> > well.
> > > > > Currently, the title is "FLIP-487: Show history of rescales in Web
> > UI for
> > > > > AdaptiveScheduler"
> > > > > but it seems FLIP-487 doesn't include any Web UI change.
> > > > >
> > > > > [1]
> > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > > > > [2]
> > > > >
> > > >
> > https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?usp=sharing
> > > > >
> > > > > Best,
> > > > > Rui
> > > > >
> > > > > On Wed, Dec 4, 2024 at 9:56 PM Yuepeng Pan <panyuep...@apache.org>
> > > > wrote:
> > > > >
> > > > > >
> > > > > > Thanks Matthias  a lot  for the comments and guidance.
> > > > > >
> > > > > > >Can we reorganize the draft? Right now, we have some (for
> > > > RescaleEvent,
> > > > > > >Required/AcquiredParallelism) schema defined in the "Proposed
> > Changes"
> > > > > > section
> > > > > > >and some other schema under "Public Interfaces". It would be nice
> > to
> > > > have
> > > > > > this
> > > > > > >more organized. Just as a suggestion: In the end the proposed
> > changes
> > > > > > should list
> > > > > > >the different REST endpoints you want to introduce (including the
> > > > > > corresponding
> > > > > > >schemas for request and response).
> > > > > >
> > > > > > Thanks for the comments. I’ve made some structural adjustments and
> > > > content
> > > > > > modifications to help clarify certain issues. Since some questions
> > are
> > > > > > still under discussion, I plan to update the document once we
> > reach a
> > > > > > consensus.
> > > > > >
> > > > > > > I'm also wondering whether it would make sense to focus on the
> > REST
> > > > > > > endpoints in this FLIP and put the UI work in a separate FLIP.
> > WDYT?
> > > > > > Decreasing
> > > > > > > the scope would probably help handling the required changes.
> > > > > >
> > > > > > Nice proposal ! That will help us better focus our attention and
> > > > > > concentrate on the main target.
> > > > > >
> > > > > > > Have you considered adding the onChange event timestamp for a
> > rescale
> > > > > > event
> > > > > > > as well? We introduced a separation of the job requirements
> > change
> > > > event
> > > > > > > and the actual rescale execution in FLIP-461 [1]. It might be
> > worth
> > > > > > > documenting the time when a change was monitored for the first
> > time
> > > > that
> > > > > > > triggered the rescale. WDYT?
> > > > > >
> > > > > > Thank you very much for your comments.As described in the
> > document, the
> > > > > > original design had many abstract concepts that were relatively
> > > > simple. The
> > > > > > information recorded for each rescale event mainly reflected the
> > > > current
> > > > > > state, rather than the entire historical state of the adaptive
> > > > scheduler
> > > > > > throughout the rescale process.
> > > > > > IIUC, when we add the onChange event timestamp for rescale events,
> > each
> > > > > > rescale event will capture the state changes of the adaptive
> > scheduler
> > > > at
> > > > > > various stages (these states would form a collection similar to the
> > > > set of
> > > > > > all possible states of the adaptive scheduler) along with the
> > > > corresponding
> > > > > > timestamps. These states would essentially form an ordered
> > collection.
> > > > > > Compared to the original design, this would undoubtedly make
> > internal
> > > > > > information more transparent, which I think is very meaningful.
> > > > > > If I misunderstood, please correct me.
> > > > > >
> > > > > > >You're mentioning "comments" as a field of the RescaleEvent in
> > your
> > > > > > proposal.
> > > > > > >What's the use-case here? Where are these comments from? (update)
> > A
> > > > brief
> > > > > > talk
> > > > > > >with Yuepeng on that topic revealed that the field is supposed to
> > be
> > > > used
> > > > > > for
> > > > > > >errors that occurred during the rescale operation. My take on that
> > > > one:
> > > > > > >- We might want to reconsider the field name in that case (maybe
> > > > > > >errors_during_rescale?). "comments" seems to be quite generic.
> > > > > > > - Additionally, shouldn't we make this a list of errors rather
> > than a
> > > > > > String field?
> > > > > >
> > > > > > Sorry for not fully explaining the purpose of this field earlier.
> > The
> > > > > > intention behind adding this field is to highlight special
> > > > circumstances
> > > > > > during a rescale, such as exceptions caused by insufficient
> > resources
> > > > or
> > > > > > situations where a new request overrides the current one, forcing
> > us to
> > > > > > abandon the ongoing rescale operation and reallocate resources
> > based
> > > > on the
> > > > > > new request. In such cases, it would be valuable to record what
> > exactly
> > > > > > happened internally. That said, I completely agree with the point
> > that
> > > > > > "comments" might seem overly broad.
> > > > > > Maybe we need to address a few key questions here:
> > > > > >
> > > > > > 1. Is it necessary to retain a "comments"-like field to capture
> > such
> > > > > > information?
> > > > > > 2. Should error-related information also be retained?
> > > > > > 3. If retained, where should the "comments" (if applicable) and
> > error
> > > > > > information fields be located?
> > > > > >     - Should they be tied to a specific adaptive-scheduler-state
> > > > window?
> > > > > >     - Or should they be directly included in the rescale event
> > message?
> > > > > > 4. Should these fields be implemented as lists?
> > > > > >
> > > > > > For the first and second items, I prefer to retain both types of
> > > > > > information because they can provide transparent details to the
> > user.
> > > > > > For the third and fourth items, I prefer to associate these two
> > fields
> > > > as
> > > > > > lists with specific adaptive-scheduler state windows. This approach
> > > > could
> > > > > > help organize the information better and link it to relevant
> > events in
> > > > the
> > > > > > rescale process.
> > > > > >
> > > > > > I would greatly appreciate hearing more suggestions and ideas!
> > > > > >
> > > > > > > - How certain are we that we can associate errors to the actual
> > > > rescale
> > > > > > > operation and rather than the error being caused by something
> > else?
> > > > > >
> > > > > > In my limited reading, it seems that only exceptions related to
> > > > resource
> > > > > > shortage during parallelism allocation can be classified into this
> > > > type of
> > > > > > exception. Other exceptions can be categorized outside this scope.
> > > > > >
> > > > > > > But there is no "attempt number" mentioned in the RescaleEvent
> > > > schema.
> > > > > >
> > > > > > Please let me try to add to this question:
> > > > > > - Each rescale requirement corresponds to a
> > > > resourceRequirementsEpochID.
> > > > > > - For the same rescale requirement (resourceRequirementsEpochID),
> > > > multiple
> > > > > > rescale attempts may occur to here. We treat each rescale under the
> > > > current
> > > > > > resource requirements as a rescale attempt. That is, for each
> > rescale
> > > > > > attempt on the same rescale requirement
> > (resourceRequirementsEpochID),
> > > > the
> > > > > > rescale attempt id will increment by 1. When the
> > > > > > resourceRequirementsEpochID is changed, the rescale attempt will be
> > > > reset.
> > > > > > - Based on these two points, whenever a rescale attempt happens,
> > the
> > > > > > rescaleID will increment globally.
> > > > > >
> > > > > > I hope this may clarify the logic and approach. Please let me know
> > > > what’s
> > > > > > your opinion.
> > > > > >
> > > > > > > Additionally, what is the ID based on? Do we start from 0 and
> > just
> > > > > > increment?
> > > > > > > Or do we want to have a mechanism that ensures that the IDs are
> > also
> > > > > > > unique/monotonically increasing after JobManager failovers?
> > > > > >
> > > > > > To be honest, I expect initially it to increment from 0 every time
> > a
> > > > job
> > > > > > is started since the rescale ID is not strongly tied to the state.
> > If
> > > > we
> > > > > > can ensure that the ID remains unique and increase monotonically
> > even
> > > > after
> > > > > > a JobManager failover, this would make the rescale ID more
> > consistent,
> > > > > > right? If so, that makes sense to me. Please let me know your
> > thoughts.
> > > > > >
> > > > > > > For the parallelism schema: I might be misreading the draft here
> > but
> > > > > > you're proposing to use the subtask
> > > > > > > name as the ID to refer to the JobVertex? That the name might
> > become
> > > > > > quite long. What about using the
> > > > > > > JobVertexID here. That would be also more aligned to how the
> > > > parallelism
> > > > > > is represented by the
> > > > > > > /jobs/<job-id>/resource-requirements endpoint. If we want to add
> > the
> > > > > > task name for readability purposes, we
> > > > > > > can still add this one as a taskName field to the
> > > > > > Required/AcquiredParallelism schema.
> > > > > >
> > > > > > Good proposal! I did consider using JobVertexID, but for ease of
> > use, I
> > > > > > ended up changing it to taskName. I'm wondering if we could use
> > both
> > > > > > JobVertexID and taskName. Could we provide a parameter to control
> > the
> > > > > > maximum length of taskName? In this way, we could ensure the
> > > > uniqueness of
> > > > > > JobVertexID while maintaining a certain level of readability for
> > > > taskName.
> > > > > > WDYTA?
> > > > > >
> > > > > > > Status field:
> > > > > > > - What is the meaning of "TRYING"? I guess, we're more or less
> > using
> > > > the
> > > > > > > AdaptiveScheduler states here, aren't we? Can't we align/stick
> > to the
> > > > > > > naming that's defined in the AdaptiveScheduler state?
> > > > > >
> > > > > > As mentioned above, this type of state in the design is used to
> > > > describe
> > > > > > the current status of a rescale event. It is not equivalent to the
> > > > state of
> > > > > > the adaptive scheduler; rather, it represents a window just after
> > > > receiving
> > > > > > an update request but before reaching the next defined state.
> > > > > >
> > > > > > > Can't we align/stick to the naming that's defined in the
> > > > > > AdaptiveScheduler state?
> > > > > >
> > > > > > I previously conducted some research on the states of the
> > > > > > AdaptiveScheduler [1]. Plase let me first try to define the
> > necessary
> > > > > > conditions for a Rescale Event:
> > > > > >
> > > > > > - The process starting from WaitingForResources /
> > > > CreatingExecutionGraph
> > > > > > and ending at Executing / FINISHED is considered one rescale.
> > > > > > - If the process starts from WaitingForResources /
> > > > CreatingExecutionGraph
> > > > > > and ends at Executing, this can be considered a successful rescale.
> > > > > > - If the process starts from WaitingForResources /
> > > > CreatingExecutionGraph
> > > > > > and transitions directly to FINISHED for other reasons, it can be
> > > > > > considered a failed rescale.
> > > > > > Based on these definitions, the adaptive-scheduler states that can
> > be
> > > > > > included in the rescale process seem to be Created,
> > > > CreatingExecutionGraph,
> > > > > > WaitingForResources, Executing, StopWithSavepoint, and Restarting.
> > > > > >
> > > > > > If everything is aligned and we treat the AdaptiveScheduler states
> > as
> > > > > > equivalent to the RescaleEvent states, some of the states might
> > seem a
> > > > bit
> > > > > > odd. For example, after a successful rescale, the final state of
> > the
> > > > > > rescale event would be Executing(as same as adaptive scheduler).
> > Would
> > > > we
> > > > > > consider this situation reasonable? If so, Aligning the states
> > would
> > > > make
> > > > > > sense to me.
> > > > > > Additionally, if the above definition holds, each rescale would
> > need a
> > > > > > "final state" or “current state” field to indicate its status.
> > > > > >
> > > > > > Maybe it’s important to define and reach consensus on the rescale
> > event
> > > > > > and its start and end boundaries.
> > > > > > Please correct me if I’m wrong. Thank you very much!
> > > > > >
> > > > > >
> > > > > > > Do we really need a new REST endpoint for the configuration?
> > Can't we
> > > > > > get
> > > > > > > the provided information already from the existing configuration
> > > > > > endpoint?
> > > > > >
> > > > > > Thank you for your comments. Sorry for not finding a suitable
> > existing
> > > > > > REST interface to reuse.
> > > > > > But there’s a part of ‘/jobs/:jobid/config‘ that could be reused
> > at the
> > > > > > backend side.
> > > > > > Another reason for introducing a new interface is to directly[2]
> > > > retrieve
> > > > > > the latest values from the Adaptive Scheduler, as we cannot
> > guarantee
> > > > that
> > > > > > the configuration values used in the Adaptive Scheduler are always
> > the
> > > > ones
> > > > > > specified in the configuration file. For example, some parameter
> > > > values may
> > > > > > be overwritten, which could change them. WDYTA?
> > > > > >
> > > > > > > For the summary endpoint: I see similarities to the checkpoint
> > > > summary
> > > > > > > here. Not sure whether you already considered that but would it
> > make
> > > > > > sense
> > > > > > > to align the field names in some way to have a consistent
> > > > look-and-feel?
> > > > > > > I'm also wondering whether it makes sense to align the schema to
> > have
> > > > > > > something like the latest rescale, failed rescale, ...
> > > > > >
> > > > > > Thank you for the reminder. Yes, I have considered it. For
> > example, I
> > > > > > tried to align the request information in the REST interface with
> > the
> > > > > > checkpoint summary, as seen in this document[2]. However, I haven't
> > > > > > strictly followed the same interface style and fields as the
> > checkpoint
> > > > > > REST interface, and instead used a simpler list (sorry, I should
> > have
> > > > > > shared these considerations in the draft earlier to give developers
> > > > more
> > > > > > context). If needed, I think it’s a great choice.
> > > > > >
> > > > > > By the way, perhaps we should also consider the archive feature
> > for the
> > > > > > rescale history, to support viewing this information on the history
> > > > server.
> > > > > >
> > > > > > Thank you  very much.
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > >
> > https://drive.google.com/file/d/1FE17cAfbUXypYBp0soaVLu5oqn39BoKI/view?usp=sharing
> > > > > > [2]
> > > > > >
> > > >
> > https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?disco=AAABSsVvI6Q
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2024/12/02 08:59:16 Matthias Pohl wrote:
> > > > > > > Hi Yuepeng,
> > > > > > > thanks for the proposal. Having a way to see the history of
> > rescales
> > > > is a
> > > > > > > nice feature, I guess. I went over the draft and have a few
> > > > questions:
> > > > > > >
> > > > > > > Can we reorganize the draft? Right now, we have some (for
> > > > RescaleEvent,
> > > > > > > Required/AcquiredParallelism) schema defined in the "Proposed
> > > > Changes"
> > > > > > > section and some other schema under "Public Interfaces". It
> > would be
> > > > nice
> > > > > > > to have this more organized.
> > > > > > > Just as a suggestion: In the end the proposed changes should
> > list the
> > > > > > > different REST endpoints you want to introduce (including the
> > > > > > corresponding
> > > > > > > schemas for request and response).
> > > > > > > ---
> > > > > > > I'm also wondering whether it would make sense to focus on the
> > REST
> > > > > > > endpoints in this FLIP and put the UI work in a separate FLIP.
> > WDYT?
> > > > > > > Decreasing the scope would probably help handling the required
> > > > changes.
> > > > > > > ---
> > > > > > > Have you considered adding the onChange event timestamp for a
> > rescale
> > > > > > event
> > > > > > > as well? We introduced a separation of the job requirements
> > change
> > > > event
> > > > > > > and the actual rescale execution in FLIP-461 [1]. It might be
> > worth
> > > > > > > documenting the time when a change was monitored for the first
> > time
> > > > that
> > > > > > > triggered the rescale. WDYT?
> > > > > > > ---
> > > > > > > You're mentioning "comments" as a field of the RescaleEvent in
> > your
> > > > > > > proposal. What's the use-case here? Where are these comments
> > from?
> > > > > > >
> > > > > > > (update)
> > > > > > > A brief talk with Yuepeng on that topic revealed that the field
> > is
> > > > > > supposed
> > > > > > > to be used for errors that occurred during the rescale
> > operation. My
> > > > take
> > > > > > > on that one:
> > > > > > > - We might want to reconsider the field name in that case (maybe
> > > > > > > errors_during_rescale?). "comments" seems to be quite generic.
> > > > > > > - Additionally, shouldn't we make this a list of errors rather
> > than a
> > > > > > > String field?
> > > > > > > - How certain are we that we can associate errors to the actual
> > > > rescale
> > > > > > > operation and rather than the error being caused by something
> > else?
> > > > > > > ---
> > > > > > > In the schema of the RescaleEvent you describe the three
> > different
> > > > > > > ID/numbers in the following way:
> > > > > > >
> > > > > > > > The ‘id’ is automatically incremental, The rescaleAttemptId is
> > > > > > generated
> > > > > > > > based on one specified resource-requirement and the attempt
> > number
> > > > is
> > > > > > > > generated based on rescaleAttemptId.
> > > > > > >
> > > > > > >  But there is no "attempt number" mentioned in the RescaleEvent
> > > > schema.
> > > > > > > Additionally, what is the ID based on? Do we start from 0 and
> > just
> > > > > > > increment? Or do we want to have a mechanism that ensures that
> > the
> > > > IDs
> > > > > > are
> > > > > > > also unique/monotonically increasing after JobManager failovers?
> > > > > > > ---
> > > > > > > For the parallelism schema: I might be misreading the draft here
> > but
> > > > > > you're
> > > > > > > proposing to use the subtask name as the ID to refer to the
> > > > JobVertex?
> > > > > > That
> > > > > > > the name might become quite long. What about using the
> > JobVertexID
> > > > here.
> > > > > > > That would be also more aligned to how the parallelism is
> > > > represented by
> > > > > > > the /jobs/<job-id>/resource-requirements endpoint. If we want to
> > add
> > > > the
> > > > > > > task name for readability purposes, we can still add this one as
> > a
> > > > > > taskName
> > > > > > > field to the Required/AcquiredParallelism schema.
> > > > > > > ---
> > > > > > > Status field:
> > > > > > > - What is the meaning of "TRYING"? I guess, we're more or less
> > using
> > > > the
> > > > > > > AdaptiveScheduler states here, aren't we? Can't we align/stick
> > to the
> > > > > > > naming that's defined in the AdaptiveScheduler state?
> > > > > > > ---
> > > > > > > Do we really need a new REST endpoint for the configuration?
> > Can't
> > > > we get
> > > > > > > the provided information already from the existing configuration
> > > > > > endpoint?
> > > > > > > That said, I still find it useful to have a config tab in the UI
> > at
> > > > the
> > > > > > end.
> > > > > > > ---
> > > > > > > For the summary endpoint: I see similarities to the checkpoint
> > > > summary
> > > > > > > here. Not sure whether you already considered that but would it
> > make
> > > > > > sense
> > > > > > > to align the field names in some way to have a consistent
> > > > look-and-feel?
> > > > > > > I'm also wondering whether it makes sense to align the schema to
> > have
> > > > > > > something like latest rescale, failed rescale, ...
> > > > > > >
> > > > > > > Best,
> > > > > > > Matthias
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
> > > > > > >
> > > > > > > On Mon, Nov 25, 2024 at 11:24 AM yuanfeng hu <
> > yuanf...@apache.org>
> > > > > > wrote:
> > > > > > >
> > > > > > > > +1, I think this feature is very useful for adaptive scheduler.
> > > > > > > >
> > > > > > > > Yuepeng Pan <panyuep...@apache.org> 于2024年11月22日周五 18:38写道：
> > > > > > > >
> > > > > > > > > Hi community,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Currently, the Adaptive Scheduler already supports the REST
> > API
> > > > > > > > >
> > > > > > > > > to manually adjust[1] the parallelism of jobs, which
> > enhances the
> > > > > > > > >
> > > > > > > > > functionality of the Adaptive Scheduler.
> > > > > > > > >
> > > > > > > > > However, Adaptive Scheduler doesn't support displaying or
> > > > tracing the
> > > > > > > > > rescale history yet[2].
> > > > > > > > >
> > > > > > > > > This makes it inconvenient for users/devs to quickly obtain
> > some
> > > > > > internal
> > > > > > > > >
> > > > > > > > > information about the rescale history of the Adaptive
> > Scheduler.
> > > > > > > > >
> > > > > > > > > And showing the history of rescale events of
> > AdaptiveScheduler in
> > > > > > the web
> > > > > > > > >
> > > > > > > > > UI is very useful for users to make the next step for jobs.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Therefore, I created the FLIP-487[3] doc to support
> > > > > > > > >
> > > > > > > > > 'Show history of rescales in Web UI for AdaptiveScheduler'.
> > > > > > > > >
> > > > > > > > > Please refer to the google document[3] for more details
> > > > > > > > >
> > > > > > > > > about the proposed design and implementation.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Looking forward to any feedback and opinions on this
> > proposal.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > > > > >
> > > > > > > > > [2] https://issues.apache.org/jira/browse/FLINK-22258
> > > > > > > > >
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?tab=t.0
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thank you very much.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > >
> > > > > > > > > Regards.
> > > > > > > > >
> > > > > > > > > Yuepeng Pan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best,
> > > > > > > > Yuanfeng
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-487: Show history of rescales in Web UI for AdaptiveScheduler

Reply via email to