Re: [DISCUSS] FLIP-487: Show history of rescales in Web UI for AdaptiveScheduler

Matthias Pohl Wed, 31 Dec 2025 06:34:03 -0800

Thanks for the quick response. I added my responses inline. PTAL

Best,
Matthias


On Mon, 22 Dec 2025, 01:02 Yuepeng Pan, <[email protected]> wrote:

> Hi, Matthias, I'm glad to see that email.
> And thank you very much for your review and comments.
>
> To facilitate reading and discussion,
> I have grouped related questions together as much as possible
> when organizing my responses to your comments,
> and I hope this will not cause any inconvenience.
>
>
> 1. Reference typo & format.
>
>
> > Adaptive Scheduler will support record and query the rescale history in[2]
> > Shouldn't it refer to reference #3, i.e. FLIP-495?
> > nit: In the wiki, we do not need to add the references but use links with
> > proper link text (e.g. in the motivation paragraph). That should improve
> > readability.
>
> Thanks for catching that and for the suggestions. That makes sense to me.
> I corrected and reformatted the citation errors
> and reference formats you mentioned throughout the entire document.
>
>
> 2. Schemas:
>
> a. schema of the response for /jobs/overview
>
> > extended schema of the response for /jobs/overview
>
> > The extract of the schema extension is not precise: We should show, that
> > the new fields are added to the item type
> > (urn:jsonschema:org:apache:flink:runtime:messages:webmonitor:JobDetails).
> > About the field name formatting of "job-type": We still do not have this
> > one included in the code convention. But AFAIS, we usually follow camelCase
> > format rather than kebab-case. But especially the Job overview uses both
> > already.
>
> Thanks for the comments.
> That sounds good to me.
> I have updated the corresponding accompanying changes to the JobDetails
> class.
>
> b. schema of response for /jobs/:jobid/rescales
>
> > Schema of response for /jobs/:jobid/rescales
> > I noticed that also for the other JSON schemas, we jump between formats
> > (even introducing snake_casing). Let's unify them and stick to camelCase.
> > WDYT?
>
> Nice idea!
> Considering compatibility and the workload associated with this FLIP,
> the existing fields are not modified in the current FLIP,
> only the newly introduced fields are named
> following the camelCase naming convention.
> And I updated the lines about schemas that need to change.


> Regarding the naming style changes for all fields in schemas that are
> modified (as opposed to newly introduced) within this FLIP, do we need a
> new FLIP to address and unify such work?
> This way, the new FLIP would focus solely on this type of task.
> What do you think about it ?
>

You are right: existing fields shouldn't be modified. For the new ones, we
can at least make sure that we don't introduce more inconsistencies.

In general, the problem is that the JSON formatting is not specified in the
coding guidelines. That's why it comes as no surprise that these
formatting inconsistencies exist. We would need to start a discussion on
updating the Flink coding guidelines first. Only afterwards could we fix
the formatting of the existing fields.

Such a change could only be rolled out as part of a major version (e.g.
3.0), though.


> c. For "summary.rescaleCounts"
>
> > For "summary.rescaleCounts", we might not need to add the "_rescales"
> > suffix to the record fields since the parent indicates already that all of
> > the fields are rescale counts. We, therefore, could use "inProgress",
> > "ignored", "completed", "failed".
>
> Yes, this indeed makes the expression more concise and to the point.
> I updated this part.
>
> > Do we see value in adding the total
> > value? That could be easily calculated using the other four metrics. Hence,
> > I think we can consider it as being redundant and remove it.
>
> This is acceptable, as the only difference lies in
> whether the total value is calculated on the frontend side or on the backend.
>
> d. rescalesDurationStats/rescales_duration_stats(the previous edition)
>
> > "rescales_duration_stats"
> > For all the "durationStats"? Can we add the time unit to make things
> > clearer, e.g. "rescalesDurationStats" becomes
> > "rescalesDurationStatsInMillis"? ...same applies to the timestamps
>
> Good idea~.
> I updated the descriptions of all timestamp-related attributes.
> Please help take a look!
>
> e. ignoredRescalesDurationStats/ignored_rescales_duration_stats(the
> previous edition)
>
> > "ignored_rescales_duration_stats"
> > Are the stats useful for rescales which were actually not executed?
>
> Answering this question may be a bit difficult for me.
> In theory, since rescale operations of the Ignored type can occur,
> it is reasonable to include them in the statistics—at least
> from the perspective of having a complete set of dimensions.
> In addition, I'm not certain whether users truly do not care
> about statistics for this type of data.
> Therefore, I kept it in the initial design document.
> If you think it is unnecessary to retain this data,
> we can exclude Ignored rescale types from the duration statistics.
> I would appreciate your experience and opinion on this.


Fair enough.

> f. the durationInMillis attribute.


> > duration
> > Rescale details already contain the start and end time. Adding the duration
> > here shouldn't be necessary.
>
> If the frontend page does not involve overly complex display logic,
> adding an additional durationInMillis field here should be unnecessary.
>

Just to clarify: I don't suggest removing the duration information from the
web UI. It's only redundant in the REST API because it can be calculated on
the client side from the start and end timestamps.


>
> 3. UI
>
> a. Rescale History UI(related to 'durationInMillis' attribute)
>
> > Rescale History UI
> > The history looks nice. What about making the duration of the inProgress
> > rescales dynamic, i.e. counting the seconds up from the start time? Keeping
> > the NA is also fine if the dynamic approach is too complicated.
>
> In my limited reading,
> this is feasible from an implementation perspective,
> though it may require some adjustments.
> If we remove the durationInMillis field from rescale,
> the frontend would need to perform some additional processing when
> displaying the data.
> For example:
> rescale{terminalState=inProgress, startTimestampInMillis=1,
> endTimestampInMillis=null, durationInMillis=3}
> If we keep the durationInMillis field, the frontend would almost not need
> any logic and could simply display the data as is.
> If we do not keep the durationInMillis field, the frontend would need to do
> two things when rendering:
>   - Calculate durationInMillis based on startTimestampInMillis and
> endTimestampInMillis
>   - When displaying records with terminalState = inProgress, show
> endTimestampInMillis as null
>
> Similarly, for handling durationInMillis in schedulerState,
> I'm not sure whether such scenarios would arise,
> although we have not yet considered
> whether this data should be displayed in the same way as
> Rescale.durationInMillis.
> Although the difference is small,
> it is worth clarifying so that we can better evaluate the decision.
>
> Therefore, please let me know your thoughts on
> - whether we should keep the durationInMillis field for both Rescale and
> schedulerState in the schema
> - Show N.A in the duration of InProgress Rescale and remove the
> durationInMillis in the related sub-json.
> - Or something reasonable from you.
>

As mentioned in 2.f), I would remove the duration and calculate it
dynamically in the client code. That shouldn't be too complex an operation,
and it allows us to keep the duration dynamic for rescales in progress.
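
To illustrate the client-side calculation, here is a minimal sketch. It is
not taken from the FLIP: the interface and function names are hypothetical,
and only the field names startTimestampInMillis and endTimestampInMillis
come from this thread. It assumes that endTimestampInMillis is null while a
rescale is still in progress.

```typescript
// Hypothetical shape of one rescale entry. Only the two timestamp field
// names are taken from the discussion; everything else is an assumption.
interface RescaleTimestamps {
  startTimestampInMillis: number;
  // Assumed to be null while the rescale is still in progress.
  endTimestampInMillis: number | null;
}

// Derives the duration on the client. For a rescale in progress, the
// duration is measured against the supplied "now", so the UI can re-render
// periodically and count up from the start time.
function rescaleDurationInMillis(
  rescale: RescaleTimestamps,
  nowInMillis: number = Date.now()
): number {
  const end = rescale.endTimestampInMillis ?? nowInMillis;
  // Clamp at zero in case the client clock lags behind the server clock.
  return Math.max(0, end - rescale.startTimestampInMillis);
}
```

One caveat with this approach: for rescales in progress the client clock is
compared with a server-side timestamp, so the displayed value is only as
accurate as the clock synchronization between the two.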


> b. Rescale Overview UI.
>
> > Rescale Overview UI
> > The screenshot shows "Acquired profile" twice for the slot (based on the
> > details UI, the first one is supposed to be "required").
>
> Sorry for the typo. I corrected it.
>
> > Additionally, in
> > FLIP-495 we agreed on four metrics: previous, sufficient, desired and
> > acquired resources (for parallelism and profile). Should we use those in
> > the UI as well?
>
> Okay. Updated it in the related UI draft pages.
>
> > We might want to add tooltips to the headers as well to
> > add a description for each of the metrics.
>
> > Could we add tooltips to the headers of the rescale overview to describe
> > the different IDs?
>
> Yes, the suggestion is reasonable.
> And I added the description of hint messages about some core header
> attributes after the corresponding UI draft pages.
> Looking forward to your opinion.
>
> 4. Items newly added by me:
> I have added notes after some sections of the core UI pages regarding
> limiting the displayed length of UUID-type identifiers and issues related
> to task names.
>
> I'd greatly appreciate any suggestions you may have.
>
>
> Best regards,
> Yuepeng Pan
>
>
> Matthias Pohl <[email protected]> wrote on Thu, Dec 18, 2025 at 18:08:
>
> > Hi Yuepeng,
> > I finally found some time to look into that FLIP again. Sorry for the
> > delay. Thanks for working on this topic and pushing it. Here are a few
> more
> > comments on the current state of FLIP-487:
> >
> > Adaptive Scheduler will support record and query the rescale history
> in[2].
> >
> > Shouldn't it refer to reference #3, i.e. FLIP-495?
> >
> > nit: In the wiki, we do not need to add the references but use links with
> > proper link text (e.g. in the motivation paragraph). That should improve
> > readability.
> >
> > extended schema of the response for /jobs/overview
> >
> > The extract of the schema extension is not precise: We should show, that
> > the new fields are added to the item type
> > (urn:jsonschema:org:apache:flink:runtime:messages:webmonitor:JobDetails).
> > About the field name formatting of "job-type": We still do not have this
> > one included in the code convention. But AFAIS, we usually follow
> camelCase
> > format rather kebab-casing. But especially the Job overview uses both
> > already.
> >
> > Could we add tool tips to the headers of the rescale overview to describe
> > the different IDs?
> >
> > Schema of response for /jobs/:jobid/rescales
> >
> > I noticed that also for the other JSON schemas, we jump between formats
> > (even introducing snake_casing). Let's unify them and stick to camelCase.
> > WDYT?
> >
> > For "summary.rescaleCounts", we might not need to add the "_rescales"
> > suffix to the record fields since the parent indicate already that all of
> > the fields are rescale counts. We, therefore, could use "inProgress",
> > "ignored", "completed", "failed". Do we see value in adding the total
> > value? That could be easily calculated using the other four metrics.
> Hence,
> > I think we can consider it as being redundant and remove it.
> >
> > "rescales_duration_stats"
> >
> > For all the "durationStats"? Can we add the time unit to make things
> > clearer, e.g. "rescalesDurationStats" becomes
> > "rescalesDurationStatsInMillis"? ...same applies to the timestamps
> >
> > "ignored_rescales_duration_stats"
> >
> > Are the stats useful for rescales which were actually not executed?
> >
> > duration
> >
> > Rescale details already contain the start and end time. Adding the
> duration
> > here shouldn't be necessary.
> >
> > Rescale Overview UI
> >
> >
> > The screenshot shows "Acquired profile" twice for the slot (based on the
> > details UI, the first one is supposed to be "required"). Additionally, in
> > FLIP-495 we agreed on four metrics: previous, sufficient, desired and
> > acquired resources (for parallelism and profile). Should we use those in
> > the UI as well? We might want to add tool tips to the headers as well to
> > add a description for each of the metrics.
> >
> >  Rescale History UI
> >
> > The history looks nice. What about making the duration of the inProgress
> > rescales dynamic, i.e. counting the seconds up from the start time? Keeping
> > the NA is also fine if the dynamic approach is too complicated.
> >
> > Best,
> > Matthias
> >
> > On Wed, Nov 5, 2025 at 11:24 AM Yuepeng Pan <[email protected]>
> wrote:
> >
> > > Bumping this thread. Thanks!
> > >
> > > Best regards,
> > > Yuepeng Pan
> > >
> > >
> > >
> > > On 2025/09/02 15:41:07 Yuepeng Pan wrote:
> > > > Hi, community.
> > > >
> > > >
> > > > At present, FLIP-495[1][2] has gone through a new round of
> discussions
> > > and a preliminary general consensus has been reached, which provides
> the
> > > necessary premise for the discussion of the current FLIP-487[3].
> > > >
> > > >
> > > > Therefore, I would like to resume the discussion on the current FLIP.
> > > >
> > > > The version of the current FLIP mainly covers and has completed the
> > > following two aspects of design:
> > > > - The REST API design for querying rescale history information
> > > > - The Web UI design for showing rescale history information
> > > >
> > > >
> > > > Looking forward to your comments and suggestions.
> > > >
> > > >
> > > > [1] https://lists.apache.org/thread/t3r9wdd5gpbqnvzw35kb3wb3d9brpnon
> > > > [2]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > > > [3]
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > > >
> > > >
> > > > Best regards,
> > > > Yuepeng Pan
> > > >
> > > >
> > > > ---- Replied Message ----
> > > > | From | Matthias Pohl<[email protected]> |
> > > > | Date | 12/2/2024 16:59 |
> > > > | To | <[email protected]> |
> > > > | Subject | Re: [DISCUSS] FLIP-487: Show history of rescales in Web
> UI
> > > for AdaptiveScheduler |
> > > > Hi Yuepeng,
> > > > thanks for the proposal. Having a way to see the history of rescales
> > is a
> > > > nice feature, I guess. I went over the draft and have a few
> questions:
> > > >
> > > > Can we reorganize the draft? Right now, we have some (for
> RescaleEvent,
> > > > Required/AcquiredParallelism) schema defined in the "Proposed
> Changes"
> > > > section and some other schema under "Public Interfaces". It would be
> > nice
> > > > to have this more organized.
> > > > Just as a suggestion: In the end the proposed changes should list the
> > > > different REST endpoints you want to introduce (including the
> > > corresponding
> > > > schemas for request and response).
> > > > ---
> > > > I'm also wondering whether it would make sense to focus on the REST
> > > > endpoints in this FLIP and put the UI work in a separate FLIP. WDYT?
> > > > Decreasing the scope would probably help handling the required
> changes.
> > > > ---
> > > > Have you considered adding the onChange event timestamp for a rescale
> > > event
> > > > as well? We introduced a separation of the job requirements change
> > event
> > > > and the actual rescale execution in FLIP-461 [1]. It might be worth
> > > > documenting the time when a change was monitored for the first time
> > that
> > > > triggered the rescale. WDYT?
> > > > ---
> > > > You're mentioning "comments" as a field of the RescaleEvent in your
> > > > proposal. What's the use-case here? Where are these comments from?
> > > >
> > > > (update)
> > > > A brief talk with Yuepeng on that topic revealed that the field is
> > > supposed
> > > > to be used for errors that occurred during the rescale operation. My
> > take
> > > > on that one:
> > > > - We might want to reconsider the field name in that case (maybe
> > > > errors_during_rescale?). "comments" seems to be quite generic.
> > > > - Additionally, shouldn't we make this a list of errors rather than a
> > > > String field?
> > > > - How certain are we that we can associate errors to the actual
> rescale
> > > > operation and rather than the error being caused by something else?
> > > > ---
> > > > In the schema of the RescaleEvent you describe the three different
> > > > ID/numbers in the following way:
> > > >
> > > > The ‘id’ is automatically incremental, The rescaleAttemptId is
> > generated
> > > > based on one specified resource-requirement and the attempt number is
> > > > generated based on rescaleAttemptId.
> > > >
> > > > But there is no "attempt number" mentioned in the RescaleEvent
> schema.
> > > > Additionally, what is the ID based on? Do we start from 0 and just
> > > > increment? Or do we want to have a mechanism that ensures that the
> IDs
> > > are
> > > > also unique/monotonically increasing after JobManager failovers?
> > > > ---
> > > > For the parallelism schema: I might be misreading the draft here but
> > > you're
> > > > proposing to use the subtask name as the ID to refer to the
> JobVertex?
> > > That
> > > > the name might become quite long. What about using the JobVertexID
> > here.
> > > > That would be also more aligned to how the parallelism is represented
> > by
> > > > the /jobs/<job-id>/resource-requirements endpoint. If we want to add
> > the
> > > > task name for readability purposes, we can still add this one as a
> > > taskName
> > > > field to the Required/AcquiredParallelism schema.
> > > > ---
> > > > Status field:
> > > > - What is the meaning of "TRYING"? I guess, we're more or less using
> > the
> > > > AdaptiveScheduler states here, aren't we? Can't we align/stick to the
> > > > naming that's defined in the AdaptiveScheduler state?
> > > > ---
> > > > Do we really need a new REST endpoint for the configuration? Can't we
> > get
> > > > the provided information already from the existing configuration
> > > endpoint?
> > > > That said, I still find it useful to have a config tab in the UI at
> the
> > > end.
> > > > ---
> > > > For the summary endpoint: I see similarities to the checkpoint
> summary
> > > > here. Not sure whether you already considered that but would it make
> > > sense
> > > > to align the field names in some way to have a consistent
> > look-and-feel?
> > > > I'm also wondering whether it makes sense to align the schema to have
> > > > something like latest rescale, failed rescale, ...
> > > >
> > > > Best,
> > > > Matthias
> > > >
> > > > [1]
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
> > > >
> > > > On Mon, Nov 25, 2024 at 11:24 AM yuanfeng hu <[email protected]>
> > > wrote:
> > > >
> > > > +1, I think this feature is very useful for adaptive scheduler.
> > > >
> > > > Yuepeng Pan <[email protected]> wrote on Fri, Nov 22, 2024 at 18:38:
> > > >
> > > > Hi community,
> > > >
> > > >
> > > >
> > > >
> > > > Currently, the Adaptive Scheduler already supports the REST API
> > > >
> > > > to manually adjust[1] the parallelism of jobs, which enhances the
> > > >
> > > > functionality of the Adaptive Scheduler.
> > > >
> > > > However, Adaptive Scheduler doesn't support displaying or tracing the
> > > > rescale history yet[2].
> > > >
> > > > This makes it inconvenient for users/devs to quickly obtain some
> > internal
> > > >
> > > > information about the rescale history of the Adaptive Scheduler.
> > > >
> > > > And showing the history of rescale events of AdaptiveScheduler in the
> > web
> > > >
> > > > UI is very useful for users to make the next step for jobs.
> > > >
> > > >
> > > >
> > > >
> > > > Therefore, I created the FLIP-487[3] doc to support
> > > >
> > > > 'Show history of rescales in Web UI for AdaptiveScheduler'.
> > > >
> > > > Please refer to the google document[3] for more details
> > > >
> > > > about the proposed design and implementation.
> > > >
> > > >
> > > >
> > > >
> > > > Looking forward to any feedback and opinions on this proposal.
> > > >
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > >
> > > > [2] https://issues.apache.org/jira/browse/FLINK-22258
> > > >
> > > > [3]
> > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?tab=t.0
> > > >
> > > >
> > > >
> > > >
> > > > Thank you very much.
> > > >
> > > >
> > > >
> > > >
> > > > Best,
> > > >
> > > > Regards.
> > > >
> > > > Yuepeng Pan
> > > >
> > > >
> > > >
> > > > --
> > > > Best,
> > > > Yuanfeng
> > > >
> > > >
> > >
> >
>
