Hi, Matthias. Thank you for your review and Happy New Year!
a. About JSON schema:

> You are right. Existing fields shouldn't be modified. Only for new ones, we
> can make sure to not introduce more inconsistencies.
> In general, the problem is that the JSON formatting is not specified in the
> coding guidelines. That's why it comes with no surprise that these
> formatting inconsistencies exist. We would need to start a discussion on
> updating the Flink coding guidelines first. Only afterwards, we could fix
> the formatting.
> Such a change would need to be rolled out as part of a major version (e.g.
> 3.0) only, though.

Thanks for your confirmation & ideas. That sounds good to me!
I've created a new Jira ticket[1] so that community contributors can track
this new, independent piece of work.

b. About the durationInMillis attribute

Thanks for your response.
I removed durationInMillis from the corresponding JSON schemas of the REST API
interfaces and added a description of the reason for deprecating
'durationInMillis'.

Any input is appreciated!

[1] https://issues.apache.org/jira/browse/FLINK-38853

Best regards,
Yuepeng Pan

Matthias Pohl <[email protected]> wrote on Wed, Dec 31, 2025 at 22:34:

> Thanks for the quick response. I added my responses inline. PTAL
>
> Best,
> Matthias
>
> On Mon, 22 Dec 2025, 01:02 Yuepeng Pan, <[email protected]> wrote:
>
> > Hi, Matthias, I'm glad to see that email.
> > And thank you very much for your review and comments.
> >
> > To facilitate reading and discussion,
> > I have grouped related questions together as much as possible
> > when organizing my responses to your comments,
> > and I hope this will not cause any inconvenience.
> >
> >
> > 1. Reference typo & format.
> >
> > > Adaptive Scheduler will support record and query the rescale history
> > > in[2]
> > > Shouldn't it refer to reference #3, i.e. FLIP-495?
> > > nit: In the wiki, we do not need to add the references but use links with
> > > proper link text (e.g. in the motivation paragraph).
> > > That should improve readability.
> >
> > Thanks for the catch and the suggestions. That makes sense to me.
> > I corrected and reformatted the citation errors
> > and reference formats you mentioned throughout the entire document.
> >
> >
> > 2. Schemas:
> >
> > a. schema of the response for /jobs/overview
> >
> > > extended schema of the response for /jobs/overview
> >
> > > The extract of the schema extension is not precise: We should show, that
> > > the new fields are added to the item type
> > > (urn:jsonschema:org:apache:flink:runtime:messages:webmonitor:JobDetails).
> > > About the field name formatting of "job-type": We still do not have this
> > > one included in the code convention. But AFAIS, we usually follow camelCase
> > > format rather than kebab-casing. But especially the Job overview uses both
> > > already.
> >
> > Thanks for the comments.
> > That sounds good to me.
> > I have updated the corresponding accompanying changes to the JobDetails
> > class.
> >
> > b. schema of response for /jobs/:jobid/rescales
> >
> > > Schema of response for /jobs/:jobid/rescales
> > > I noticed that also for the other JSON schemas, we jump between formats
> > > (even introducing snake_casing). Let's unify them and stick to camelCase.
> > > WDYT?
> >
> > Nice idea!
> > Considering compatibility and the workload associated with this FLIP,
> > the existing fields are not modified in the current FLIP;
> > only the newly introduced fields are named
> > following the camelCase naming convention.
> > And I updated the lines about schemas that need to change.
> >
> > Regarding the naming style changes for all fields in schemas that are
> > modified (as opposed to newly introduced) within this FLIP, do we need a
> > new FLIP to address and unify such work?
> > This way, the new FLIP would focus solely on this type of task.
> > What do you think about it?
> >
>
> You are right. Existing fields shouldn't be modified.
> Only for new ones, we can make sure to not introduce more inconsistencies.
>
> In general, the problem is that the JSON formatting is not specified in the
> coding guidelines. That's why it comes with no surprise that these
> formatting inconsistencies exist. We would need to start a discussion on
> updating the Flink coding guidelines first. Only afterwards, we could fix
> the formatting.
>
> Such a change would need to be rolled out as part of a major version (e.g.
> 3.0) only, though.
>
> >
> > c. For "summary.rescaleCounts"
> >
> > > For "summary.rescaleCounts", we might not need to add the "_rescales"
> > > suffix to the record fields since the parent indicates already that all of
> > > the fields are rescale counts. We, therefore, could use "inProgress",
> > > "ignored", "completed", "failed".
> >
> > Yes, this indeed makes the expression more concise and to the point.
> > I updated this part.
> >
> > > Do we see value in adding the total
> > > value? That could be easily calculated using the other four metrics. Hence,
> > > I think we can consider it as being redundant and remove it.
> >
> > This is acceptable, as one of the differences lies in
> > whether the total value is calculated on the FE side or on the backend.
> >
> > d. rescalesDurationStats/rescales_duration_stats (the previous edition)
> >
> > > "rescales_duration_stats"
> > > For all the "durationStats"? Can we add the time unit to make things
> > > clearer, e.g. "rescalesDurationStats" becomes
> > > "rescalesDurationStatsInMillis"? ...same applies to the timestamps
> >
> > Good idea.
> > I updated the description of all attributes about timestamps.
> > Please help take a look!
> >
> > e. ignoredRescalesDurationStats/ignored_rescales_duration_stats (the
> > previous edition)
> >
> > > "ignored_rescales_duration_stats"
> > > Are the stats useful for rescales which were actually not executed?
> >
> > Answering this question may be a bit difficult for me.
> > In theory, since rescale operations of the Ignored type can occur,
> > it is reasonable to include them in the statistics, at least
> > from the perspective of having a complete set of dimensions.
> > In addition, I'm not certain whether users truly do not care
> > about statistics for this type of data.
> > Therefore, I kept it in the initial design document.
> > If you think it is unnecessary to retain this data,
> > we can exclude Ignored rescale types from the duration statistics.
> > I would appreciate your experience and opinion on this.
> >
>
> Fair enough.
>
> > f. the durationInMillis attribute.
> >
> > > duration
> > > Rescale details already contain the start and end time. Adding the duration
> > > here shouldn't be necessary.
> >
> > If the frontend page does not involve overly complex display logic,
> > adding an additional durationInMillis field here should be unnecessary.
> >
>
> Just to clarify: I don't suggest removing the duration information from the
> web UI. It's only obsolete in the REST API because it can be calculated on
> the client side.
>
> >
> > 3. UI
> >
> > a. Rescale History UI (related to 'durationInMillis' attribute)
> >
> > > Rescale History UI
> > > The history looks nice. What about making the duration of the inProgress
> > > rescales dynamic, i.e. counting the seconds up from the start time? Keeping
> > > the NA is also fine if the dynamic approach is too complicated.
> >
> > In my limited reading,
> > this is feasible from an implementation perspective,
> > though it may require some adjustments.
> > If we remove the durationInMillis field from rescale,
> > the frontend would need to perform some additional processing when
> > displaying the data.
> > For example:
> > rescale{terminalState=inProgress, startTimestampInMillis=1,
> > endTimestampInMillis=null, durationInMillis=3}
> > If we keep the durationInMillis field, the frontend would almost not need
> > any logic and could simply display the data as is.
> > If we do not keep the durationInMillis field, the frontend would need to do
> > two things when rendering:
> > - Calculate durationInMillis based on startTimestampInMillis and
> >   endTimestampInMillis
> > - When displaying records with terminalState = inProgress, show
> >   endTimestampInMillis as null
> >
> > Similarly, for handling durationInMillis in schedulerState,
> > I'm not sure whether such scenarios would arise,
> > although we have not yet considered
> > whether this data should be displayed in the same way as
> > Rescale.durationInMillis.
> > Although the difference is small,
> > it is worth clarifying so that we can better evaluate the decision.
> >
> > Therefore, please let me know your thoughts on
> > - whether we should keep the durationInMillis field for both Rescale and
> >   schedulerState in the schema
> > - Show N.A in the duration of InProgress Rescale and remove the
> >   durationInMillis in the related sub-json.
> > - Or something reasonable from you.
> >
>
> As mentioned in 2.f), I would remove the duration and calculate it
> dynamically in the client code. It shouldn't be a too complex operation and
> allows us to keep the duration dynamic for rescales in progress.
>
> >
> > b. Rescale Overview UI.
> >
> > > Rescale Overview UI
> > > The screenshot shows "Acquired profile" twice for the slot (based on the
> > > details UI, the first one is supposed to be "required").
> >
> > Sorry for the typo. I corrected it.
> >
> > > Additionally, in
> > > FLIP-495 we agreed on four metrics: previous, sufficient, desired and
> > > acquired resources (for parallelism and profile). Should we use those in
> > > the UI as well?
> >
> > Okay. Updated it in the related UI draft pages.
> >
> > > We might want to add tooltips to the headers as well to
> > > add a description for each of the metrics.
> >
> > > Could we add tooltips to the headers of the rescale overview to describe
> > > the different IDs?
> >
> > Yes, the suggestion is reasonable.
> > And I added the description of hint messages about some core header
> > attributes after the corresponding UI draft pages.
> > Looking forward to your opinion.
> >
> > 4. Items newly added by me:
> > I have added notes after some sections of the core UI pages regarding
> > limiting the displayed length of UUID-type identifiers and issues related
> > to task names.
> >
> > I'd greatly appreciate any suggestions you may have.
> >
> >
> > Best regards,
> > Yuepeng Pan
> >
> >
> > Matthias Pohl <[email protected]> wrote on Thu, Dec 18, 2025 at 18:08:
> >
> > > Hi Yuepeng,
> > > I finally found some time to look into that FLIP again. Sorry for the
> > > delay. Thanks for working on this topic and pushing it. Here are a few more
> > > comments on the current state of FLIP-487:
> > >
> > > Adaptive Scheduler will support record and query the rescale history
> > > in[2].
> > >
> > > Shouldn't it refer to reference #3, i.e. FLIP-495?
> > >
> > > nit: In the wiki, we do not need to add the references but use links with
> > > proper link text (e.g. in the motivation paragraph). That should improve
> > > readability.
> > >
> > > extended schema of the response for /jobs/overview
> > >
> > > The extract of the schema extension is not precise: We should show, that
> > > the new fields are added to the item type
> > > (urn:jsonschema:org:apache:flink:runtime:messages:webmonitor:JobDetails).
> > > About the field name formatting of "job-type": We still do not have this
> > > one included in the code convention. But AFAIS, we usually follow camelCase
> > > format rather than kebab-casing. But especially the Job overview uses both
> > > already.
> > >
> > > Could we add tool tips to the headers of the rescale overview to describe
> > > the different IDs?
> > >
> > > Schema of response for /jobs/:jobid/rescales
> > >
> > > I noticed that also for the other JSON schemas, we jump between formats
> > > (even introducing snake_casing).
> > > Let's unify them and stick to camelCase.
> > > WDYT?
> > >
> > > For "summary.rescaleCounts", we might not need to add the "_rescales"
> > > suffix to the record fields since the parent indicates already that all of
> > > the fields are rescale counts. We, therefore, could use "inProgress",
> > > "ignored", "completed", "failed". Do we see value in adding the total
> > > value? That could be easily calculated using the other four metrics. Hence,
> > > I think we can consider it as being redundant and remove it.
> > >
> > > "rescales_duration_stats"
> > >
> > > For all the "durationStats"? Can we add the time unit to make things
> > > clearer, e.g. "rescalesDurationStats" becomes
> > > "rescalesDurationStatsInMillis"? ...same applies to the timestamps
> > >
> > > "ignored_rescales_duration_stats"
> > >
> > > Are the stats useful for rescales which were actually not executed?
> > >
> > > duration
> > >
> > > Rescale details already contain the start and end time. Adding the duration
> > > here shouldn't be necessary.
> > >
> > > Rescale Overview UI
> > >
> > > The screenshot shows "Acquired profile" twice for the slot (based on the
> > > details UI, the first one is supposed to be "required"). Additionally, in
> > > FLIP-495 we agreed on four metrics: previous, sufficient, desired and
> > > acquired resources (for parallelism and profile). Should we use those in
> > > the UI as well? We might want to add tool tips to the headers as well to
> > > add a description for each of the metrics.
> > >
> > > Rescale History UI
> > >
> > > The history looks nice. What about making the duration of the inProgress
> > > rescales dynamic, i.e. counting the seconds up from the start time? Keeping
> > > the NA is also fine if the dynamic approach is too complicated.
> > >
> > > Best,
> > > Matthias
> > >
> > > On Wed, Nov 5, 2025 at 11:24 AM Yuepeng Pan <[email protected]> wrote:
> > >
> > > > Bumping this thread. Thanks!
> > > >
> > > > Best regards,
> > > > Yuepeng Pan
> > > >
> > > > On 2025/09/02 15:41:07 Yuepeng Pan wrote:
> > > > > Hi, community.
> > > > >
> > > > > At present, FLIP-495[1][2] has gone through a new round of discussions
> > > > > and a preliminary general consensus has been reached, which provides the
> > > > > necessary premise for the discussion of the current FLIP-487[3].
> > > > >
> > > > > Therefore, I would like to resume the discussion on the current FLIP.
> > > > >
> > > > > The version of the current FLIP mainly covers and has completed the
> > > > > following two aspects of design:
> > > > > - The REST API design for querying rescale history information
> > > > > - The Web UI design for showing rescale history information
> > > > >
> > > > > Looking forward to your comments and suggestions.
> > > > >
> > > > > [1] https://lists.apache.org/thread/t3r9wdd5gpbqnvzw35kb3wb3d9brpnon
> > > > > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-495%3A+Support+AdaptiveScheduler+record+and+query+the+rescale+history
> > > > > [3] https://cwiki.apache.org/confluence/display/FLINK/FLIP-487%3A+Show+history+of+rescales+in+Web+UI+for+AdaptiveScheduler
> > > > >
> > > > > Best regards,
> > > > > Yuepeng Pan
> > > > >
> > > > > ---- Replied Message ----
> > > > > | From | Matthias Pohl<[email protected]> |
> > > > > | Date | 12/2/2024 16:59 |
> > > > > | To | <[email protected]> |
> > > > > | Subject | Re: [DISCUSS] FLIP-487: Show history of rescales in Web UI for AdaptiveScheduler |
> > > > > Hi Yuepeng,
> > > > > thanks for the proposal. Having a way to see the history of rescales is a
> > > > > nice feature, I guess. I went over the draft and have a few questions:
> > > > >
> > > > > Can we reorganize the draft?
> > > > > Right now, we have some (for RescaleEvent,
> > > > > Required/AcquiredParallelism) schema defined in the "Proposed Changes"
> > > > > section and some other schema under "Public Interfaces". It would be nice
> > > > > to have this more organized.
> > > > > Just as a suggestion: In the end the proposed changes should list the
> > > > > different REST endpoints you want to introduce (including the corresponding
> > > > > schemas for request and response).
> > > > > ---
> > > > > I'm also wondering whether it would make sense to focus on the REST
> > > > > endpoints in this FLIP and put the UI work in a separate FLIP. WDYT?
> > > > > Decreasing the scope would probably help handling the required changes.
> > > > > ---
> > > > > Have you considered adding the onChange event timestamp for a rescale event
> > > > > as well? We introduced a separation of the job requirements change event
> > > > > and the actual rescale execution in FLIP-461 [1]. It might be worth
> > > > > documenting the time when a change was monitored for the first time that
> > > > > triggered the rescale. WDYT?
> > > > > ---
> > > > > You're mentioning "comments" as a field of the RescaleEvent in your
> > > > > proposal. What's the use-case here? Where are these comments from?
> > > > >
> > > > > (update)
> > > > > A brief talk with Yuepeng on that topic revealed that the field is supposed
> > > > > to be used for errors that occurred during the rescale operation. My take
> > > > > on that one:
> > > > > - We might want to reconsider the field name in that case (maybe
> > > > >   errors_during_rescale?). "comments" seems to be quite generic.
> > > > > - Additionally, shouldn't we make this a list of errors rather than a
> > > > >   String field?
> > > > > - How certain are we that we can associate errors to the actual rescale
> > > > >   operation rather than the error being caused by something else?
> > > > > ---
> > > > > In the schema of the RescaleEvent you describe the three different
> > > > > ID/numbers in the following way:
> > > > >
> > > > > The ‘id’ is automatically incremental, The rescaleAttemptId is generated
> > > > > based on one specified resource-requirement and the attempt number is
> > > > > generated based on rescaleAttemptId.
> > > > >
> > > > > But there is no "attempt number" mentioned in the RescaleEvent schema.
> > > > > Additionally, what is the ID based on? Do we start from 0 and just
> > > > > increment? Or do we want to have a mechanism that ensures that the IDs are
> > > > > also unique/monotonically increasing after JobManager failovers?
> > > > > ---
> > > > > For the parallelism schema: I might be misreading the draft here but you're
> > > > > proposing to use the subtask name as the ID to refer to the JobVertex? That
> > > > > the name might become quite long. What about using the JobVertexID here.
> > > > > That would be also more aligned to how the parallelism is represented by
> > > > > the /jobs/<job-id>/resource-requirements endpoint. If we want to add the
> > > > > task name for readability purposes, we can still add this one as a taskName
> > > > > field to the Required/AcquiredParallelism schema.
> > > > > ---
> > > > > Status field:
> > > > > - What is the meaning of "TRYING"? I guess, we're more or less using the
> > > > >   AdaptiveScheduler states here, aren't we? Can't we align/stick to the
> > > > >   naming that's defined in the AdaptiveScheduler state?
> > > > > ---
> > > > > Do we really need a new REST endpoint for the configuration? Can't we get
> > > > > the provided information already from the existing configuration endpoint?
> > > > > That said, I still find it useful to have a config tab in the UI at the
> > > > > end.
> > > > > ---
> > > > > For the summary endpoint: I see similarities to the checkpoint summary
> > > > > here. Not sure whether you already considered that but would it make sense
> > > > > to align the field names in some way to have a consistent look-and-feel?
> > > > > I'm also wondering whether it makes sense to align the schema to have
> > > > > something like latest rescale, failed rescale, ...
> > > > >
> > > > > Best,
> > > > > Matthias
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler
> > > > >
> > > > > On Mon, Nov 25, 2024 at 11:24 AM yuanfeng hu <[email protected]> wrote:
> > > > >
> > > > > +1, I think this feature is very useful for adaptive scheduler.
> > > > >
> > > > > Yuepeng Pan <[email protected]> wrote on Fri, Nov 22, 2024 at 18:38:
> > > > >
> > > > > Hi community,
> > > > >
> > > > > Currently, the Adaptive Scheduler already supports the REST API
> > > > > to manually adjust[1] the parallelism of jobs, which enhances the
> > > > > functionality of the Adaptive Scheduler.
> > > > > However, Adaptive Scheduler doesn't support displaying or tracing the
> > > > > rescale history yet[2].
> > > > > This makes it inconvenient for users/devs to quickly obtain some internal
> > > > > information about the rescale history of the Adaptive Scheduler.
> > > > > And showing the history of rescale events of AdaptiveScheduler in the web
> > > > > UI is very useful for users to make the next step for jobs.
> > > > >
> > > > > Therefore, I created the FLIP-487[3] doc to support
> > > > > 'Show history of rescales in Web UI for AdaptiveScheduler'.
> > > > > Please refer to the google document[3] for more details
> > > > > about the proposed design and implementation.
> > > > >
> > > > > Looking forward to any feedback and opinions on this proposal.
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > > > > [2] https://issues.apache.org/jira/browse/FLINK-22258
> > > > > [3] https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?tab=t.0
> > > > >
> > > > > Thank you very much.
> > > > >
> > > > > Best,
> > > > > Regards.
> > > > > Yuepeng Pan
> > > > >
> > > > > --
> > > > > Best,
> > > > > Yuanfeng
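
For reference, the two client-side calculations discussed in this thread (deriving a rescale's duration from its start and end timestamps, and deriving the total from the four rescale counts) could look roughly like the sketch below. This is only an illustration, not the actual Web UI code: the interface and function names are hypothetical, only the field names (startTimestampInMillis, endTimestampInMillis, inProgress, ignored, completed, failed) are taken from the discussion, and it assumes endTimestampInMillis is null while a rescale is still in progress.

```typescript
// Hypothetical shapes; only the field names come from the thread above.
interface RescaleTimes {
  startTimestampInMillis: number;
  // Assumption: null while the rescale is still in progress.
  endTimestampInMillis: number | null;
}

interface RescaleCounts {
  inProgress: number;
  ignored: number;
  completed: number;
  failed: number;
}

// Duration computed on the client. For an in-progress rescale the current
// time is used as the end, so the displayed value keeps counting up.
function rescaleDurationInMillis(
  rescale: RescaleTimes,
  nowInMillis: number = Date.now()
): number {
  const end =
    rescale.endTimestampInMillis !== null
      ? rescale.endTimestampInMillis
      : nowInMillis;
  // Guard against clock skew producing a negative duration.
  return Math.max(0, end - rescale.startTimestampInMillis);
}

// The total is redundant in the REST response: it is the sum of the four counts.
function totalRescales(counts: RescaleCounts): number {
  return counts.inProgress + counts.ignored + counts.completed + counts.failed;
}
```

With this approach the UI can re-evaluate rescaleDurationInMillis on a timer for in-progress rows, or fall back to showing N.A if the dynamic display turns out to be too complicated.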
