Re: [DISCUSS] FLIP-487: Show history of rescales in Web UI for AdaptiveScheduler

Matthias Pohl Mon, 02 Dec 2024 01:01:25 -0800

Hi Yuepeng,
thanks for the proposal. Having a way to see the history of rescales is a
nice feature, I guess. I went over the draft and have a few questions:


Can we reorganize the draft? Right now, some schemas (for RescaleEvent and
Required/AcquiredParallelism) are defined in the "Proposed Changes" section
and others under "Public Interfaces". It would be nice to have this more
organized.
Just as a suggestion: In the end the proposed changes should list the
different REST endpoints you want to introduce (including the corresponding
schemas for request and response).
---
I'm also wondering whether it would make sense to focus on the REST
endpoints in this FLIP and put the UI work in a separate FLIP. WDYT?
Decreasing the scope would probably help handling the required changes.
---
Have you considered adding the onChange event timestamp for a rescale event
as well? We introduced a separation of the job requirements change event
and the actual rescale execution in FLIP-461 [1]. It might be worth
documenting the time when a change was monitored for the first time that
triggered the rescale. WDYT?
---
You're mentioning "comments" as a field of the RescaleEvent in your
proposal. What's the use-case here? Where are these comments from?

(update)
A brief talk with Yuepeng on that topic revealed that the field is supposed
to be used for errors that occurred during the rescale operation. My take
on that one:
- We might want to reconsider the field name in that case (maybe
errors_during_rescale?). "comments" seems to be quite generic.
- Additionally, shouldn't we make this a list of errors rather than a
String field?
- How certain are we that we can associate errors with the actual rescale
operation, rather than the error being caused by something else?
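
To illustrate the list-of-errors suggestion, here is a minimal sketch. The
class name, the field name errorsDuringRescale, and the use of plain strings
for error messages are all assumptions made for illustration; none of them
comes from the draft.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: names are illustrative and not taken from the FLIP draft.
final class RescaleErrorsSketch {
    // One entry per error observed while the rescale was in progress.
    private final List<String> errorsDuringRescale = new ArrayList<>();

    void recordError(String message) {
        errorsDuringRescale.add(message);
    }

    // Read-only view, in the order the errors were recorded.
    List<String> getErrorsDuringRescale() {
        return Collections.unmodifiableList(errorsDuringRescale);
    }
}
```

A list keeps each error separate, so a client does not have to parse a
concatenated string.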
---
In the schema of the RescaleEvent you describe the three different
ID/numbers in the following way:

> The ‘id’ is automatically incremental, The rescaleAttemptId is generated
> based on one specified resource-requirement and the attempt number is
> generated based on rescaleAttemptId.

But there is no "attempt number" mentioned in the RescaleEvent schema.
Additionally, what is the ID based on? Do we start from 0 and just
increment? Or do we want to have a mechanism that ensures that the IDs are
also unique/monotonically increasing after JobManager failovers?
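
To make the failover question concrete, here is a minimal sketch of one
possible answer: seed the counter from the highest ID that was known before
the failover. RescaleIdGenerator and lastPersistedId are hypothetical names,
and where the last ID would be persisted is deliberately left open. Nothing
here is taken from the draft.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: not part of FLIP-487 or of Flink itself.
final class RescaleIdGenerator {
    private final AtomicLong lastId;

    // lastPersistedId: highest ID recovered after a failover, or -1 if no
    // rescale has been recorded yet (assumption: IDs start at 0).
    RescaleIdGenerator(long lastPersistedId) {
        this.lastId = new AtomicLong(lastPersistedId);
    }

    // Returns the next ID, strictly greater than every ID handed out before,
    // including those issued before the failover.
    long nextId() {
        return lastId.incrementAndGet();
    }
}
```

A generator that always starts from 0 would instead reuse IDs after a
JobManager failover, which is the case the question above is about.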
---
For the parallelism schema: I might be misreading the draft here but you're
proposing to use the subtask name as the ID to refer to the JobVertex? The
name might become quite long. What about using the JobVertexID here?
That would also be more aligned with how the parallelism is represented by
the /jobs/<job-id>/resource-requirements endpoint. If we want to add the
task name for readability purposes, we can still add this one as a taskName
field to the Required/AcquiredParallelism schema.
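
As a rough sketch of that shape: the map below is keyed by the JobVertexID in
its string form, and the task name is carried only as an extra field. All
class and field names, as well as the vertex ID and the numbers, are made up
for illustration and are not taken from the draft.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names are illustrative and not taken from the FLIP draft.
final class VertexParallelismSketch {
    final String taskName; // for readability only, not used as the key
    final int requiredParallelism;
    final int acquiredParallelism;

    VertexParallelismSketch(String taskName, int requiredParallelism, int acquiredParallelism) {
        this.taskName = taskName;
        this.requiredParallelism = requiredParallelism;
        this.acquiredParallelism = acquiredParallelism;
    }
}

final class RescaleParallelismSketch {
    // Keyed by the JobVertexID string; the ID and values below are made up.
    static Map<String, VertexParallelismSketch> example() {
        Map<String, VertexParallelismSketch> byVertexId = new LinkedHashMap<>();
        byVertexId.put(
                "bc764cd8ddf7a0cff126f51c16239658",
                new VertexParallelismSketch("Source: Custom Source -> Map", 4, 2));
        return byVertexId;
    }
}
```

The key stays short and fixed-length however long the chained task name
becomes.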
---
Status field:
- What is the meaning of "TRYING"? I guess, we're more or less using the
AdaptiveScheduler states here, aren't we? Can't we align/stick to the
naming that's defined in the AdaptiveScheduler state?
---
Do we really need a new REST endpoint for the configuration? Can't we get
the provided information already from the existing configuration endpoint?
That said, I still find it useful to have a config tab in the UI at the end.
---
For the summary endpoint: I see similarities to the checkpoint summary
here. Not sure whether you already considered that but would it make sense
to align the field names in some way to have a consistent look-and-feel?
I'm also wondering whether it makes sense to align the schema to have
something like latest rescale, failed rescale, ...

Best,
Matthias

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-461%3A+Synchronize+rescaling+with+checkpoint+creation+to+minimize+reprocessing+for+the+AdaptiveScheduler

On Mon, Nov 25, 2024 at 11:24 AM yuanfeng hu <[email protected]> wrote:

> +1, I think this feature is very useful for adaptive scheduler.
>
> On Fri, Nov 22, 2024 at 18:38, Yuepeng Pan <[email protected]> wrote:
>
> > Hi community,
> >
> > Currently, the Adaptive Scheduler already supports the REST API
> > to manually adjust[1] the parallelism of jobs, which enhances the
> > functionality of the Adaptive Scheduler.
> > However, Adaptive Scheduler doesn't support displaying or tracing the
> > rescale history yet[2].
> > This makes it inconvenient for users/devs to quickly obtain some internal
> > information about the rescale history of the Adaptive Scheduler.
> > And showing the history of rescale events of AdaptiveScheduler in the web
> > UI is very useful for users to make the next step for jobs.
> >
> > Therefore, I created the FLIP-487[3] doc to support
> > 'Show history of rescales in Web UI for AdaptiveScheduler'.
> > Please refer to the google document[3] for more details
> > about the proposed design and implementation.
> >
> > Looking forward to any feedback and opinions on this proposal.
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-291%3A+Externalized+Declarative+Resource+Management
> > [2] https://issues.apache.org/jira/browse/FLINK-22258
> > [3]
> > https://docs.google.com/document/d/1WrLBkSkYe2tBQ3j66gKHFr2OB0d1HuHKDrRVr6B8nkM/edit?tab=t.0
> >
> > Thank you very much.
> >
> > Best,
> > Regards.
> > Yuepeng Pan
>
>
>
> --
> Best,
> Yuanfeng
>
