Thanks very much, Matthias, for sharing your suggestions.

On the whole, the ideas make sense to me, and I'd be happy to update the wiki accordingly.

I just want to confirm a few questions before starting the next round of wiki 
adjustments:

1 - About the Boundary Definition of Rescale Events

> A few other items I'd like to point out are the following ones:
> - Is the "Rescale Event" section still out-dated or do we have a different
> understanding of what a rescale operation is? Based on what I pointed out
> in my previous post, I would think that a rescale operation has its
> starting point in the AdaptiveScheduler's Executing state (i.e. when the
> job is running). That's how it is implemented right now. The "Rescale
> Event" section always starts from WaitingForResources, though. Do we
> disagree here?

Sorry for not expressing this part clearly earlier.

IIUC, based on the Adaptive Scheduler state diagram [1], when a
stop-with-savepoint operation fails but the job is restartable, the scheduler
transitions to the WaitingForResources state, which implies that a rescaling
process may subsequently be triggered.

Based on the current logic, a rescale may be triggered under the following
circumstances, called 'rescale triggers' here:

- updateJobResourceRequirements
- restart due to a recoverable failure
- newResourcesAvailable

From Fig. [1], the entire chain of rescale-related scheduler states typically
involves one of the following loops:

- WaitingForResources -> CreatingExecutionGraph -> Executing ->
  StopWithSavepoint (error & restartable) -> Restarting -> WaitingForResources
- CreatingExecutionGraph [-> WaitingForResources -> CreatingExecutionGraph]
  (optional loop) -> Executing (rescale triggers) -> Restarting ->
  CreatingExecutionGraph

Following your shared idea:

> Based on what I pointed out in my previous post, I would think that a
> rescale operation has its starting point in the AdaptiveScheduler's
> Executing state (i.e. when the job is running).

I attempted to interpret the boundary definition of rescale events
accordingly. Would the historical states of the Adaptive Scheduler during a
successful rescale event likely match one of the following patterns?

- (Starting) Executing (rescale triggers) -> Restarting ->
  CreatingExecutionGraph [-> WaitingForResources -> CreatingExecutionGraph]
  -> (Ending) Executing
- (Starting) Executing -> StopWithSavepoint (error & restartable) ->
  Restarting -> WaitingForResources -> CreatingExecutionGraph
  [-> WaitingForResources -> CreatingExecutionGraph] -> (Ending) Executing
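To make the boundary question concrete, here is a minimal sketch of the definition I have in mind. All names (the State enum, isCompleteRescaleEvent) are illustrative only and are not actual AdaptiveScheduler classes:

```java
import java.util.List;

// Hypothetical sketch only -- the enum values and the helper below are
// illustrative names, not actual Flink AdaptiveScheduler classes.
public class RescaleBoundarySketch {

    // Scheduler states that appear in the transition chains above.
    enum State {
        WAITING_FOR_RESOURCES,
        CREATING_EXECUTION_GRAPH,
        EXECUTING,
        RESTARTING,
        STOP_WITH_SAVEPOINT
    }

    // Under the proposed boundary definition, a complete rescale event is a
    // state trace that both starts and ends in Executing: it is triggered
    // while the job is running and finishes once the job is running again
    // with the new parallelism.
    static boolean isCompleteRescaleEvent(List<State> trace) {
        return trace.size() >= 2
                && trace.get(0) == State.EXECUTING
                && trace.get(trace.size() - 1) == State.EXECUTING;
    }

    public static void main(String[] args) {
        // First pattern from above: Executing -> Restarting ->
        // CreatingExecutionGraph -> Executing.
        List<State> pattern = List.of(
                State.EXECUTING,
                State.RESTARTING,
                State.CREATING_EXECUTION_GRAPH,
                State.EXECUTING);
        System.out.println(isCompleteRescaleEvent(pattern)); // true
    }
}
```

If we agree on this, the rescale history would only ever record traces that begin in Executing, and WaitingForResources would merely be an optional intermediate state rather than a starting point.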

If my understanding is incorrect, please feel free to correct me.

2 - About the Status of a Rescale Event

> - Rescale status: I would say that this section needs to be reworked. 

Do I understand correctly that a Rescale Event does indeed need some status
information, such as FAILED, IGNORED, SUCCESS, PENDING, etc., to indicate the
final status of a completed event or the current status of an ongoing one?
In other words, the current status fields and their associations are
unreasonable, so they need to be redesigned rather than this descriptive
mechanism being discarded entirely?
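For clarity, this is roughly the shape I have in mind. RescaleStatus and isTerminal are hypothetical names for the mechanism under discussion, not an existing Flink or FLIP-495 API:

```java
// Illustrative sketch only -- RescaleStatus and isTerminal are hypothetical
// names for the status mechanism under discussion, not an existing API.
public class RescaleStatusSketch {

    enum RescaleStatus {
        PENDING,  // a rescale has been triggered but has not completed yet
        SUCCESS,  // the scheduler reached Executing with the new parallelism
        FAILED,   // the rescale attempt ended with a failure
        IGNORED   // the trigger arrived outside Executing (e.g. during
                  // StopWithSavepoint) and was dropped
    }

    // Terminal statuses describe a completed event; PENDING describes an
    // ongoing one.
    static boolean isTerminal(RescaleStatus status) {
        return status != RescaleStatus.PENDING;
    }

    public static void main(String[] args) {
        System.out.println(isTerminal(RescaleStatus.IGNORED)); // true
        System.out.println(isTerminal(RescaleStatus.PENDING)); // false
    }
}
```

The IGNORED value in particular would cover the updateJobResourceRequirements calls that arrive while the scheduler is not in Executing, which you mentioned can still be collected for the rescale history.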

[1]https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=334760525#id-[WIP]FLIP495:SupportAdaptiveSchedulerrecordandquerytherescalehistory-CurrentadaptiveschedulerstatetransitionFig.

Thank you.


Best,

Yuepeng

At 2025-01-03 01:15:54, "Matthias Pohl" <map...@apache.org> wrote:
>Thanks Yuepeng for your response. I added my comments to the individual
>paragraphs below:
>
>
>> Thank you very much for the reminding The proposal makes sense to me.
>> Additionally, I'd like to confirm whether each rescale cycle/event
>> requires a status field, such as FAILED, IGNORED, SUCCESS, PENDING, etc. If
>> such state fields are not needed, how do we record that a particular
>> rescale request was ignored? Or do we not care about this situation and
>> only plan to record successful rescale events?
>>
>
>Ignored events (due to the AdaptiveScheduler not being in Executing state)
>can be still collected by the AdaptiveScheduler in
>updateJobResourceRequirements [1]. That's where the rescale history could
>be populated.
>
>Also keep in mind that it's not only the JobResourceRequirements update
>that can trigger a rescale. Updating the available resources through
>newResourcesAvailable [2] can also initiate a rescale. I don't see this
>being clearly laid out in FLIP-495, yet.
>
>[1]
>https://github.com/apache/flink/blob/a5258a015d553196d34e082e75ca4ae916addbaf/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java#L1115
>
>[2]
>https://github.com/apache/flink/blob/fc0ccf325527c3589d5cd5ae7397b22c22321cec/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/ResourceListener.java#L28
>
>Yes, this could happen during the reserving slots phase, and it's important
>> to record this. In my limited read, this is feasible. We can collect the
>> exceptions from this phase while gathering the scheduler state history, and
>> record the specific information using the previously mentioned exception
>> field and comments field, or use failed status  as the final status of the
>> rescale event/record, WDYTA?
>>
>
>I struggle to follow why we put this in a comment and/or exception field.
>Shouldn't we have the information provided in structured fields for each
>event? Something like:
>- pre-rescale resources
>- on-rescale-trigger sufficient resources
>- on-rescale-trigger desired resources
>- actual resources used when rescaling finished (i.e. reaching Executing
>state)
> That would give a clear summary of what was available when deciding to
>rescale and what was used in the end. WDYT?
>
>
>> > I feel like on disk approach (analogously to the exception
>> >history) makes the most sense here. WDYT?
>>
>> Sorry,Matthias, IIUC,. If the storage mechanism here is similar to that of
>> the exception history, then we should choose the DFS approach, such as
>> HDFS. Please correct me if I’m wrong.
>>
>
>I agree that we might want to have this information being stored in DFS.
>That way the information would survive a JobManager failover. It would be
>still nice to have all three options being reflected in the FLIP, though,
>with the pro's and con's having the feature properly documented.
>
>As a side note: There is also FLIP-360 [3] which proposes merging the
>ExecutionGraphInfoStore and the JobResultStore into a single component.
>That would give us a single store for any completed job that could include
>the job's result, its exception history and the rescale history. No need to
>rely on the HistoryServer anymore. But that's out-of-scope for FLIP-495.
>But implementing the JM-local-disk approach and working on FLIP-360 would
>be another option.
>
>[3]
>https://cwiki.apache.org/confluence/display/FLINK/FLIP-360%3A+Merging+the+ExecutionGraphInfoStore+and+the+JobResultStore+into+a+single+component+CompletedJobStore
>
>
>> BTW, the subsequent FLIP content will be maintained in the wiki page, and
>> the version in Google Docs will be deprecated.
>>
>
>A few other items I'd like to point out are the following ones:
>- Is the "Rescale Event" section still out-dated or do we have a different
>understanding of what a rescale operation is? Based on what I pointed out
>in my previous post, I would think that a rescale operation has its
>starting point in the AdaptiveScheduler's Executing state (i.e. when the
>job is running). That's how it is implemented right now. The "Rescale
>Event" section always starts from WaitingForResources, though. Do we
>disagree here?
>
>- Rescale status: I would say that this section needs to be reworked. For
>instance:
>  - Why is a rescale having the STARTING status when the scheduler is in
>StopWithSavepoint state? We shouldn't rescale if the user wants to stop the
>job? Right now, rescale events are ignored in StopWithSavepoint.
>  - The section seems to be out-dated. I already pointed out in my previous
>message that the rescale operation doesn't touch WaitingForResources but
>goes from Restarting straight to CreatingExecutionGraph.
>  - Can the content of this section be visualized in a proper control flow
>diagram, instead? That might help understanding the goal of this section a
>bit more.
>
>- The Rescale ID section:
>  - You're talking about resourceRequirementsEpochID, rescale ID and
>rescale attempt here: The resourceRequirementsEpochID is used as a general
>UUID, the attempt ID is a monotonically increasing number in the scope of a
>single resource requirements update. And the rescale ID is a globally
>monotonically increasing ID? Can you elaborate on the purpose of each of
>the IDs? Why do we need two globally-scoped IDs here?
>  - Can you document what an attempt is? Is a failed attempt defined by the
>outcome of the rescale operation (i.e. the desired parallelism isn't
>reached)?
>
>Generally, it might help to add more diagrams (especially for documenting
>state machines). That might be easier to understand than plain text.
>
>I'm looking forward to your response.
>
>Best,
>Matthias
