[jira] [Closed] (FLINK-25855) DefaultDeclarativeSlotPool rejects offered slots when the job is restarting

Till Rohrmann (Jira) Mon, 31 Jan 2022 09:51:34 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-25855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Till Rohrmann closed FLINK-25855.
---------------------------------
    Fix Version/s: 1.15.0
       Resolution: Fixed

Fixed via fb14d4d9671eb91035d5103fb3ca814e5d02d6b6

> DefaultDeclarativeSlotPool rejects offered slots when the job is restarting
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-25855
>                 URL: https://issues.apache.org/jira/browse/FLINK-25855
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is 
> currently restarting. The problem is that in case of a job restart, the 
> scheduler sets the required resources to zero. Hence, all offered slots will 
> be rejected.
> This is a problem for local recovery because rejected slots will be freed by 
> the {{TaskExecutor}} and thereby all local state will be deleted.
> This problem only affects the {{DefaultScheduler}} since the 
> {{AdaptiveScheduler}} sets the required resources when transitioning into the 
> {{WaitingForResources}} state.
> Hence, in order to properly support local recovery, we need to handle this 
> situation. I see the following options:
> h4. Accept excess slots
> Accepting excess slots means that the {{DefaultDeclarativeSlotPool}} accepts 
> slots which exceed the currently required set of slots. 
> Advantages: 
> * Easy to implement
> Disadvantages:
> * Offered slots that are not really needed will only be freed after the idle 
> slot timeout. This means that some resources might be left unused for some 
> time.
> h4. Let DefaultDeclarativeSlotPool accept excess slots only when job is 
> restarting
> Here the idea is to only accept excess slots when the job is currently 
> restarting. This will require that the scheduler tells the 
> {{DefaultDeclarativeSlotPool}} about the restarting state.
> Advantages:
> * We would only accept excess slots for the time of restarting
> Disadvantages:
> * We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}. 
> Moreover, we are introducing additional signals that communicate the 
> restarting state to the pool.
> h4. Don't immediately free slots on the TaskExecutor when they are rejected
> Instead of freeing the slot immediately on the {{TaskExecutor}} after it is 
> rejected, we could retry the offer for some time and only free the slot after 
> a timeout.
> Advantages:
> * No changes on the JobMaster side needed.
> Disadvantages:
> * Complication of the slot lifecycle on the {{TaskExecutor}}
> * Unneeded slots are not made available for other jobs as fast as possible
> h4. Don't zero resource requirements during job restart
> Instead of zeroing the resource requirements during a job restart, we could 
> also keep the last known requirements. Once the job is restarted, we could 
> adjust the requirements.
> Advantages:
> * Conceptually easy to do
> Disadvantages:
> * The old requirements are not necessarily the same as the new ones
> * Complicates the logic in the scheduler
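
The acceptance decision described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `SlotOfferSketch`, `ToySlotPool`, `Policy`, and all method names are hypothetical and are not Flink's classes or API. It models the current behavior (reject anything beyond the requirements) next to the first two options (accept excess slots always, or only while restarting), for a job whose requirements were zeroed by a restart.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the slot-offer decision. Hypothetical names, not Flink's API. */
final class SlotOfferSketch {

    /** Hypothetical acceptance policies, loosely mirroring the ticket's options. */
    enum Policy { REJECT_EXCESS, ACCEPT_EXCESS, ACCEPT_EXCESS_WHILE_RESTARTING }

    static final class ToySlotPool {
        private final Policy policy;
        private final List<String> acceptedSlots = new ArrayList<>();
        private int requiredSlots;
        private boolean restarting;

        ToySlotPool(Policy policy) {
            this.policy = policy;
        }

        void setRequiredSlots(int requiredSlots) {
            this.requiredSlots = requiredSlots;
        }

        /** Models the extra "job is restarting" signal that the second option needs. */
        void setRestarting(boolean restarting) {
            this.restarting = restarting;
        }

        /**
         * Returns true if the offered slot is kept. A false result models a
         * rejection, after which the offering side would free the slot and
         * lose the local state attached to it.
         */
        boolean offerSlot(String slotId) {
            boolean needed = acceptedSlots.size() < requiredSlots;
            boolean excessAllowed =
                    policy == Policy.ACCEPT_EXCESS
                            || (policy == Policy.ACCEPT_EXCESS_WHILE_RESTARTING && restarting);
            if (needed || excessAllowed) {
                acceptedSlots.add(slotId);
                return true;
            }
            return false;
        }

        int acceptedCount() {
            return acceptedSlots.size();
        }
    }

    public static void main(String[] args) {
        for (Policy policy : Policy.values()) {
            ToySlotPool pool = new ToySlotPool(policy);
            // Model the restart: the scheduler zeroes the requirements.
            pool.setRestarting(true);
            pool.setRequiredSlots(0);
            boolean accepted = pool.offerSlot("slot-1");
            System.out.println(policy + " -> " + (accepted ? "accepted" : "rejected"));
        }
    }
}
```

In this model only `REJECT_EXCESS` turns the offer down while the requirements are zero. The other two policies keep the slot, and they differ only once `setRestarting(false)` has been called: `ACCEPT_EXCESS` keeps accepting unneeded slots, whereas `ACCEPT_EXCESS_WHILE_RESTARTING` goes back to rejecting them.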



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
