[ https://issues.apache.org/jira/browse/FLINK-25855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann closed FLINK-25855.
---------------------------------
    Fix Version/s: 1.15.0
       Resolution: Fixed

Fixed via fb14d4d9671eb91035d5103fb3ca814e5d02d6b6

> DefaultDeclarativeSlotPool rejects offered slots when the job is restarting
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-25855
>                 URL: https://issues.apache.org/jira/browse/FLINK-25855
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.14.3
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> The {{DefaultDeclarativeSlotPool}} rejects offered slots if the job is
> currently restarting. The problem is that, when a job restarts, the
> scheduler sets the required resources to zero. Hence, all offered slots
> will be rejected.
>
> This is a problem for local recovery because rejected slots will be freed
> by the {{TaskExecutor}} and thereby all local state will be deleted. Hence,
> in order to properly support local recovery, we need to handle this
> situation somehow.
>
> This problem only affects the {{DefaultScheduler}} since the
> {{AdaptiveScheduler}} sets the required resources when transitioning into
> the {{WaitingForResources}} state.
>
> I see the following options:
>
> h4. Accept excess slots
>
> Accepting excess slots means that the {{DefaultDeclarativeSlotPool}}
> accepts slots which exceed the currently required set of slots.
>
> Advantages:
> * Easy to implement
>
> Disadvantages:
> * Offered slots that are not really needed will only be freed after the
> idle slot timeout. This means that some resources might be left unused
> for some time.
>
> h4. Let DefaultDeclarativeSlotPool accept excess slots only when job is restarting
>
> Here the idea is to only accept excess slots when the job is currently
> restarting. This will require that the scheduler tells the
> {{DefaultDeclarativeSlotPool}} about the restarting state.
>
> Advantages:
> * We would only accept excess slots for the time of restarting
>
> Disadvantages:
> * We are complicating the semantics of the {{DefaultDeclarativeSlotPool}}.
> Moreover, we are introducing additional signals that communicate the
> restarting state to the pool.
>
> h4. Don't immediately free slots on the TaskExecutor when they are rejected
>
> Instead of freeing the slot immediately on the {{TaskExecutor}} after it is
> rejected, we could retry the offer for some time and only free the slot
> after a timeout.
>
> Advantages:
> * No changes on the JobMaster side needed.
>
> Disadvantages:
> * Complication of the slot lifecycle on the {{TaskExecutor}}
> * Unneeded slots are not made available for other jobs as fast as possible
>
> h4. Don't zero resource requirements during job restart
>
> Instead of zeroing the resource requirements during a job restart, we could
> keep the last known requirements. Once the job is restarted, we could
> adjust the requirements.
>
> Advantages:
> * Conceptually easy to do
>
> Disadvantages:
> * The old requirements are not necessarily the same as the new ones
> * Complicates the logic in the scheduler

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
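
To make the rejection behaviour described in the issue concrete, here is a minimal toy model of a declarative slot pool. It is a sketch only and not Flink's {{DefaultDeclarativeSlotPool}}: the class name SlotPoolSketch, the acceptExcessSlots flag, the offerSlots method and the use of plain strings as slot IDs are all assumptions made for illustration. It contrasts the reported behaviour (requirements zeroed during restart, so every offer is rejected) with the "Accept excess slots" option. It does not claim to show what the fix commit actually changed.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical toy model of a declarative slot pool.
 * Not Flink's implementation; all names here are illustrative assumptions.
 */
final class SlotPoolSketch {
    private final boolean acceptExcessSlots;
    private final List<String> heldSlots = new ArrayList<>();
    private int requiredSlots;

    SlotPoolSketch(boolean acceptExcessSlots) {
        this.acceptExcessSlots = acceptExcessSlots;
    }

    /** The scheduler declares how many slots the job currently needs. */
    void setRequiredSlots(int requiredSlots) {
        this.requiredSlots = requiredSlots;
    }

    /** Returns the offered slots the pool accepts; the rest count as rejected. */
    List<String> offerSlots(List<String> offered) {
        List<String> accepted = new ArrayList<>();
        for (String slot : offered) {
            boolean stillNeeded = heldSlots.size() < requiredSlots;
            if (stillNeeded || acceptExcessSlots) {
                heldSlots.add(slot);
                accepted.add(slot);
            }
            // Otherwise the slot is rejected. Per the issue text, the
            // TaskExecutor would then free it and delete its local state.
        }
        return accepted;
    }

    int heldSlotCount() {
        return heldSlots.size();
    }
}
```

With requirements set to zero, a pool created with acceptExcessSlots = false accepts nothing, which is the situation that breaks local recovery. A pool created with acceptExcessSlots = true keeps the offered slots, at the cost described above: slots that turn out to be unneeded stay held until something (in the issue text, the idle slot timeout) releases them.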