[ https://issues.apache.org/jira/browse/FLINK-9635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16546775#comment-16546775 ]
Deepak Sharma commented on FLINK-9635:
--------------------------------------

[~till.rohrmann], does this issue still need to be resolved, seeing as FLINK-9583 has been closed? I suppose this Jira tracks the long-term solution?

> Local recovery scheduling can cause spread out of tasks
> -------------------------------------------------------
>
>                 Key: FLINK-9635
>                 URL: https://issues.apache.org/jira/browse/FLINK-9635
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.7.0
>
>
> In order to make local recovery work, Flink's scheduling was changed such
> that a task tries to be rescheduled to its previous location. In order to not
> occupy slots which have the state of other tasks cached, the strategy will
> request a new slot if the old slot identified by the previous allocation id
> is no longer present. This also applies to newly allocated slots because
> there is no distinction between new and already used slots. This behaviour
> can cause every task to be deployed to its own slot if the {{SlotPool}} has
> released all slots in the meantime, for example. The consequence could be
> that a job can no longer be executed after a failure because it needs more
> slots than before.
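
The behaviour described in the issue can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not Flink's scheduler code: `ToySlotPool`, `Task`, `Slot` and `slots_needed` are hypothetical names, and slot sharing is reduced to "tasks with the same previous allocation id share one slot".

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Slot:
    allocation_id: str
    tasks: list = field(default_factory=list)  # tasks sharing this slot


@dataclass(frozen=True)
class Task:
    name: str
    previous_allocation_id: str  # where the task ran before the failure


class ToySlotPool:
    """Hypothetical stand-in for a slot pool; not Flink's SlotPool API."""

    def __init__(self, allocation_ids=()):
        self.slots = {a: Slot(a) for a in allocation_ids}
        self._fresh_ids = count(1)

    def release_all(self):
        # Models the pool having released all of its slots in the meantime.
        self.slots.clear()

    def schedule(self, task):
        slot = self.slots.get(task.previous_allocation_id)
        if slot is None:
            # The previous slot is gone, so a new slot is requested. Freshly
            # allocated slots are never reused here, because the lookup only
            # matches on the previous allocation id.
            slot = Slot(f"new-{next(self._fresh_ids)}")
            self.slots[slot.allocation_id] = slot
        slot.tasks.append(task.name)
        return slot.allocation_id


def slots_needed(pool, tasks):
    # Number of distinct slots occupied after scheduling all tasks.
    return len({pool.schedule(t) for t in tasks})


# Four tasks that shared two slots before the failure.
tasks = [
    Task("source-0", "a1"), Task("map-0", "a1"),
    Task("source-1", "a2"), Task("map-1", "a2"),
]

healthy = ToySlotPool(["a1", "a2"])
slots_before = slots_needed(healthy, tasks)  # 2: tasks return to their old slots

emptied = ToySlotPool(["a1", "a2"])
emptied.release_all()
slots_after = slots_needed(emptied, tasks)   # 4: every task gets its own slot
```

In this model the job needs two slots while the old allocations are still present, but four once the pool has been emptied, which mirrors the situation where a job can no longer be executed after a failure because it needs more slots than before.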