[ https://issues.apache.org/jira/browse/FLINK-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470253#comment-16470253 ]
ASF GitHub Bot commented on FLINK-9324: --------------------------------------- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/5980#discussion_r187307945 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SingleLogicalSlot.java --- @@ -159,10 +159,51 @@ public SlotSharingGroupId getSlotSharingGroupId() { * the logical slot. * * @param cause of the payload release - * @return true if the logical slot's payload could be released, otherwise false */ @Override - public boolean release(Throwable cause) { - return releaseSlot(cause).isDone(); + public void release(Throwable cause) { + if (STATE_UPDATER.compareAndSet(this, State.ALIVE, State.RELEASING)) { + signalPayloadRelease(cause); + } + state = State.RELEASED; + releaseFuture.complete(null); + } + + private CompletableFuture<?> signalPayloadRelease(Throwable cause) { + tryAssignPayload(TERMINATED_PAYLOAD); + payload.fail(cause); + + return payload.getTerminalStateFuture(); + } + + private void returnSlotToOwner(CompletableFuture<?> terminalStateFuture) { + final CompletableFuture<Boolean> slotReturnFuture = terminalStateFuture.handle((Object ignored, Throwable throwable) -> { + if (state == State.RELEASING) { --- End diff -- After thinking about this part, I think also a double check won't strictly prevent this from happening. The problem is that we don't know when the `SlotOwner` processes the return slot message. What could happen is the following: We send the release message and see afterwards that the state is still `RELEASING`. Before the `SlotOwner` processes the release slot message, it will trigger the `AllocatedSlot.Payload#release` call which releases the slot from the `SlotOwners` side. After this has happened, the owner processes the release message. Consequently, the slot owner has to work in both cases (`RELEASING` and `RELEASED`). With the double check we only make it less likely to happen. > SingleLogicalSlot returns completed release future before slot is properly > returned > ----------------------------------------------------------------------------------- > > Key: FLINK-9324 > URL: https://issues.apache.org/jira/browse/FLINK-9324 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination > Affects Versions: 1.5.0, 1.6.0 > Reporter: Till Rohrmann > Assignee: Till Rohrmann > Priority: Blocker > Labels: flip-6 > Fix For: 1.5.0 > > > The {{SingleLogicalSlot#releaseSlot}} method returns a future which is > completed once the slot has been returned to the {{SlotOwner}}. > Unfortunately, we don't wait for the {{SlotOwner's}} response to complete the > future but complete it directly after the call has been made. This causes > that the {{ExecutionGraph}} can get restarted in case of a recovery before > all of its slots have been returned to the {{SlotPool}}. As a consequence, > the allocation of the new tasks might require more than the max parallelism > because of collisions with old tasks (in case of slot sharing). -- This message was sent by Atlassian JIRA (v7.6.3#76005)