[ 
https://issues.apache.org/jira/browse/FLINK-9324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470253#comment-16470253
 ] 

ASF GitHub Bot commented on FLINK-9324:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5980#discussion_r187307945
  
    --- Diff: 
flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/slotpool/SingleLogicalSlot.java
 ---
    @@ -159,10 +159,51 @@ public SlotSharingGroupId getSlotSharingGroupId() {
         * the logical slot.
         *
         * @param cause of the payload release
    -    * @return true if the logical slot's payload could be released, 
otherwise false
         */
        @Override
    -   public boolean release(Throwable cause) {
    -           return releaseSlot(cause).isDone();
    +   public void release(Throwable cause) {
    +           if (STATE_UPDATER.compareAndSet(this, State.ALIVE, 
State.RELEASING)) {
    +                   signalPayloadRelease(cause);
    +           }
    +           state = State.RELEASED;
    +           releaseFuture.complete(null);
    +   }
    +
    +   private CompletableFuture<?> signalPayloadRelease(Throwable cause) {
    +           tryAssignPayload(TERMINATED_PAYLOAD);
    +           payload.fail(cause);
    +
    +           return payload.getTerminalStateFuture();
    +   }
    +
    +   private void returnSlotToOwner(CompletableFuture<?> 
terminalStateFuture) {
    +           final CompletableFuture<Boolean> slotReturnFuture = 
terminalStateFuture.handle((Object ignored, Throwable throwable) -> {
    +                   if (state == State.RELEASING) {
    --- End diff --
    
    After thinking about this part, I think also a double check won't strictly 
prevent this from happening. The problem is that we don't know when the 
`SlotOwner` processes the return slot message. What could happen is the 
following: We send the release message and see afterwards that the state is 
still `RELEASING`. Before the `SlotOwner` processes the release slot message, 
it will trigger the `AllocatedSlot.Payload#release` call which releases the 
slot from the `SlotOwners` side. After this has happened, the owner processes 
the release message. Consequently, the slot owner has to work in both cases 
(`RELEASING` and `RELEASED`). With the double check we only make it less likely 
to happen.


> SingleLogicalSlot returns completed release future before slot is properly 
> returned
> -----------------------------------------------------------------------------------
>
>                 Key: FLINK-9324
>                 URL: https://issues.apache.org/jira/browse/FLINK-9324
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0, 1.6.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> The {{SingleLogicalSlot#releaseSlot}} method returns a future which is 
> completed once the slot has been returned to the {{SlotOwner}}. 
> Unfortunately, we don't wait for the {{SlotOwner's}} response to complete the 
> future but complete it directly after the call has been made. This causes 
> that the {{ExecutionGraph}} can get restarted in case of a recovery before 
> all of its slots have been returned to the {{SlotPool}}. As a consequence, 
> the allocation of the new tasks might require more than the max parallelism 
> because of collisions with old tasks (in case of slot sharing).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to