[ 
https://issues.apache.org/jira/browse/FLINK-34227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17825698#comment-17825698
 ] 

Matthias Pohl edited comment on FLINK-34227 at 3/12/24 3:12 PM:
----------------------------------------------------------------

[~chesnay] supported the investigation. His findings are based around the 
question why the close call was actually triggered from within an IO thread:
# The {{JobMaster#onStop}} call triggers 
[JobMaster#stopJobExecution|https://github.com/apache/flink/blob/d6a4eb966fbc47277e07b79e7c64939a62eb1d54/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1048]
 which calls {{JobMaster#stopScheduling}} from within the JobMaster's main 
thread which calls {{AdaptiveScheduler#closeAsync}} in 
[JobMaster:1088|https://github.com/apache/flink/blob/d6a4eb966fbc47277e07b79e7c64939a62eb1d54/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1088].
# {{AdaptiveScheduler#closeAsync}} composes the resultFuture of the 
{{CheckpointsCleaner}} in 
[AdaptiveScheduler:581|https://github.com/apache/flink/blob/d4e0084649c019c536ee1e44bab15c8eca01bf13/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java#L581].
 The future is completed from within the {{ioExecutor}} that is used in the 
{{CheckpointsCleaner}}.
# The future is forwarded to {{JobMaster#stopJobExecution}} where the 
disconnect is triggered after the cleanup future is completed causing the 
disconnect to be executed in an IO thread rather than the JobMaster's main 
thread.

update: but that doesn't seem to be the problem either, because 
[JobMaster#stopJobExecution|https://github.com/apache/flink/blob/d6a4eb966fbc47277e07b79e7c64939a62eb1d54/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1048]
 is called from the main thread and {{FutureUtils#runAfterwards}} uses 
{{Executors#directExecutor}} internally to execute the callback. That should 
already ensure that the code is not executed in some other thread.


was (Author: mapohl):
[~chesnay] supported the investigation. His findings are based around the 
question why the close call was actually triggered from within an IO thread:
# The {{JobMaster#onStop}} call triggers 
[JobMaster#stopJobExecution|https://github.com/apache/flink/blob/d6a4eb966fbc47277e07b79e7c64939a62eb1d54/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1048]
 which calls {{JobMaster#stopScheduling}} from within the JobMaster's main 
thread which calls {{AdaptiveScheduler#closeAsync}} in 
[JobMaster:1088|https://github.com/apache/flink/blob/d6a4eb966fbc47277e07b79e7c64939a62eb1d54/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L1088].
# {{AdaptiveScheduler#closeAsync}} composes the resultFuture of the 
{{CheckpointsCleaner}} in 
[AdaptiveScheduler:581|https://github.com/apache/flink/blob/d4e0084649c019c536ee1e44bab15c8eca01bf13/flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveScheduler.java#L581].
 The future is completed from within the {{ioExecutor}} that is used in the 
{{CheckpointsCleaner}}.
# The future is forwarded to {{JobMaster#stopJobExecution}} where the 
disconnect is triggered after the cleanup future is completed causing the 
disconnect to be executed in an IO thread rather than the JobMaster's main 
thread.

> Job doesn't disconnect from ResourceManager
> -------------------------------------------
>
>                 Key: FLINK-34227
>                 URL: https://issues.apache.org/jira/browse/FLINK-34227
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.19.0, 1.18.1
>            Reporter: Matthias Pohl
>            Assignee: Matthias Pohl
>            Priority: Critical
>              Labels: github-actions, test-stability
>         Attachments: FLINK-34227.7e7d69daebb438b8d03b7392c9c55115.log, 
> FLINK-34227.log
>
>
> https://github.com/XComp/flink/actions/runs/7634987973/job/20800205972#step:10:14557
> {code}
> [...]
> "main" #1 prio=5 os_prio=0 tid=0x00007fcccc4b7000 nid=0x24ec0 waiting on 
> condition [0x00007fccce1eb000]
>    java.lang.Thread.State: WAITING (parking)
>       at sun.misc.Unsafe.park(Native Method)
>       - parking to wait for  <0x00000000bdd52618> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>       at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>       at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2131)
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2099)
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:2077)
>       at 
> org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.scala:876)
>       at 
> org.apache.flink.table.planner.runtime.stream.sql.WindowDistinctAggregateITCase.testHopWindow_Cube(WindowDistinctAggregateITCase.scala:550)
> [...]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to