Matthias Pohl created FLINK-36194:
-------------------------------------

             Summary: Shutdown hook for ExecutionGraphInfo store runs 
concurrently to cluster shutdown hook causing race conditions
                 Key: FLINK-36194
                 URL: https://issues.apache.org/jira/browse/FLINK-36194
             Project: Flink
          Issue Type: Technical Debt
          Components: Runtime / Coordination
    Affects Versions: 1.19.1, 1.20.0, 2.0.0
            Reporter: Matthias Pohl


A {{FileNotFoundException}} is logged when the cluster is shut down while 
jobs are still running:
{code}
/tmp/executionGraphStore-b2cb1190-2c4d-4021-a73d-8b15027860df/8f6abf294a46345d331590890f7e7c37 (No such file or directory)

java.io.FileNotFoundException: /tmp/executionGraphStore-b2cb1190-2c4d-4021-a73d-8b15027860df/8f6abf294a46345d331590890f7e7c37 (No such file or directory)
        at java.base/java.io.FileOutputStream.open0(Native Method)
        at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
        at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
        at org.apache.flink.runtime.dispatcher.FileExecutionGraphInfoStore.storeExecutionGraphInfo(FileExecutionGraphInfoStore.java:281)
        at org.apache.flink.runtime.dispatcher.FileExecutionGraphInfoStore.put(FileExecutionGraphInfoStore.java:203)
        at org.apache.flink.runtime.dispatcher.Dispatcher.writeToExecutionGraphInfoStore(Dispatcher.java:1427)
        at org.apache.flink.runtime.dispatcher.Dispatcher.jobReachedTerminalState(Dispatcher.java:1357)
        at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:750)
        at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$6(Dispatcher.java:700)
        at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
        at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
        at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
[...]
{code}

This is caused by concurrent shutdown logic that is triggered through the 
{{FileExecutionGraphInfoStore}} shutdown hook. The shutdown hook calls close on 
the store, which deletes the store's temporary directory.

The cluster shutdown, which runs concurrently, tries to suspend all running 
jobs. The JobManagerRunners then try to write their {{ExecutionGraphInfo}} to 
the store, which fails because the temporary directory has already been deleted.
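
The ordering problem can be illustrated with a small standalone sketch. It is 
not Flink code: {{TempDirStore}} and its methods are hypothetical stand-ins for 
a file-backed store whose close logic deletes its temporary directory. The 
steps run sequentially here so that the outcome is deterministic; in the real 
scenario the close is triggered by a shutdown hook on another thread.

```java
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical stand-in for a file-backed store; NOT Flink's FileExecutionGraphInfoStore.
class TempDirStore implements AutoCloseable {
    private final Path dir;

    TempDirStore() throws IOException {
        this.dir = Files.createTempDirectory("executionGraphStore-");
    }

    // Writes the payload to <dir>/<jobId>; fails if the directory no longer exists.
    void put(String jobId, byte[] payload) throws IOException {
        try (FileOutputStream out = new FileOutputStream(dir.resolve(jobId).toFile())) {
            out.write(payload);
        }
    }

    // Deletes the temporary directory recursively (children before parents).
    @Override
    public void close() throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    // Returns true if a write issued after close() fails with FileNotFoundException.
    static boolean writeAfterCloseFails() {
        try {
            TempDirStore store = new TempDirStore();
            store.put("job-a", new byte[] {1}); // succeeds: directory still exists
            store.close();                      // what the shutdown hook would do
            try {
                store.put("job-b", new byte[] {2}); // what a suspended job would do
                return false;
            } catch (FileNotFoundException e) {
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, the second write fails in the same way as in the stack trace 
above, because the parent directory is gone by the time the file is opened.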

This has no functional impact because the JobManager goes away anyway. But the 
log message is confusing, and the shutdown hook is (IMHO) not needed. Instead, 
the {{ExecutionGraphInfoStore}}'s close logic should be called by the 
{{ClusterEntrypoint}} as part of its graceful shutdown.
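
A rough sketch of the proposed ordering, assuming the store cleanup is chained 
after job termination rather than registered as an independent shutdown hook. 
The class name, future name and event strings below are hypothetical and do not 
reflect the actual {{ClusterEntrypoint}} code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration of an ordered shutdown; not Flink's ClusterEntrypoint.
class OrderedShutdownSketch {

    // Returns the shutdown events in the order in which they happened.
    static List<String> run() {
        List<String> events = new ArrayList<>();

        // Completed once all jobs have been suspended and archived.
        CompletableFuture<Void> jobsSuspended = new CompletableFuture<>();

        // The store is closed only after the jobs are done, so no archive
        // write can hit an already deleted directory.
        CompletableFuture<Void> shutdown =
                jobsSuspended.thenRun(() -> events.add("store closed"));

        // Simulated job suspension: the job writes its info to the store first.
        events.add("job archived");
        jobsSuspended.complete(null);

        shutdown.join();
        return events;
    }
}
```

With this ordering, the archive write always precedes the store cleanup, so the 
confusing log message would not be produced.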



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
