Matthias Pohl created FLINK-36194:
-------------------------------------

             Summary: Shutdown hook for ExecutionGraphInfo store runs concurrently to cluster shutdown hook causing race conditions
                 Key: FLINK-36194
                 URL: https://issues.apache.org/jira/browse/FLINK-36194
             Project: Flink
          Issue Type: Technical Debt
          Components: Runtime / Coordination
    Affects Versions: 1.19.1, 1.20.0, 2.0.0
            Reporter: Matthias Pohl

There is a {{FileNotFoundException}} being logged when shutting down the cluster while jobs are still running:
{code}
/tmp/executionGraphStore-b2cb1190-2c4d-4021-a73d-8b15027860df/8f6abf294a46345d331590890f7e7c37 (No such file or directory)
java.io.FileNotFoundException: /tmp/executionGraphStore-b2cb1190-2c4d-4021-a73d-8b15027860df/8f6abf294a46345d331590890f7e7c37 (No such file or directory)
    at java.base/java.io.FileOutputStream.open0(Native Method)
    at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
    at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
    at org.apache.flink.runtime.dispatcher.FileExecutionGraphInfoStore.storeExecutionGraphInfo(FileExecutionGraphInfoStore.java:281)
    at org.apache.flink.runtime.dispatcher.FileExecutionGraphInfoStore.put(FileExecutionGraphInfoStore.java:203)
    at org.apache.flink.runtime.dispatcher.Dispatcher.writeToExecutionGraphInfoStore(Dispatcher.java:1427)
    at org.apache.flink.runtime.dispatcher.Dispatcher.jobReachedTerminalState(Dispatcher.java:1357)
    at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:750)
    at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$6(Dispatcher.java:700)
    at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
    at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
    at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
    [...]
{code}

This is caused by concurrent shutdown logic being triggered through the {{FileExecutionGraphInfoStore}} shutdown hook. The shutdown hook calls close on the store, which deletes its temporary directory. The cluster shutdown, which runs concurrently, tries to suspend all running jobs.
The JobManagerRunners then try to write their {{ExecutionGraphInfo}} to the store, which fails because the temporary folder has already been deleted. This doesn't have any impact because the JobManager goes away anyway. But the log message is confusing, and the shutdown hook is (IMHO) not needed. Instead, the {{ExecutionGraphInfoStore}}'s close logic should be called gracefully as part of the {{ClusterEntrypoint}} shutdown.
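
The ordering problem described above can be reproduced outside of Flink with a minimal sketch. Everything in it is hypothetical and for illustration only: {{TempDirStore}} is a stand-in for a file-backed store that deletes its temporary directory on close, not Flink's {{FileExecutionGraphInfoStore}}, and the two orderings are run sequentially rather than in real concurrent hooks so that the outcome is deterministic. It only assumes that the store writes through {{FileOutputStream}}, as the stack trace suggests.

```java
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class ShutdownOrderSketch {

    /** Hypothetical file-backed store; NOT Flink's FileExecutionGraphInfoStore. */
    static final class TempDirStore implements AutoCloseable {
        private final Path dir;

        TempDirStore() throws IOException {
            this.dir = Files.createTempDirectory("executionGraphStore-");
        }

        /** Writes one entry as a file inside the temporary directory. */
        void put(String key, byte[] value) throws IOException {
            try (OutputStream out = new FileOutputStream(dir.resolve(key).toFile())) {
                out.write(value);
            }
        }

        /** Deletes the temporary directory, children first. */
        @Override
        public void close() throws IOException {
            if (!Files.exists(dir)) {
                return;
            }
            try (Stream<Path> paths = Files.walk(dir)) {
                paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
            }
        }
    }

    /** Problematic order: the store is closed (as a shutdown hook would do) before the final write. */
    static boolean writeAfterCloseFails() throws IOException {
        TempDirStore store = new TempDirStore();
        store.close();
        try {
            store.put("job-1", new byte[] {1});
            return false;
        } catch (FileNotFoundException e) {
            // The parent directory is gone, so the file cannot be opened.
            return true;
        }
    }

    /** Ordered shutdown: the final write happens first, then the store is closed. */
    static boolean writeBeforeCloseSucceeds() throws IOException {
        try (TempDirStore store = new TempDirStore()) {
            store.put("job-1", new byte[] {1});
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("write after close fails: " + writeAfterCloseFails());
        System.out.println("write before close succeeds: " + writeBeforeCloseSucceeds());
    }
}
```

With a separate JVM shutdown hook, which of the two orderings occurs depends on thread timing. Closing the store from a single, ordered shutdown path removes that dependency, which is the direction proposed in this ticket.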