[ https://issues.apache.org/jira/browse/FLINK-36194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-36194:
-----------------------------------
    Labels: pull-request-available starter  (was: starter)

> Shutdown hook for ExecutionGraphInfo store runs concurrently to cluster shutdown hook causing race conditions
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-36194
>                 URL: https://issues.apache.org/jira/browse/FLINK-36194
>             Project: Flink
>          Issue Type: Technical Debt
>          Components: Runtime / Coordination
>    Affects Versions: 2.0.0, 1.20.0, 1.19.1
>            Reporter: Matthias Pohl
>            Assignee: Eaugene Thomas
>            Priority: Minor
>              Labels: pull-request-available, starter
>
> A {{FileNotFoundException}} is logged when shutting down the cluster while jobs are still running:
> {code}
> /tmp/executionGraphStore-b2cb1190-2c4d-4021-a73d-8b15027860df/8f6abf294a46345d331590890f7e7c37 (No such file or directory)
> java.io.FileNotFoundException: /tmp/executionGraphStore-b2cb1190-2c4d-4021-a73d-8b15027860df/8f6abf294a46345d331590890f7e7c37 (No such file or directory)
>     at java.base/java.io.FileOutputStream.open0(Native Method)
>     at java.base/java.io.FileOutputStream.open(FileOutputStream.java:298)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:237)
>     at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:187)
>     at org.apache.flink.runtime.dispatcher.FileExecutionGraphInfoStore.storeExecutionGraphInfo(FileExecutionGraphInfoStore.java:281)
>     at org.apache.flink.runtime.dispatcher.FileExecutionGraphInfoStore.put(FileExecutionGraphInfoStore.java:203)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.writeToExecutionGraphInfoStore(Dispatcher.java:1427)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.jobReachedTerminalState(Dispatcher.java:1357)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:750)
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$6(Dispatcher.java:700)
>     at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>     at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>     at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
> [...]
> {code}
> This is caused by concurrent shutdown logic being triggered through the {{FileExecutionGraphInfoStore}} shutdown hook. The shutdown hook calls close on the store, which deletes its temporary directory.
> The cluster shutdown, which runs concurrently, tries to suspend all running jobs. The JobManagerRunners then try to write their {{ExecutionGraphInfo}} to the store, which fails because the temporary folder has already been deleted.
> This doesn't have any impact because the JobManager goes away anyway. But the log message is confusing, and the shutdown hook is (IMHO) not needed. Instead, the {{ExecutionGraphInfoStore}}'s close logic should be called gracefully as part of the {{ClusterEntrypoint}} shutdown.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
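
For illustration only, here is a minimal Java sketch of the ordering problem described in the issue. It is not Flink code: `TempDirStore` and `ShutdownOrderSketch` are hypothetical stand-ins, and the two orderings are run sequentially rather than from a real JVM shutdown hook, so that the outcome is deterministic. It assumes only that the store writes into a temporary directory and that closing the store deletes that directory, as the issue states.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

final class ShutdownOrderSketch {

    /** Hypothetical file-backed store; a stand-in, not Flink's FileExecutionGraphInfoStore. */
    static final class TempDirStore implements AutoCloseable {
        private final File dir;

        TempDirStore() throws IOException {
            this.dir = Files.createTempDirectory("executionGraphStore-").toFile();
        }

        /** Writes the payload to a file named after the key inside the temp directory. */
        void put(String key, byte[] payload) throws IOException {
            try (FileOutputStream out = new FileOutputStream(new File(dir, key))) {
                out.write(payload);
            }
        }

        /** Deletes the stored files and the temp directory itself. */
        @Override
        public void close() {
            File[] files = dir.listFiles();
            if (files != null) {
                for (File f : files) {
                    f.delete();
                }
            }
            dir.delete();
        }
    }

    /** Problematic order: the store is closed before a job result is written. */
    static boolean writeAfterCloseFails() {
        try {
            TempDirStore store = new TempDirStore();
            store.close(); // what the separate shutdown hook effectively does first
            try {
                store.put("job", new byte[] {1});
                return false;
            } catch (FileNotFoundException e) {
                return true; // parent directory is gone, as in the logged stack trace
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Proposed order: pending writes finish first, then the owner closes the store. */
    static boolean writeBeforeCloseSucceeds() {
        try {
            TempDirStore store = new TempDirStore();
            try {
                store.put("job", new byte[] {1});
                return true;
            } finally {
                store.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("write after close fails: " + writeAfterCloseFails());
        System.out.println("write before close succeeds: " + writeBeforeCloseSucceeds());
    }
}
```

The sketch only shows why a single owner sequencing "suspend jobs, then close the store" avoids the exception; how the actual close call is wired into the {{ClusterEntrypoint}} shutdown is left to the pull request.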