[ https://issues.apache.org/jira/browse/FLINK-22014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450640#comment-17450640 ]
Adrian Vasiliu commented on FLINK-22014:
----------------------------------------

[~trohrmann] OK, thanks. I'll open a new issue with the logs of the job manager.
We reproduced it with Flink 1.13.2 and Flink 1.13.3. We have not yet tried Flink 1.14.

> Flink JobManager failed to restart after failure in kubernetes HA setup
> -----------------------------------------------------------------------
>
>                 Key: FLINK-22014
>                 URL: https://issues.apache.org/jira/browse/FLINK-22014
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.11.3, 1.12.2, 1.13.0
>            Reporter: Mikalai Lushchytski
>            Priority: Major
>              Labels: k8s-ha, pull-request-available
>         Attachments: flink-logs.txt.zip, image-2021-04-19-11-17-58-215.png, scalyr-logs (1).txt
>
>
> After the JobManager pod failed and a new one started, the new pod could not
> recover the jobs because the recovery data was missing from storage: the
> config map pointed at a non-existent file.
>
> As a result, the JobManager pod entered the `CrashLoopBackOff` state and could
> not recover: each attempt failed with the same error, so the whole cluster
> became unrecoverable and stopped operating.
>
> I had to manually delete the config map and start the jobs again without a
> savepoint.
>
> When I tried to reproduce the failure by deleting the JobManager pod manually,
> the new pod recovered cleanly every time, so I could not reproduce the issue
> artificially.
>
> Below is the failure log:
> {code:java}
> 2021-03-26 08:22:57,925 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManagerImpl [] - Starting the SlotManager.
> 2021-03-26 08:22:57,928 INFO org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService [] - Starting DefaultLeaderRetrievalService with KubernetesLeaderRetrievalDriver {configMapName='stellar-flink-cluster-dispatcher-leader'}.
> 2021-03-26 08:22:57,931 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Retrieved job ids [198c46bac791e73ebcc565a550fa4ff6, 344f5ebc1b5c3a566b4b2837813e4940, 96c4603a0822d10884f7fe536703d811, d9ded24224aab7c7041420b3efc1b6ba] from KubernetesStateHandleStore{configMapName='stellar-flink-cluster-dispatcher-leader'}
> 2021-03-26 08:22:57,933 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Trying to recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
> 2021-03-26 08:22:58,029 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
> 2021-03-26 08:28:22,677 INFO org.apache.flink.runtime.jobmanager.DefaultJobGraphStore [] - Stopping DefaultJobGraphStore.
> 2021-03-26 08:28:22,681 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
>     at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
>     at java.lang.Thread.run(Unknown Source) [?:?]
> Caused by: org.apache.flink.util.FlinkRuntimeException: Could not recover job with job id 198c46bac791e73ebcc565a550fa4ff6.
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:144 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> Caused by: org.apache.flink.util.FlinkException: Could not retrieve submitted JobGraph from state handle under jobGraph-198c46bac791e73ebcc565a550fa4ff6. This indicates that the retrieved state handle is broken. Try cleaning the state handle store.
>     at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:171 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> Caused by: java.io.FileNotFoundException: No such file or directory: s3a://XXX-flink-state-eu-central-1-live/recovery/YYY-flink-cluster/submittedJobGraph6797768d0737
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:2255 undefined) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:2149 undefined) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:2088 undefined) ~[?:?]
>     at org.apache.hadoop.fs.s3a.S3AFileSystem.open(S3AFileSystem.java:699 undefined) ~[?:?]
>     at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:950 undefined) ~[?:?]
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:131 undefined) ~[?:?]
>     at org.apache.flink.fs.s3hadoop.common.HadoopFileSystem.open(HadoopFileSystem.java:37 undefined) ~[?:?]
>     at org.apache.flink.core.fs.PluginFileSystemFactory$ClassLoaderFixingFileSystem.open(PluginFileSystemFactory.java:125 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.filesystem.FileStateHandle.openInputStream(FileStateHandle.java:68 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.openInputStream(RetrievableStreamStateHandle.java:66 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:58 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.jobmanager.DefaultJobGraphStore.recoverJobGraph(DefaultJobGraphStore.java:162 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJob(SessionDispatcherLeaderProcess.java:141 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobs(SessionDispatcherLeaderProcess.java:122 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.recoverJobsIfRunning(SessionDispatcherLeaderProcess.java:113 undefined) ~[flink-dist_2.12-1.12.2.jar:1.12.2]
>     ... 4 more
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)