uharaqo created FLINK-28206: ------------------------------- Summary: EOFException on Checkpoint Recovery Key: FLINK-28206 URL: https://issues.apache.org/jira/browse/FLINK-28206 Project: Flink Issue Type: Bug Components: Runtime / Checkpointing Affects Versions: 1.14.4 Reporter: uharaqo
We have only one Job Manager in Kubernetes and it suddenly got killed without any logs. A new Job Manager process could not recover from a checkpoint due to an EOFException. Task Managers killed themselves since they could not find any Job Manager. There were no error logs other than that on the Task Manager side. It looks to me that the checkpoint is corrupted. Is there a way to identify the cause? What would you recommend us to do to mitigate this problem? Here's the logs during the recovery phase. (Removed the stacktrace. Please find that at the bottom.) {noformat} {"timestamp":"2022-06-22T17:21:25.870Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Recovering checkpoints from KubernetesStateHandleStore{configMapName='univex-flink-record-collector-46071c6a64e47d1ce828dfe032f943a6-jobmanager-leader'}."} {"timestamp":"2022-06-22T17:21:25.875Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Found 1 checkpoints in KubernetesStateHandleStore{configMapName='univex-flink-record-collector-46071c6a64e47d1ce828dfe032f943a6-jobmanager-leader'}."} {"timestamp":"2022-06-22T17:21:25.876Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Trying to fetch 1 checkpoints from storage."} {"timestamp":"2022-06-22T17:21:25.876Z","level":"INFO","logger":"org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils","message":"Trying to retrieve checkpoint 58130."} {"timestamp":"2022-06-22T17:21:25.901Z","level":"ERROR","logger":"org.apache.flink.runtime.entrypoint.ClusterEntrypoint","message":"Fatal error occurred in the cluster entrypoint.","level":"INFO","logger":"org.apache.flink.runtime.entrypoint.ClusterEntrypoint","message":"Shutting StandaloneSessionClusterEntrypoint down with application status UNKNOWN. Diagnostics Cluster entrypoint has been closed externally.."} {"timestamp":"2022-06-22T17:21:25.921Z","level":"INFO","logger":"org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint","message":"Shutting down rest endpoint."} {"timestamp":"2022-06-22T17:21:25.922Z","level":"INFO","logger":"org.apache.flink.runtime.blob.BlobServer","message":"Stopped BLOB server at 0.0.0.0:6124"} {noformat} The stacktrace of the ERROR: {noformat} org.apache.flink.util.FlinkException: JobMaster for job 46071c6a64e47d1ce828dfe032f943a6 failed. at org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:913) at org.apache.flink.runtime.dispatcher.Dispatcher.jobManagerRunnerFailed(Dispatcher.java:473) at org.apache.flink.runtime.dispatcher.Dispatcher.handleJobManagerRunnerResult(Dispatcher.java:450) at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:427) at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907) at java.base/java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:455) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:455) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:537) at akka.actor.Actor.aroundReceive$(Actor.scala:535) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) at akka.actor.ActorCell.invoke(ActorCell.scala:548) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) at akka.dispatch.Mailbox.run(Mailbox.scala:231) at akka.dispatch.Mailbox.exec(Mailbox.scala:243) at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) Caused by: org.apache.flink.runtime.client.JobInitializationException: Could not start the JobMaster. at org.apache.flink.runtime.jobmaster.DefaultJobMasterServiceProcess.lambda$new$0(DefaultJobMasterServiceProcess.java:97) at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1705) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.util.concurrent.CompletionException: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Failed to initialize high-availability completed checkpoint store at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:314) at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:319) at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1702) ... 3 more Caused by: java.lang.RuntimeException: org.apache.flink.runtime.client.JobExecutionException: Failed to initialize high-availability completed checkpoint store at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316) at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:114) at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700) ... 3 more Caused by: org.apache.flink.runtime.client.JobExecutionException: Failed to initialize high-availability completed checkpoint store at org.apache.flink.runtime.scheduler.SchedulerUtils.createCompletedCheckpointStoreIfCheckpointingIsEnabled(SchedulerUtils.java:57) at org.apache.flink.runtime.scheduler.SchedulerBase.<init>(SchedulerBase.java:180) at org.apache.flink.runtime.scheduler.DefaultScheduler.<init>(DefaultScheduler.java:140) at org.apache.flink.runtime.scheduler.DefaultSchedulerFactory.createInstance(DefaultSchedulerFactory.java:134) at org.apache.flink.runtime.jobmaster.DefaultSlotPoolServiceSchedulerFactory.createScheduler(DefaultSlotPoolServiceSchedulerFactory.java:110) at org.apache.flink.runtime.jobmaster.JobMaster.createScheduler(JobMaster.java:346) at org.apache.flink.runtime.jobmaster.JobMaster.<init>(JobMaster.java:323) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.internalCreateJobMasterService(DefaultJobMasterServiceFactory.java:106) at org.apache.flink.runtime.jobmaster.factories.DefaultJobMasterServiceFactory.lambda$createJobMasterService$0(DefaultJobMasterServiceFactory.java:94) at org.apache.flink.util.function.FunctionUtils.lambda$uncheckedSupplier$4(FunctionUtils.java:112) ... 4 more Caused by: org.apache.flink.util.FlinkException: Could not retrieve checkpoint 58130 from state handle under checkpointID-0000000000000058130. This indicates that the retrieved state handle is broken. Try cleaning the state handle store. at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStoreUtils.java:111) at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils.retrieveCompletedCheckpoints(DefaultCompletedCheckpointStoreUtils.java:89) at org.apache.flink.kubernetes.utils.KubernetesUtils.createCompletedCheckpointStore(KubernetesUtils.java:314) at org.apache.flink.kubernetes.highavailability.KubernetesCheckpointRecoveryFactory.createRecoveredCompletedCheckpointStore(KubernetesCheckpointRecoveryFactory.java:78) at org.apache.flink.runtime.scheduler.SchedulerUtils.createCompletedCheckpointStore(SchedulerUtils.java:91) at org.apache.flink.runtime.scheduler.SchedulerUtils.createCompletedCheckpointStoreIfCheckpointingIsEnabled(SchedulerUtils.java:54) ... 13 more Caused by: java.io.EOFException at java.base/java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2872) at java.base/java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3367) at java.base/java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:936) at java.base/java.io.ObjectInputStream.<init>(ObjectInputStream.java:379) at org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.<init>(InstantiationUtil.java:68) at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:612) at org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:595) at org.apache.flink.runtime.state.RetrievableStreamStateHandle.retrieveState(RetrievableStreamStateHandle.java:59) at org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStoreUtils.retrieveCompletedCheckpoint(DefaultCompletedCheckpointStoreUtils.java:102) ... 18 more {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007)