Zhenqiu Huang created FLINK-34007:
-------------------------------------

             Summary: Flink Job stuck in suspend state after recovery from failure in HA Mode
                 Key: FLINK-34007
                 URL: https://issues.apache.org/jira/browse/FLINK-34007
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.18.1, 1.18.2
            Reporter: Zhenqiu Huang

The observation is that the JobManager goes into the SUSPENDED state, and a failed container is unable to register itself with the ResourceManager, eventually timing out.

JM Log:

2024-01-04 02:58:39,210 INFO org.apache.flink.runtime.jobmaster.JobMasterServiceLeadershipRunner [] - JobMasterServiceLeadershipRunner for job 217cee964b2cfdc3115fb74cac0ec550 was revoked leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping current JobMasterServiceProcess.
2024-01-04 02:58:58,347 INFO org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] - http://172.16.71.11:8081 lost leadership
2024-01-04 02:58:58,347 INFO org.apache.flink.runtime.resourcemanager.ResourceManagerServiceImpl [] - Resource manager service is revoked leadership with session id eda6fee6-ce02-4076-9a99-8c43a92629f7.
2024-01-04 02:58:58,348 INFO org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner [] - DefaultDispatcherRunner was revoked the leadership with leader id eda6fee6-ce02-4076-9a99-8c43a92629f7. Stopping the DispatcherLeaderProcess.
2024-01-04 02:58:58,348 INFO org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess [] - Stopping SessionDispatcherLeaderProcess.
2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping dispatcher pekko.tcp://flink@172.16.71.11:6123/user/rpc/dispatcher_1.
2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster for job 'amp-ade-fitness-clickstream-projection-uat' (217cee964b2cfdc3115fb74cac0ec550).
2024-01-04 02:58:58,349 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all currently running jobs of dispatcher pekko.tcp://flink@172.16.71.11:6123/user/rpc/dispatcher_1.
2024-01-04 02:58:58,351 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job 217cee964b2cfdc3115fb74cac0ec550 reached terminal state SUSPENDED.
2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopping credential renewal
2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.security.token.DefaultDelegationTokenManager [] - Stopped credential renewal
2024-01-04 02:58:58,352 INFO org.apache.flink.runtime.resourcemanager.slotmanager.FineGrainedSlotManager [] - Closing the slot manager.
2024-01-04 02:58:58,351 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job amp-ade-fitness-clickstream-projection-uat (217cee964b2cfdc3115fb74cac0ec550) switched from state RUNNING to SUSPENDED.
org.apache.flink.util.FlinkException: AdaptiveScheduler is being stopped.
    at org.apache.flink.runtime.scheduler.adaptive.AdaptiveScheduler.closeAsync(AdaptiveScheduler.java:474) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:1093) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:1056) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:454) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:239) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.lambda$terminate$0(PekkoRpcActor.java:574) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor$StartedState.terminate(PekkoRpcActor.java:573) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleControlMessage(PekkoRpcActor.java:196) ~[flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) [flink-rpc-akkadb952114-fa83-4aba-b20a-b7e5771ce59c.jar:1.18.1.6-ase]

TM Error Log:

2024-01-04 11:23:01,334 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Fatal error occurred in TaskExecutor pekko.tcp://flink@172.16.182.165:6122/user/rpc/taskmanager_0.
org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration PT5M. This indicates a p
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1558) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$18(TaskExecutor.java:1543) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.lambda$handleRunAsync$4(PekkoRpcActor.java:451) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.concurrent.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) ~[flink-dist-1.18.1.6-ase.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRunAsync(PekkoRpcActor.java:451) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleRpcMessage(PekkoRpcActor.java:218) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.flink.runtime.rpc.pekko.PekkoRpcActor.handleMessage(PekkoRpcActor.java:168) ~[flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:33) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:29) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:29) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:175) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:176) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.Actor.aroundReceive(Actor.scala:547) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.Actor.aroundReceive$(Actor.scala:545) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.AbstractActor.aroundReceive(AbstractActor.scala:229) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.ActorCell.receiveMessage(ActorCell.scala:590) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.actor.ActorCell.invoke(ActorCell.scala:557) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.dispatch.Mailbox.processMailbox(Mailbox.scala:280) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.dispatch.Mailbox.run(Mailbox.scala:241) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at org.apache.pekko.dispatch.Mailbox.exec(Mailbox.scala:253) [flink-rpc-akkafc46731f-d444-4345-bad9-337cdcb657e4.jar:1.18.1.6-ase]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) [?:?]
    at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) [?:?]
    at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) [?:?]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) [?:?]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) [?:?]

--
This message was sent by Atlassian Jira
(v8.20.10#820010)