[ https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Henrik updated FLINK-12382: --------------------------- Description: I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster and task-manager at the same time, but maintain ZK/zetcd up and running. Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself. I expect a number of retries, and if this exception is part of the protocol signalling to retry, then it should not be printed as a log entry. This might be related to an older bug: [https://jira.apache.org/jira/browse/FLINK-7734] {code:java} [tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error [tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) [tm] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) [tm] at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) [tm] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) [tm] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) [tm] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) [tm] at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) [tm] at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) [tm] at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) [tm] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) [tm] at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] ... 9 more [tm] 2019-05-01 11:32:01,650 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and re-attempting registration in 10000 ms [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 00000000000000000000000000000000. [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692). [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692) switched from RUNNING to FAILED. [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625) [tm] ... 15 more [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692). [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25). [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task - user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25) switched from RUNNING to FAILED. [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625) [tm] ... 15 more {code} tm stands for taskmanager in this deployment. EDIT: this also happens if you just temporarily disable network routing; it never recovers on its own despite having HA configured! In this case, it's the job manager that keeps crashing. {code:java} [job] 2019-05-01 13:03:32,299 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Unhandled exception. [job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917, LocalRpcInvocation(requestMultipleJobDetails(Time))) sent to akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token is null. [job] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) [job] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [job] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) [job] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [job] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [job] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [job] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [job] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [job] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [job] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [job] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [job] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [job] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [job] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [job] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [job] 2019-05-01 13:03:33,308 ERROR org.apache.flink.runtime.rest.handler.job.JobDetailsHandler - Unhandled exception.{code} was: I'm testing zetcd + session jobs in k8s, and testing what happens when I kill both the job-cluster and task-manager at the same time, but maintain ZK/zetcd up and running. Then I get this stacktrace, that's completely non-actionable for me, and also resolves itself. I expect a number of retries, and if this exception is part of the protocol signalling to retry, then it should not be printed as a log entry. This might be related to an older bug: https://jira.apache.org/jira/browse/FLINK-7734 {code:java} [tm] 2019-05-01 11:32:01,641 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error [tm] java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) [tm] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) [tm] at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) [tm] at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) [tm] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) [tm] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) [tm] at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) [tm] at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) [tm] at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) [tm] at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) [tm] at akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, HardwareDescription, Time))) sent to akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing token is null. [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] ... 9 more [tm] 2019-05-01 11:32:01,650 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and re-attempting registration in 10000 ms [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 00000000000000000000000000000000. [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692). [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692) switched from RUNNING to FAILED. [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625) [tm] ... 15 more [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Process -> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692). [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25). [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task - user_sessions -> (Sink: sink_example_sessions, Filter, Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25) switched from RUNNING to FAILED. [tm] org.apache.flink.util.FlinkException: JobManager responsible for 00000000000000000000000000000000 lost the leadership. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136) [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185) [tm] at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) [tm] at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) [tm] at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [tm] at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [tm] at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [tm] at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. [tm] at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625) [tm] ... 15 more {code} tm stands for taskmanager in this deployment. > HA + ResourceManager exception: Fencing token not set > ----------------------------------------------------- > > Key: FLINK-12382 > URL: https://issues.apache.org/jira/browse/FLINK-12382 > Project: Flink > Issue Type: Bug > Components: Runtime / Coordination > Affects Versions: 1.8.0 > Environment: Same all all previous bugs filed by myself, today, but > this time with HA with zetcd. > Reporter: Henrik > Priority: Major > > I'm testing zetcd + session jobs in k8s, and testing what happens when I kill > both the job-cluster and task-manager at the same time, but maintain ZK/zetcd > up and running. > Then I get this stacktrace, that's completely non-actionable for me, and also > resolves itself. I expect a number of retries, and if this exception is part > of the protocol signalling to retry, then it should not be printed as a log > entry. > This might be related to an older bug: > [https://jira.apache.org/jira/browse/FLINK-7734] > {code:java} > [tm] 2019-05-01 11:32:01,641 ERROR > org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration > at ResourceManager failed due to an error > [tm] java.util.concurrent.CompletionException: > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token > not set: Ignoring message > RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, > RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, > HardwareDescription, Time))) sent to > akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing > token is null. > [tm] at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > [tm] at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > [tm] at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593) > [tm] at > java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) > [tm] at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) > [tm] at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) > [tm] at > org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815) > [tm] at akka.dispatch.OnComplete.internal(Future.scala:258) > [tm] at akka.dispatch.OnComplete.internal(Future.scala:256) > [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186) > [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183) > [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36) > [tm] at > org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74) > [tm] at > scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44) > [tm] at > scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252) > [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534) > [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97) > [tm] at > akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982) > [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446) > [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) > [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) > [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > [tm] at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [tm] at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [tm] Caused by: > org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token > not set: Ignoring message > RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, > RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, > HardwareDescription, Time))) sent to > akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing > token is null. > [tm] at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) > [tm] at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > [tm] at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > [tm] ... 9 more > [tm] 2019-05-01 11:32:01,650 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and > re-attempting registration in 10000 ms > [tm] 2019-05-01 11:32:03,070 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat > of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. > [tm] 2019-05-01 11:32:03,070 INFO > org.apache.flink.runtime.taskexecutor.TaskExecutor - Close > JobManager connection for job 00000000000000000000000000000000. > [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskmanager.Task > - Attempting to fail task externally Source: Custom Source > -> Process -> Timestamps/Watermarks -> app_events (1/1) > (a302013f150f292067cd498100dc6692). > [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task > - Source: Custom Source -> Process -> > Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692) > switched from RUNNING to FAILED. > [tm] org.apache.flink.util.FlinkException: JobManager responsible for > 00000000000000000000000000000000 lost the leadership. > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182) > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136) > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) > [tm] at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) > [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) > [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > [tm] at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [tm] at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of > JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625) > [tm] ... 15 more > [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task > - Triggering cancellation of task code Source: Custom > Source -> Process -> Timestamps/Watermarks -> app_events (1/1) > (a302013f150f292067cd498100dc6692). > [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task > - Attempting to fail task externally user_sessions -> > (Sink: sink_example_sessions, Filter, Filter) (1/1) > (dbb8434fb24a04b8890520d4e59bbd25). > [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task > - user_sessions -> (Sink: sink_example_sessions, Filter, > Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25) switched from RUNNING to > FAILED. > [tm] org.apache.flink.util.FlinkException: JobManager responsible for > 00000000000000000000000000000000 lost the leadership. > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182) > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136) > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185) > [tm] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) > [tm] at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495) > [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224) > [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > [tm] at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [tm] at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [tm] at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of > JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out. > [tm] at > org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625) > [tm] ... 15 more > {code} > tm stands for taskmanager in this deployment. > EDIT: this also happens if you just temporarily disable network routing; it > never recovers on its own despite having HA configured! In this case, it's > the job manager that keeps crashing. > {code:java} > [job] 2019-05-01 13:03:32,299 ERROR > org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Unhandled > exception. > [job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing > token not set: Ignoring message > LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917, > LocalRpcInvocation(requestMultipleJobDetails(Time))) sent to > akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token > is null. > [job] at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63) > [job] at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147) > [job] at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40) > [job] at > akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165) > [job] at akka.actor.Actor$class.aroundReceive(Actor.scala:502) > [job] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95) > [job] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526) > [job] at akka.actor.ActorCell.invoke(ActorCell.scala:495) > [job] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257) > [job] at akka.dispatch.Mailbox.run(Mailbox.scala:224) > [job] at akka.dispatch.Mailbox.exec(Mailbox.scala:234) > [job] at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > [job] at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > [job] at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > [job] at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > [job] 2019-05-01 13:03:33,308 ERROR > org.apache.flink.runtime.rest.handler.job.JobDetailsHandler - Unhandled > exception.{code} > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)