[ https://issues.apache.org/jira/browse/FLINK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
An updated FLINK-18367:
-----------------------
    Description: 
The issue is similar to https://issues.apache.org/jira/browse/FLINK-12382

I'm testing zetcd + session jobs in k8s, with 2 job managers and 2 task managers. Everything works fine, but after I delete the pod running the leading job manager, the task managers cannot always register themselves at the new leader. The following exception occurs:
{code:java}
2020-06-18 13:02:43,555 [Thread=flink-akka.actor.default-dispatcher-3] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error
java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(bcb7d4652fe53a2f8997dc8c87d641a7, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@poc-ha-walle-flink-jobmanager:50010/user/resourcemanager because the fencing token is null.
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
{code}
The task managers are notified that the leader has changed, but it seems the RpcEndpoint cannot refresh its fencing token for some reason.

Attached is the full log from the task manager pod.

  was:
The issue is similar to https://issues.apache.org/jira/browse/FLINK-12382

I'm testing zetcd + session jobs in k8s, with 2 job managers and 2 task managers. Everything works fine, but after I delete the pod running the leading job manager, the task managers cannot always register themselves at the new leader.
The following exception occurs:
´2020-06-18 13:02:43,555 [Thread=flink-akka.actor.default-dispatcher-3] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error
java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(bcb7d4652fe53a2f8997dc8c87d641a7, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@poc-ha-walle-flink-jobmanager:50010/user/resourcemanager because the fencing token is null.
	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
´
The task managers are notified that the leader has changed, but it seems the RpcEndpoint cannot refresh its fencing token for some reason.

Attached is the full log from the task manager pod.


> Flink HA Mode in Kubernetes. Fencing token not set
> --------------------------------------------------
>
>                 Key: FLINK-18367
>                 URL: https://issues.apache.org/jira/browse/FLINK-18367
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1
>            Reporter: An
>            Priority: Critical
>         Attachments: taskmanager.log
>
>
> The issue is similar to https://issues.apache.org/jira/browse/FLINK-12382
> I'm testing zetcd + session jobs in k8s, with 2 job managers and 2 task
> managers. Everything works fine, but after I delete the pod running the
> leading job manager, the task managers cannot always register themselves
> at the new leader.
> The following exception occurs:
> {code:java}
> 2020-06-18 13:02:43,555 [Thread=flink-akka.actor.default-dispatcher-3] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error
> java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(bcb7d4652fe53a2f8997dc8c87d641a7, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@poc-ha-walle-flink-jobmanager:50010/user/resourcemanager because the fencing token is null.
> 	at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 	at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 	at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> {code}
> The task managers are notified that the leader has changed, but it seems
> the RpcEndpoint cannot refresh its fencing token for some reason.
>
> Attached is the full log from the task manager pod.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
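
Note on the error above: the message says the receiving endpoint (the resource manager) rejected the registration because its own fencing token was null, not because the task manager sent a wrong token. The following is a minimal sketch of that check under a simplified model. It is not Flink's actual RPC code; `FencingSketch`, `FencedEndpoint`, `grantLeadership`, `revokeLeadership` and `handle` are hypothetical names chosen for illustration.

```java
import java.util.Objects;
import java.util.UUID;

/** Hypothetical, simplified model of a fenced RPC endpoint; not Flink's real classes. */
final class FencingSketch {

    /** Raised when a message is rejected by the fencing check (unchecked for brevity). */
    static class FencingTokenException extends RuntimeException {
        FencingTokenException(String message) {
            super(message);
        }
    }

    /** Endpoint that only accepts messages carrying its current fencing token. */
    static class FencedEndpoint {
        // Stays null until this endpoint has been granted leadership.
        private UUID fencingToken;

        void grantLeadership(UUID newToken) {
            this.fencingToken = Objects.requireNonNull(newToken);
        }

        void revokeLeadership() {
            this.fencingToken = null;
        }

        String handle(UUID messageToken, String payload) {
            if (fencingToken == null) {
                // Corresponds to the case reported in this issue.
                throw new FencingTokenException(
                        "Fencing token not set: ignoring " + payload);
            }
            if (!fencingToken.equals(messageToken)) {
                throw new FencingTokenException(
                        "Fencing token mismatch: ignoring " + payload);
            }
            return "accepted " + payload;
        }
    }

    public static void main(String[] args) {
        FencedEndpoint resourceManager = new FencedEndpoint();
        UUID leaderSessionId = UUID.randomUUID();

        // Registration arrives before the endpoint has set its own token.
        try {
            resourceManager.handle(leaderSessionId, "registerTaskExecutor");
        } catch (FencingTokenException e) {
            System.out.println(e.getMessage());
        }

        // Once leadership is confirmed locally, the same message is accepted.
        resourceManager.grantLeadership(leaderSessionId);
        System.out.println(resourceManager.handle(leaderSessionId, "registerTaskExecutor"));
    }
}
```

In this model, a registration that reaches the new leader before `grantLeadership` has run is rejected even though the sender already knows the new token. Whether that ordering is what happens in the reported setup is an assumption that the attached taskmanager.log would need to confirm.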