[ 
https://issues.apache.org/jira/browse/FLINK-18367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

An updated FLINK-18367:
-----------------------
    Description: 
The issue is similar to https://issues.apache.org/jira/browse/FLINK-12382

I'm testing zetcd + session jobs in k8s, with 2 job managers and 2 
task managers. Everything works fine, but after I delete the pod running the 
job manager leader, the task managers cannot always register themselves at the 
new leader. The following exception occurs:
{code:java}
2020-06-18 13:02:43,555 [Thread=flink-akka.actor.default-dispatcher-3] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error
 java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(bcb7d4652fe53a2f8997dc8c87d641a7, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@poc-ha-walle-flink-jobmanager:50010/user/resourcemanager because the fencing token is null.
 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 {code}
The task managers receive the notification that the leader has changed, but it 
seems the RpcEndpoint can't refresh the fencing token for some reason.
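
For readers unfamiliar with fencing tokens, here is a minimal sketch of the kind of check that produces an error like the one above. It is an illustration only, not Flink's actual implementation: the names FencingSketch, FencedEndpoint, FencingException, grantLeadership, revokeLeadership and handle are hypothetical. It assumes the receiving endpoint holds no token until it has been granted leadership and rejects every message while the token is null, which matches the wording "because the fencing token is null" in the log.

```java
import java.util.Objects;
import java.util.UUID;

// Hypothetical, simplified model of a fenced RPC endpoint (not Flink's classes).
class FencingSketch {

    /** Raised when a message is rejected because of its fencing token. */
    static class FencingException extends Exception {
        FencingException(String message) {
            super(message);
        }
    }

    /** Accepts a message only if it carries the endpoint's current token. */
    static class FencedEndpoint {
        // Stays null until this endpoint has been granted leadership.
        private volatile UUID fencingToken;

        void grantLeadership(UUID newToken) {
            this.fencingToken = Objects.requireNonNull(newToken);
        }

        void revokeLeadership() {
            this.fencingToken = null;
        }

        String handle(UUID senderToken, String payload) throws FencingException {
            UUID expected = fencingToken;
            if (expected == null) {
                // The receiver is not (or not yet) the leader.
                throw new FencingException("Fencing token not set: ignoring " + payload);
            }
            if (!expected.equals(senderToken)) {
                // The sender still addresses an older leader session.
                throw new FencingException("Fencing token mismatch: ignoring " + payload);
            }
            return "accepted " + payload;
        }
    }

    public static void main(String[] args) throws Exception {
        FencedEndpoint resourceManager = new FencedEndpoint();
        UUID leaderSession = UUID.randomUUID();

        try {
            // Arrives before leadership was granted, so it is rejected.
            resourceManager.handle(leaderSession, "registerTaskExecutor");
        } catch (FencingException e) {
            System.out.println(e.getMessage());
        }

        resourceManager.grantLeadership(leaderSession);
        System.out.println(resourceManager.handle(leaderSession, "registerTaskExecutor"));
    }
}
```

Under this model, a registration request is rejected whenever it reaches an endpoint that currently holds no token, regardless of which token the sender attached.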

 

The full log from the task manager pod is attached.

  was:
The issue is similar to https://issues.apache.org/jira/browse/FLINK-12382

I'm testing zetcd + session jobs in k8s, with 2 job managers and 2 
task managers. Everything works fine, but after I delete the pod running the 
job manager leader, the task managers cannot always register themselves at the 
new leader. The following exception occurs:

´2020-06-18 13:02:43,555 [Thread=flink-akka.actor.default-dispatcher-3] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error
java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(bcb7d4652fe53a2f8997dc8c87d641a7, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@poc-ha-walle-flink-jobmanager:50010/user/resourcemanager because the fencing token is null.
 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
 at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
 at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
 at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
 at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
 at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
 ´

The task managers receive the notification that the leader has changed, but it 
seems the RpcEndpoint can't refresh the fencing token for some reason.

 

The full log from the task manager pod is attached.


> Flink HA Mode in Kubernetes. Fencing token not set
> --------------------------------------------------
>
>                 Key: FLINK-18367
>                 URL: https://issues.apache.org/jira/browse/FLINK-18367
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.1
>            Reporter: An
>            Priority: Critical
>         Attachments: taskmanager.log
>
>
> The issue is similar to https://issues.apache.org/jira/browse/FLINK-12382
> I'm testing zetcd + session jobs in k8s, with 2 job managers and 2 
> task managers. Everything works fine, but after I delete the pod running the 
> job manager leader, the task managers cannot always register themselves at 
> the new leader. The following exception occurs:
> {code:java}
> 2020-06-18 13:02:43,555 [Thread=flink-akka.actor.default-dispatcher-3] ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager failed due to an error
>  java.util.concurrent.CompletionException: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message RemoteFencedMessage(bcb7d4652fe53a2f8997dc8c87d641a7, RemoteRpcInvocation(registerTaskExecutor(TaskExecutorRegistration, Time))) sent to akka.tcp://flink@poc-ha-walle-flink-jobmanager:50010/user/resourcemanager because the fencing token is null.
>  at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
>  at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
>  at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  {code}
> The task managers receive the notification that the leader has changed, but 
> it seems the RpcEndpoint can't refresh the fencing token for some reason.
>  
> The full log from the task manager pod is attached.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
