[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thomas Wozniakowski updated FLINK-10475:
----------------------------------------
    Description: 
Hey Guys,

Just testing the new bugfix release, 1.5.4. I'm happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/ZooKeeper HA 
mode, but now I'm seeing a different issue.

It looks like HA failover is never triggered. I set up a 3/3/3 cluster of 
ZooKeeper nodes, JobManagers and TaskManagers and started my job; everything 
ran fine on the new version. I then killed the leading JobManager to test the 
failover.

The remaining JobManagers never triggered a new leader election and simply got 
stuck.

Please give me a shout if I can provide any more useful information.

EDIT

JobManager logs are attached below. They show that I brought up a fresh cluster 
and one JM was elected leader (no TaskManagers or actual jobs in this case). I 
then let the cluster sit for half an hour or so before killing the leader. The 
log files were snapshotted roughly half an hour after that, and they show that 
a second election was never triggered.

In case it's useful, our ZooKeeper quorum is running "3.5.4-beta". This setup 
previously worked with Flink 1.4.3.
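
As far as I understand, Flink's ZooKeeper leader election is built on Curator's LeaderLatch recipe, so a minimal standalone test along the lines below (the connect string and latch path are placeholders, and this is only a sketch, not Flink's actual code) could help check whether plain Curator elections work against a 3.5.4-beta quorum at all:

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchSmokeTest {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string -- point this at the quorum under test.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Use a throwaway path so we don't touch Flink's own znodes under /flink.
        LeaderLatch latch = new LeaderLatch(client, "/leaderlatch-smoke-test");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("Acquired leadership");
            }

            @Override
            public void notLeader() {
                System.out.println("Lost leadership");
            }
        });
        latch.start();

        // Run a few copies of this, kill the one that acquired leadership,
        // and check whether another copy reports "Acquired leadership".
        Thread.sleep(Long.MAX_VALUE);
    }
}
{code}

If the surviving copies of a test like this also never take over after the leading process is killed, that would point at the Curator/ZooKeeper 3.5.4-beta combination rather than at Flink itself.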

  was:
Hey Guys,

Just testing the new bugfix release of 1.5.4. Happy to see that the issue of 
jobgraphs hanging around forever has been resolved in standalone/zookeeper HA 
mode, but now I'm seeing a different issue.

It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of 
zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new 
version. I then proceeded to kill the leading jobmanager to test the failover.

The remaining jobmanagers never triggered a leader election, and simply got 
stuck.
The logs of the remaining job managers were full of this:

{quote}
2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Could not retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@10.1.3.118:50010/user/dispatcher#-1286445443]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
        at akka.dispatch.OnComplete.internal(Future.scala:258)
        at akka.dispatch.OnComplete.internal(Future.scala:256)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
        at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
        at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
        at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
        at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
        at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
        at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
        at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
        at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
        at java.lang.Thread.run(Thread.java:745)
{quote}

Please give me a shout if I can provide any more useful information

Jobmanager logs attached below. You can see that I brought up a fresh cluster, 
one JM was elected leader (no taskmanagers or actual jobs in this case). I then 
let the cluster sit there for half an hour or so, before killing the leader. 
The log files are snapshotted maybe half an hour after that. You can see that a 
second election was never triggered.

In case it's useful, our zookeeper quorum is running "3.5.4-beta". This setup 
previously worked with 1.4.3. 


> Standalone HA - Leader election is not triggered on loss of leader
> ------------------------------------------------------------------
>
>                 Key: FLINK-10475
>                 URL: https://issues.apache.org/jira/browse/FLINK-10475
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.5.4
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>         Attachments: t1.log, t2.log, t3.log



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
