[ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16638747#comment-16638747 ]
Thomas Wozniakowski commented on FLINK-10475:
---------------------------------------------

Sure - I can update the docs. I'll say that it's recommended to use *3.5.4-beta* or *3.4.13*. Sound reasonable?

> Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-10475
>                 URL: https://issues.apache.org/jira/browse/FLINK-10475
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.5.4
>            Reporter: Thomas Wozniakowski
>            Priority: Minor
>         Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1).
> Happy to see that the issue of jobgraphs hanging around forever has been
> resolved in standalone/zookeeper HA mode, but now I'm seeing a different
> issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of
> zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new
> version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got
> stuck.
> Please give me a shout if I can provide any more useful information.
> EDIT
> Jobmanager logs attached below. You can see that I brought up a fresh
> cluster, one JM was elected leader (no taskmanagers or actual jobs in this
> case). I then let the cluster sit there for half an hour or so, before
> killing the leader. The log files are snapshotted maybe half an hour after
> that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup
> previously worked with 1.4.3.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)