On a second iteration, the whole problem seems to stem from the fact that we revoke leadership from the JM as soon as the notLeader method is called, before waiting for a new leader to be elected. Ideally we would wait until isLeader is called again and check who the previous leader was, but I can see how that might lead to split-brain scenarios if the previous leader loses its connection to ZK while still maintaining its connections to the TMs. A rough sketch of what I mean is right below, and there is a short PS on Stephan's reconnect question at the very bottom, after the quoted thread.
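Something along these lines is what I have in mind (a completely untested sketch on top of Curator's ConnectionStateListener; the revokeLeadership callback is a made-up placeholder for whatever the JobManager side would actually do, and the session timeout would come from the ZK client config). It would be registered via client.getConnectionStateListenable().addListener(...):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Sketch: instead of revoking leadership the moment the connection is
// SUSPENDED, give the client until the session timeout to reconnect and
// only revoke on LOST or when the grace period expires.
public class DeferredRevocationListener implements ConnectionStateListener {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long sessionTimeoutMs;
    private final Runnable revokeLeadership; // placeholder for the JM-side callback

    private ScheduledFuture<?> pendingRevocation;

    public DeferredRevocationListener(long sessionTimeoutMs, Runnable revokeLeadership) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.revokeLeadership = revokeLeadership;
    }

    @Override
    public synchronized void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // The session may still be alive; schedule the revocation
                // instead of running it immediately.
                pendingRevocation = scheduler.schedule(
                        revokeLeadership, sessionTimeoutMs, TimeUnit.MILLISECONDS);
                break;
            case RECONNECTED:
                // Reconnected within the session timeout: keep leadership.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                }
                break;
            case LOST:
                // The session is gone; another contender may already lead.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                }
                revokeLeadership.run();
                break;
            default:
                break;
        }
    }
}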
Gyula

On Tue, Sep 26, 2017 at 18:34, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi,
>
> I did some experimenting and found something that is interesting and looks
> off.
>
> The only problem is when the ZK leader is restarted; it is not related to
> any retry/reconnect logic (and not affected by the timeout setting).
> I think the following is happening (based on the logs:
> https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):
>
> 1. The connection is suspended and the notLeader method is called -> it
> revokes leadership without checking anything and kills the jobs.
> 2. The client reconnects, and isLeader and confirmLeaderSessionID are
> called (before nodeChanged) -> the old confirmed session id in ZK is
> overwritten with the new one before it is checked, making recovery in
> nodeChanged impossible.
>
> I am probably not completely aware of the subtleties of this problem, but
> it seems to me that we should not immediately revoke leadership and fail
> jobs on SUSPENDED, and it would also be nice if nodeChanged were called
> before confirmLeaderSessionID.
>
> Could someone with more experience please take a look as well?
>
> Thanks!
> Gyula
>
> On Mon, Sep 25, 2017 at 16:43, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Curator seems to auto-reconnect anyway; the problem might be that a new
>> leader is elected before the old JM can reconnect. We will try to
>> experiment with this tomorrow to see if increasing the timeouts does any
>> good.
>>
>> Gyula
>>
>> On Mon, Sep 25, 2017 at 15:39, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> I will try to check what Stephan suggested and get back to you!
>>>
>>> Thanks for the feedback.
>>>
>>> Gyula
>>>
>>> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> I think the question is whether the connection should be lost in the
>>>> case of a rolling ZK update.
>>>>
>>>> There should always be a quorum online, so Curator should always be
>>>> able to connect. So there is no need to revoke leadership.
>>>>
>>>> @gyula - can you check whether there is an option in Curator to
>>>> reconnect to another quorum peer if one goes down?
>>>>
>>>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Gyula,
>>>>>
>>>>> Flink internally uses the Curator LeaderLatch recipe for leader
>>>>> election. The LeaderLatch revokes the leadership of a contender in
>>>>> case of a SUSPENDED or LOST connection to the ZooKeeper quorum. The
>>>>> assumption here is that if you cannot talk to ZooKeeper, you can no
>>>>> longer be sure that you are the leader.
>>>>>
>>>>> Consequently, if you do a rolling update of your ZooKeeper cluster
>>>>> which causes client connections to be lost or suspended, it will
>>>>> trigger a restart of the Flink job upon reacquiring the leadership.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We are using 1.3.2.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Which release are you using?
>>>>>>>
>>>>>>> Flink 1.3.2 uses Curator 2.12.0, which solves some leader election
>>>>>>> issues.
>>>>>>>
>>>>>>> Mind giving 1.3.2 a try?
>>>>>>>
>>>>>>> On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have observed that when some nodes of the ZK cluster are
>>>>>>>> restarted (for a rolling restart), the Flink streaming jobs fail
>>>>>>>> (and restart).
>>>>>>>>
>>>>>>>> Log excerpt:
>>>>>>>>
>>>>>>>> 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn
>>>>>>>>   - Unable to read additional data from server sessionid
>>>>>>>> 0x15cba6e1a239774, likely server has closed socket, closing socket
>>>>>>>> connection and attempting reconnect
>>>>>>>> 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager
>>>>>>>>   - State change: SUSPENDED
>>>>>>>> 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>   - Connection to ZooKeeper suspended. The contender
>>>>>>>> akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no
>>>>>>>> longer participates in the leader election.
>>>>>>>> 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>   - Connection to ZooKeeper suspended. Can no longer retrieve the
>>>>>>>> leader from ZooKeeper.
>>>>>>>> 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>   - Connection to ZooKeeper suspended. Can no longer retrieve the
>>>>>>>> leader from ZooKeeper.
>>>>>>>> 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore
>>>>>>>>   - ZooKeeper connection SUSPENDED. Changes to the submitted job
>>>>>>>> graphs are not monitored (temporarily).
>>>>>>>> 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager
>>>>>>>>   - JobManager akka://flink/user/jobmanager#-317276879 was revoked
>>>>>>>> leadership.
>>>>>>>> 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>   - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched
>>>>>>>> from state RUNNING to SUSPENDED.
>>>>>>>> java.lang.Exception: JobManager is no longer the leader.
>>>>>>>>
>>>>>>>> Is this the expected behaviour?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gyula
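PS @Stephan: as far as I can tell there is no separate Curator option needed for this; the ZooKeeper client already rotates over every host in the connect string when it reconnects, so as long as our quorum config lists all peers, Curator should pick a surviving one, and the retry policy only controls how persistently it tries. Roughly like this (illustrative host names and timeout values; plain Curator package names here, in Flink they are shaded under org.apache.flink.shaded):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class QuorumReconnectExample {
    public static void main(String[] args) {
        // Listing all quorum peers lets the client fail over to a live
        // server when the one it is connected to is restarted.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", // all quorum peers
                60_000,                       // session timeout (ms), illustrative
                15_000,                       // connection timeout (ms), illustrative
                new ExponentialBackoffRetry(1000, 10));
        client.start();
    }
}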