Curator seems to auto-reconnect anyway; the problem might be that a new leader is elected before the old JM can reconnect. We will experiment with this tomorrow to see whether increasing the timeouts does any good.
Gyula

Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, Sep 25, 2017, 15:39):

> I will try to check what Stephan suggested and get back to you!
>
> Thanks for the feedback
>
> Gyula
>
> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>
>> I think the question is whether the connection should be lost in the case
>> of a rolling ZK update.
>>
>> There should always be a quorum online, so Curator should always be able
>> to connect. So there is no need to revoke leadership.
>>
>> @gyula - can you check whether there is an option in Curator to reconnect
>> to another quorum peer if one goes down?
>>
>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>> > Hi Gyula,
>> >
>> > Flink uses internally the Curator LeaderLatch recipe to do leader
>> > election. The LeaderLatch will revoke the leadership of a contender in
>> > case of a SUSPENDED or LOST connection to the ZooKeeper quorum. The
>> > assumption here is that if you cannot talk to ZooKeeper, then we can no
>> > longer be sure that you are the leader.
>> >
>> > Consequently, if you do a rolling update of your ZooKeeper cluster which
>> > causes client connections to be lost or suspended, then it will trigger
>> > a restart of the Flink job upon reacquiring the leadership again.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com>
>> > wrote:
>> >
>> > > We are using 1.3.2
>> > >
>> > > Gyula
>> > >
>> > > On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>> > >
>> > > > Which release are you using ?
>> > > >
>> > > > Flink 1.3.2 uses Curator 2.12.0 which solves some leader election
>> > > > issues.
>> > > >
>> > > > Mind giving 1.3.2 a try ?
>> > > >
>> > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > We have observed that in case some nodes of the ZK cluster are
>> > > > > restarted (for a rolling restart) the Flink Streaming jobs fail
>> > > > > (and restart).
>> > > > >
>> > > > > Log excerpt:
>> > > > >
>> > > > > 2017-09-22 12:54:41,426 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
>> > > > > 2017-09-22 12:54:41,527 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>> > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
>> > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>> > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>> > > > > 2017-09-22 12:54:41,530 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
>> > > > > 2017-09-22 12:54:41,530 INFO org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
>> > > > > 2017-09-22 12:54:41,532 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
>> > > > > java.lang.Exception: JobManager is no longer the leader.
>> > > > >
>> > > > > Is this the expected behaviour?
>> > > > >
>> > > > > Thanks,
>> > > > > Gyula
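
For readers following the thread, the behaviour Till describes (leadership is revoked on a SUSPENDED or LOST connection and has to be reacquired afterwards) can be sketched as a tiny state machine. This is only a simplified model, not Curator's or Flink's actual code: the class LeaderLatchModel and its methods are hypothetical names made up for illustration, and it assumes a single contender that wins the election again as soon as the connection comes back, which is not guaranteed in a real cluster.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of the behaviour described in the thread.
// It does not use Curator; the state names mirror the ones seen in the log.
final class LeaderLatchModel {
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private boolean leader;
    private final List<String> events = new ArrayList<>();

    // React to a connection state change the way the thread describes:
    // SUSPENDED or LOST revokes leadership, a (re)connection lets the
    // contender take part in the election again.
    void onStateChanged(State newState) {
        switch (newState) {
            case CONNECTED:
            case RECONNECTED:
                // Assumption: this is the only contender, so it wins immediately.
                if (!leader) {
                    leader = true;
                    events.add("granted");
                }
                break;
            case SUSPENDED:
            case LOST:
                // Without a ZooKeeper connection the contender cannot be sure
                // it is still the leader, so leadership is given up.
                if (leader) {
                    leader = false;
                    events.add("revoked");
                }
                break;
        }
    }

    boolean hasLeadership() {
        return leader;
    }

    List<String> events() {
        return Collections.unmodifiableList(events);
    }

    public static void main(String[] args) {
        // One ZooKeeper node restarting during a rolling update, as seen by a client.
        LeaderLatchModel jobManager = new LeaderLatchModel();
        jobManager.onStateChanged(State.CONNECTED);
        jobManager.onStateChanged(State.SUSPENDED);
        jobManager.onStateChanged(State.RECONNECTED);
        System.out.println(jobManager.events()); // prints [granted, revoked, granted]
    }
}
```

In this model every revoked/granted pair corresponds to one job restart in the log above, which is why even a short SUSPENDED period during a rolling ZooKeeper restart is enough to restart the job.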