Re: Zookeeper failure handling

Gyula Fóra Mon, 25 Sep 2017 07:43:57 -0700

Curator seems to reconnect automatically anyway; the problem might be that a
new leader is elected before the old JM can reconnect. We will experiment
with this tomorrow to see whether increasing the timeouts does any good.


Gyula

Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, Sep 25, 2017, 15:39):

> I will try to check what Stephan suggested and get back to you!
>
> Thanks for the feedback
>
> Gyula
>
> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>
>> I think the question is whether the connection should be lost in the case
>> of a rolling ZK update.
>>
>> There should always be a quorum online, so Curator should always be able
>> to connect. So there is no need to revoke leadership.
>>
>> @gyula - can you check whether there is an option in Curator to reconnect
>> to another quorum peer if one goes down?
>>
>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>> > Hi Gyula,
>> >
>> > Flink internally uses the Curator LeaderLatch recipe for leader election.
>> > The LeaderLatch will revoke the leadership of a contender in case of a
>> > SUSPENDED or LOST connection to the ZooKeeper quorum. The assumption here
>> > is that if you cannot talk to ZooKeeper, then we can no longer be sure
>> > that you are the leader.
>> >
>> > Consequently, if you do a rolling update of your ZooKeeper cluster which
>> > causes client connections to be lost or suspended, then it will trigger a
>> > restart of the Flink job once leadership is reacquired.
>> >
>> > Cheers,
>> > Till
>> >
>> > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com>
>> > wrote:
>> >
>> > > We are using 1.3.2
>> > >
>> > > Gyula
>> > >
>> > > On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>> > >
>> > > > Which release are you using?
>> > > >
>> > > > Flink 1.3.2 uses Curator 2.12.0, which solves some leader election
>> > > > issues.
>> > > >
>> > > > Mind giving 1.3.2 a try?
>> > > >
>> > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com>
>> > > > wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > We have observed that when some nodes of the ZK cluster are
>> > > > > restarted (as part of a rolling restart), the Flink streaming jobs
>> > > > > fail (and restart).
>> > > > >
>> > > > > Log excerpt:
>> > > > >
>> > > > > 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
>> > > > > 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>> > > > > 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
>> > > > > 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
>> > > > > 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
>> > > > > java.lang.Exception: JobManager is no longer the leader.
>> > > > >
>> > > > >
>> > > > > Is this the expected behaviour?
>> > > > >
>> > > > > Thanks,
>> > > > > Gyula
>> > > > >
>> > > >
>> > >
>> >
>>
>
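
For readers following the thread: the behaviour Till describes in the quoted reply above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink's or Curator's actual code. The class LeaderLatchSketch, the enum ZkConnectionState, and all method names are hypothetical; only the state names mirror values of Curator's ConnectionState. The sketch shows why a brief SUSPENDED state during a rolling ZooKeeper restart is enough to revoke leadership, and why reconnecting alone does not give it back.

```java
import java.util.ArrayList;
import java.util.List;

// Connection states, named after values of Curator's ConnectionState enum.
// The enum itself is a hypothetical stand-in so the sketch is self-contained.
enum ZkConnectionState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical toy model of a leader-election contender. This is NOT
// Curator's LeaderLatch; it only mirrors the behaviour described in the thread.
final class LeaderLatchSketch {
    private boolean leader;
    private final List<String> events = new ArrayList<>();

    // Called when the election picks this contender.
    void grantLeadership() {
        leader = true;
        events.add("granted");
    }

    // SUSPENDED or LOST revokes leadership, because a contender that cannot
    // talk to ZooKeeper can no longer be sure it is still the leader.
    void onStateChange(ZkConnectionState state) {
        switch (state) {
            case SUSPENDED:
            case LOST:
                if (leader) {
                    leader = false;
                    events.add("revoked");
                }
                break;
            case RECONNECTED:
                // Reconnecting does not restore leadership by itself; the
                // contender has to win the election again.
                events.add("rejoined-election");
                break;
            default:
                break;
        }
    }

    boolean isLeader() { return leader; }

    List<String> events() { return events; }

    public static void main(String[] args) {
        LeaderLatchSketch jobManager = new LeaderLatchSketch();
        jobManager.grantLeadership();
        // A rolling ZK restart drops the connection to one quorum peer.
        jobManager.onStateChange(ZkConnectionState.SUSPENDED);
        jobManager.onStateChange(ZkConnectionState.RECONNECTED);
        // Winning the election again is the point at which the job restarts.
        jobManager.grantLeadership();
        System.out.println(jobManager.events());
    }
}
```

In this model the revocation happens on SUSPENDED already, before the session has actually expired, which matches the ordering of the WARN and INFO lines in the log excerpt.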
