Re: Zookeeper failure handling

Gyula Fóra Tue, 26 Sep 2017 09:34:59 -0700

Hi,

I did some experimenting and found something that is interesting and looks
off.


The problem only occurs when the ZK leader is restarted. It is not related to
any retry/reconnect logic and is not affected by the timeout setting.
I think the following is happening (based on the logs at
https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):

1. The connection is suspended and the notLeader method is called -> this
revokes leadership without checking anything and kills the jobs.
2. The client reconnects, and the isLeader and confirmLeaderSessionID methods
are called (before nodeChanged) -> this overwrites the old confirmed session ID
in ZK with the new one before checking (making recovery impossible in
nodeChanged).

I am probably not aware of all the subtleties of this problem, but it
seems to me that we should not immediately revoke leadership and fail jobs
on SUSPENDED. It would also be nice if nodeChanged were called
before confirmLeaderSessionID.
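
To make the suggestion above concrete, here is a minimal sketch in plain Java.
It is a toy model, not Flink or Curator code: the class LeadershipGuard, the
graceMillis parameter and the tick method are hypothetical names made up for
illustration, and the enum only borrows the names of Curator's ConnectionState
values. The idea is to keep leadership while the connection is merely
SUSPENDED and to revoke it only on LOST or once a grace period has passed.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: tolerate SUSPENDED for a grace period, revoke on LOST or timeout. */
final class LeadershipGuard {
    /** Local enum, named after Curator's ConnectionState values (assumption: READ_ONLY is ignored). */
    enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final long graceMillis;            // hypothetical knob, not a Flink option
    private final List<String> events = new ArrayList<>();
    private boolean leader = true;             // assume we start as the confirmed leader
    private long suspendedAt = -1L;            // -1 means "not currently suspended"

    LeadershipGuard(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    /** Reacts to a connection state change observed at time nowMillis. */
    void onStateChange(ConnState state, long nowMillis) {
        switch (state) {
            case SUSPENDED:
                // Remember when the suspension started, but do not revoke yet.
                if (suspendedAt < 0) {
                    suspendedAt = nowMillis;
                }
                events.add("SUSPENDED: keeping leadership, jobs keep running");
                break;
            case CONNECTED:
            case RECONNECTED:
                // Clear the suspension; a real implementation would re-check ZK here.
                suspendedAt = -1L;
                events.add(state + ": re-validate leadership before confirming");
                break;
            case LOST:
                revoke("connection LOST");
                break;
        }
    }

    /** Called periodically; revokes leadership if the suspension outlasted the grace period. */
    void tick(long nowMillis) {
        if (leader && suspendedAt >= 0 && nowMillis - suspendedAt >= graceMillis) {
            revoke("grace period expired");
        }
    }

    private void revoke(String reason) {
        if (leader) {
            leader = false;
            events.add("revoked: " + reason);
        }
        suspendedAt = -1L;
    }

    boolean isLeader() {
        return leader;
    }

    List<String> events() {
        return events;
    }

    public static void main(String[] args) {
        LeadershipGuard guard = new LeadershipGuard(1_000);
        guard.onStateChange(ConnState.SUSPENDED, 0);     // ZK leader restarts
        guard.tick(500);                                 // still inside the grace period
        guard.onStateChange(ConnState.RECONNECTED, 600); // client reconnects in time
        guard.events().forEach(System.out::println);
        System.out.println("still leader: " + guard.isLeader());
    }
}
```

The sketch only records "re-validate leadership" as an event. A real fix would
also have to read the leader node in ZK after reconnecting and compare it with
the old session ID before confirming a new one.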

Could someone with more experience please take a look as well?

Thanks!
Gyula

Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, 25 Sept 2017, 16:43):

> Curator seems to auto-reconnect anyway; the problem might be that a new
> leader is elected before the old JM can reconnect. We will try to
> experiment with this tomorrow to see if increasing the timeouts does any good.
>
> Gyula
>
> Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, 25 Sept 2017, 15:39):
>
>> I will try to check what Stephan suggested and get back to you!
>>
>> Thanks for the feedback
>>
>> Gyula
>>
>> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>>
>>> I think the question is whether the connection should be lost in the case
>>> of a rolling ZK update.
>>>
>>> There should always be a quorum online, so Curator should always be able to
>>> connect. So there is no need to revoke leadership.
>>>
>>> @gyula - can you check whether there is an option in Curator to reconnect
>>> to another quorum peer if one goes down?
>>>
>>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>
>>> > Hi Gyula,
>>> >
>>> > Flink internally uses the Curator LeaderLatch recipe to do leader election.
>>> > The LeaderLatch will revoke the leadership of a contender in case of a
>>> > SUSPENDED or LOST connection to the ZooKeeper quorum. The assumption here
>>> > is that if you cannot talk to ZooKeeper, then we can no longer be sure that
>>> > you are the leader.
>>> >
>>> > Consequently, if you do a rolling update of your ZooKeeper cluster which
>>> > causes client connections to be lost or suspended, then it will trigger a
>>> > restart of the Flink job upon reacquiring the leadership.
>>> >
>>> > Cheers,
>>> > Till
>>> >
>>> > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >
>>> > > We are using 1.3.2
>>> > >
>>> > > Gyula
>>> > >
>>> > > On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>>> > >
>>> > > > Which release are you using?
>>> > > >
>>> > > > Flink 1.3.2 uses Curator 2.12.0, which solves some leader election issues.
>>> > > >
>>> > > > Mind giving 1.3.2 a try?
>>> > > >
>>> > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> > > >
>>> > > > > Hi all,
>>> > > > >
>>> > > > > We have observed that in case some nodes of the ZK cluster are restarted
>>> > > > > (for a rolling restart) the Flink Streaming jobs fail (and restart).
>>> > > > >
>>> > > > > Log excerpt:
>>> > > > >
>>> > > > > 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
>>> > > > > 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>> > > > > 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
>>> > > > > 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
>>> > > > > 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
>>> > > > > java.lang.Exception: JobManager is no longer the leader.
>>> > > > >
>>> > > > >
>>> > > > > Is this the expected behaviour?
>>> > > > >
>>> > > > > Thanks,
>>> > > > > Gyula
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
