On a second iteration, the whole problem seems to stem from the fact that we revoke leadership from the JM as soon as the notLeader method is called, before waiting for a new leader to be elected. Ideally we would wait until isLeader is called again and check who the previous leader was, but I can see how that might lead to split-brain scenarios if the previous leader loses its connection to ZK while still maintaining its connections to the TMs. A rough sketch of what I mean is right below, and there is a short PS on Stephan's reconnect question at the very bottom, after the quoted thread.
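Something along these lines is what I have in mind (a completely untested sketch on top of Curator's ConnectionStateListener; the revokeLeadership callback is a made-up placeholder for whatever the JobManager side would actually do, and the session timeout would come from the ZK client config). It would be registered via client.getConnectionStateListenable().addListener(...):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Sketch: instead of revoking leadership the moment the connection is
// SUSPENDED, give the client until the session timeout to reconnect and
// only revoke on LOST or when the grace period expires.
public class DeferredRevocationListener implements ConnectionStateListener {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long sessionTimeoutMs;
    private final Runnable revokeLeadership; // placeholder for the JM-side callback

    private ScheduledFuture<?> pendingRevocation;

    public DeferredRevocationListener(long sessionTimeoutMs, Runnable revokeLeadership) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.revokeLeadership = revokeLeadership;
    }

    @Override
    public synchronized void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // The session may still be alive; schedule the revocation
                // instead of running it immediately.
                pendingRevocation = scheduler.schedule(
                        revokeLeadership, sessionTimeoutMs, TimeUnit.MILLISECONDS);
                break;
            case RECONNECTED:
                // Reconnected within the session timeout: keep leadership.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                }
                break;
            case LOST:
                // The session is gone; another contender may already lead.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                }
                revokeLeadership.run();
                break;
            default:
                break;
        }
    }
}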
Gyula

On Tue, Sep 26, 2017 at 18:34, Gyula Fóra <gyula.f...@gmail.com> wrote:

> Hi,
>
> I did some experimenting and found something that is interesting and looks
> off.
>
> The only problem is when the ZK leader is restarted; it is not related to
> any retry/reconnect logic (and not affected by the timeout setting).
> I think the following is happening (based on the logs:
> https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):
>
> 1. The connection is suspended and the notLeader method is called -> it
> revokes leadership without checking anything and kills the jobs.
> 2. The client reconnects, and isLeader and confirmLeaderSessionID are
> called (before nodeChanged) -> the old confirmed session id in ZK is
> overwritten with the new one before it is checked, making recovery in
> nodeChanged impossible.
>
> I am probably not completely aware of the subtleties of this problem, but
> it seems to me that we should not immediately revoke leadership and fail
> jobs on SUSPENDED, and it would also be nice if nodeChanged were called
> before confirmLeaderSessionID.
>
> Could someone with more experience please take a look as well?
>
> Thanks!
> Gyula
>
> On Mon, Sep 25, 2017 at 16:43, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
>> Curator seems to auto-reconnect anyway; the problem might be that a new
>> leader is elected before the old JM can reconnect. We will try to
>> experiment with this tomorrow to see if increasing the timeouts does any
>> good.
>>
>> Gyula
>>
>> On Mon, Sep 25, 2017 at 15:39, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> I will try to check what Stephan suggested and get back to you!
>>>
>>> Thanks for the feedback.
>>>
>>> Gyula
>>>
>>> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>>>
>>>> I think the question is whether the connection should be lost in the
>>>> case of a rolling ZK update.
>>>>
>>>> There should always be a quorum online, so Curator should always be
>>>> able to connect. So there is no need to revoke leadership.
>>>>
>>>> @gyula - can you check whether there is an option in Curator to
>>>> reconnect to another quorum peer if one goes down?
>>>>
>>>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Gyula,
>>>>>
>>>>> Flink internally uses the Curator LeaderLatch recipe for leader
>>>>> election. The LeaderLatch revokes the leadership of a contender in
>>>>> case of a SUSPENDED or LOST connection to the ZooKeeper quorum. The
>>>>> assumption here is that if you cannot talk to ZooKeeper, you can no
>>>>> longer be sure that you are the leader.
>>>>>
>>>>> Consequently, if you do a rolling update of your ZooKeeper cluster
>>>>> which causes client connections to be lost or suspended, it will
>>>>> trigger a restart of the Flink job upon reacquiring the leadership.
>>>>>
>>>>> Cheers,
>>>>> Till
>>>>>
>>>>> On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> We are using 1.3.2.
>>>>>>
>>>>>> Gyula
>>>>>>
>>>>>> On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Which release are you using?
>>>>>>>
>>>>>>> Flink 1.3.2 uses Curator 2.12.0, which solves some leader election
>>>>>>> issues.
>>>>>>>
>>>>>>> Mind giving 1.3.2 a try?
>>>>>>>
>>>>>>> On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> We have observed that when some nodes of the ZK cluster are
>>>>>>>> restarted (for a rolling restart), the Flink streaming jobs fail
>>>>>>>> (and restart).
>>>>>>>>
>>>>>>>> Log excerpt:
>>>>>>>>
>>>>>>>> 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn
>>>>>>>>   - Unable to read additional data from server sessionid
>>>>>>>> 0x15cba6e1a239774, likely server has closed socket, closing socket
>>>>>>>> connection and attempting reconnect
>>>>>>>> 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager
>>>>>>>>   - State change: SUSPENDED
>>>>>>>> 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>>>>>>>>   - Connection to ZooKeeper suspended. The contender
>>>>>>>> akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no
>>>>>>>> longer participates in the leader election.
>>>>>>>> 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>   - Connection to ZooKeeper suspended. Can no longer retrieve the
>>>>>>>> leader from ZooKeeper.
>>>>>>>> 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService
>>>>>>>>   - Connection to ZooKeeper suspended. Can no longer retrieve the
>>>>>>>> leader from ZooKeeper.
>>>>>>>> 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore
>>>>>>>>   - ZooKeeper connection SUSPENDED. Changes to the submitted job
>>>>>>>> graphs are not monitored (temporarily).
>>>>>>>> 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager
>>>>>>>>   - JobManager akka://flink/user/jobmanager#-317276879 was revoked
>>>>>>>> leadership.
>>>>>>>> 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>>>>>>>>   - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched
>>>>>>>> from state RUNNING to SUSPENDED.
>>>>>>>> java.lang.Exception: JobManager is no longer the leader.
>>>>>>>>
>>>>>>>> Is this the expected behaviour?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gyula
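PS @Stephan: as far as I can tell there is no separate Curator option needed for this; the ZooKeeper client already rotates over every host in the connect string when it reconnects, so as long as our quorum config lists all peers, Curator should pick a surviving one, and the retry policy only controls how persistently it tries. Roughly like this (illustrative host names and timeout values; plain Curator package names here, in Flink they are shaded under org.apache.flink.shaded):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class QuorumReconnectExample {
    public static void main(String[] args) {
        // Listing all quorum peers lets the client fail over to a live
        // server when the one it is connected to is restarted.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", // all quorum peers
                60_000,                       // session timeout (ms), illustrative
                15_000,                       // connection timeout (ms), illustrative
                new ExponentialBackoffRetry(1000, 10));
        client.start();
    }
}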