Yes, this sounds like a good compromise for the moment. We could offer it as a special HighAvailabilityServices implementation with loosened split-brain safety guarantees but hardened tolerance of connection suspensions.
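To sketch what I have in mind (purely illustrative, not actual Flink code: the class name, the grace period and the revokeLeadership hook are made up, only Curator's ConnectionStateListener is real), such a service would not revoke leadership on the first SUSPENDED event, but start a grace timer and only revoke if the connection is not re-established in time, or is reported LOST:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

/**
 * Hypothetical listener: tolerate a suspended ZooKeeper connection for a
 * grace period instead of revoking leadership immediately.
 */
public class GracePeriodConnectionListener implements ConnectionStateListener {

    private final long gracePeriodMillis;
    private final Runnable revokeLeadership; // hook into the leader contender, not a real Flink API
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private volatile ScheduledFuture<?> pendingRevocation;

    public GracePeriodConnectionListener(long gracePeriodMillis, Runnable revokeLeadership) {
        this.gracePeriodMillis = gracePeriodMillis;
        this.revokeLeadership = revokeLeadership;
    }

    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // connection is in doubt: start the grace timer instead of revoking right away
                pendingRevocation =
                        scheduler.schedule(revokeLeadership, gracePeriodMillis, TimeUnit.MILLISECONDS);
                break;
            case RECONNECTED:
                // re-connected within the grace period: keep the leadership
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                }
                break;
            case LOST:
                // session is definitely gone: revoke immediately
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                }
                revokeLeadership.run();
                break;
            default:
                // CONNECTED / READ_ONLY: nothing to do
                break;
        }
    }
}

Wired into such a HighAvailabilityServices implementation, a rolling ZK restart that only suspends the connection for a moment would no longer revoke leadership and fail the jobs, at the price of a somewhat wider split-brain window.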
Cheers,
Till

On Thu, Sep 28, 2017 at 8:00 PM, Stephan Ewen <step...@data-artisans.com> wrote:

> Hi!
>
> Good discussion!
>
> Seems the right long-term fix is the JM / TM reconciliation without failure, as Till pointed out.
>
> Another possibility could be to have a small timeout (say by default 5s or so) in which the Leader Service waits for either a re-connection or a new leader election before notifying the current leader.
>
> What do you think?
>
> Stephan
>
> On Wed, Sep 27, 2017 at 11:17 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>
>> I agree that this is not very nice and can put a lot of stress on your cluster.
>>
>> There is actually an open issue for exactly this [1] and also a PR [2]. The problem is that in the general case it will allow for split-brain situations and therefore it has not been merged yet.
>>
>> I'm actually not quite sure whether YARN can give you strict guarantees that at any moment there is at most one AM running. I suspect that this is not the case and, thus, you could risk running into the split-brain problem there as well.
>>
>> I think a proper solution for this problem could be the recovery of running jobs [3]. With that the TMs could continue executing the jobs even if there is no leader anymore. The new leader (which could be the same JM) would then recover the jobs from the TMs without having to restart them. This feature, however, still needs some more work to be finalized.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-6174
>> [2] https://github.com/apache/flink/pull/3599
>> [3] https://issues.apache.org/jira/browse/FLINK-5703
>>
>> Cheers,
>> Till
>>
>> On Wed, Sep 27, 2017 at 10:58 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>
>>> Hi Till,
>>> Thanks for the explanation. Yes, this sounds like a hard problem, but it just seems wrong that whenever the ZK leader is restarted, all the Flink jobs on a cluster fail.
>>> This might be within the overall guarantees of the system but can lead to some cascading failures if every job recovers at the same time in larger deployments.
>>>
>>> Maybe this is easier to avoid in certain setups, for instance in YARN, where we only run a single JM at any given time anyway.
>>>
>>> Gyula
>>>
>>> On Wed, Sep 27, 2017 at 10:49, Till Rohrmann <t...@data-artisans.com> wrote:
>>>
>>> > Hi Gyula,
>>> >
>>> > If we don't listen to the LeaderLatch#notLeader call but instead wait until we see (via the NodeCache) new leader information being written to the leader path in order to revoke leadership, then we potentially end up running the same job twice. Even though this can theoretically already happen, namely during the gap between the server and the client noticing the lost connection, this gap should be practically non-existent. If we change the behaviour, then this gap could potentially grow quite large, leading to all kinds of undesired side effects. E.g. if the sink operation is not idempotent, then one might easily end up thwarting one's exactly-once processing guarantees.
>>> >
>>> > I'm not sure whether we want to sacrifice the guarantee of not having to deal with a split-brain scenario, but I can see the benefits of not immediately revoking the leadership if one can guarantee that there will never be two JMs competing for the leadership. However, in the general case, this should be hard to do.
>>> >
>>> > Cheers,
>>> > Till
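(To make the alternative described above a bit more concrete: a NodeCache-based variant would watch the leader path and only step down once somebody else's leader information has actually been written there. The sketch below is purely illustrative and not Flink's implementation; the leader path and the revoke hook are placeholders, only the Curator NodeCache API is real. As noted above, this is exactly the variant that opens up the split-brain window.)

import java.util.Arrays;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.ChildData;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.curator.framework.recipes.cache.NodeCacheListener;

/**
 * Illustrative only: revoke leadership when different leader information
 * appears on the leader path, instead of reacting to LeaderLatch#notLeader.
 */
public class LeaderPathWatcher {

    public static NodeCache watchLeaderPath(
            CuratorFramework client,
            String leaderPath,             // the cluster's leader znode, placeholder
            byte[] ownLeaderInformation,   // what this JM wrote when it became leader
            Runnable revokeLeadership) throws Exception {

        NodeCache cache = new NodeCache(client, leaderPath);
        cache.getListenable().addListener(new NodeCacheListener() {
            @Override
            public void nodeChanged() {
                ChildData data = cache.getCurrentData();
                // only step down once somebody else has published their leader information
                if (data != null && !Arrays.equals(data.getData(), ownLeaderInformation)) {
                    revokeLeadership.run();
                }
            }
        });
        cache.start(true); // build the initial cache before returning
        return cache;
    }
}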
>>> > On Wed, Sep 27, 2017 at 9:22 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >
>>> >> On a second iteration, the whole problem seems to stem from the fact that we revoke leadership from the JM when the notLeader method is called, before waiting for a new leader to be elected. Ideally we should wait until isLeader is called again to check who the previous leader was, but I can see how this might lead to split-brain scenarios if the previous leader loses connection to ZK while still maintaining connection to the TMs.
>>> >>
>>> >> Gyula
>>> >>
>>> >> On Tue, Sep 26, 2017 at 18:34, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >>
>>> >>> Hi,
>>> >>>
>>> >>> I did some experimenting and found something that is interesting and looks off.
>>> >>>
>>> >>> So the only problem is when the ZK leader is restarted, not related to any retry/reconnect logic (not affected by the timeout setting).
>>> >>> I think the following is happening (based on the logs https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):
>>> >>>
>>> >>> 1. Connection is suspended, the notLeader method is called -> revokes leadership without checking anything, kills jobs
>>> >>> 2. Reconnects, the isLeader and confirmLeaderSessionID methods are called (before nodeChanged) -> overwrites the old confirmed session id in ZK with the new one before checking (making recovery impossible in nodeChanged)
>>> >>>
>>> >>> I am probably not completely aware of the subtleties of this problem, but it seems to me that we should not immediately revoke leadership and fail jobs on suspended, and it would also be nice if nodeChanged were called before confirmLeaderSessionID.
>>> >>>
>>> >>> Could someone with more experience please take a look as well?
>>> >>>
>>> >>> Thanks!
>>> >>> Gyula
>>> >>>
>>> >>> On Mon, Sep 25, 2017 at 16:43, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >>>
>>> >>>> Curator seems to auto-reconnect anyway; the problem might be that there is a new leader elected before the old JM could reconnect. We will try to experiment with this tomorrow to see if increasing the timeouts does any good.
>>> >>>>
>>> >>>> Gyula
>>> >>>>
>>> >>>> On Mon, Sep 25, 2017 at 15:39, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >>>>
>>> >>>>> I will try to check what Stephan suggested and get back to you!
>>> >>>>>
>>> >>>>> Thanks for the feedback
>>> >>>>>
>>> >>>>> Gyula
>>> >>>>>
>>> >>>>> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>>> >>>>>
>>> >>>>>> I think the question is whether the connection should be lost in the case of a rolling ZK update.
>>> >>>>>>
>>> >>>>>> There should always be a quorum online, so Curator should always be able to connect. So there is no need to revoke leadership.
>>> >>>>>>
>>> >>>>>> @gyula - can you check whether there is an option in Curator to reconnect to another quorum peer if one goes down?
>>> >>>>>>
>>> >>>>>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>>> >>>>>>
>>> >>>>>> > Hi Gyula,
>>> >>>>>> >
>>> >>>>>> > Flink internally uses the Curator LeaderLatch recipe to do leader election.
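(For readers who have not used it directly: the LeaderLatch recipe looks roughly like the sketch below. The quorum string, timeouts, latch path and listener bodies are placeholders and not Flink's actual configuration; the isLeader/notLeader callbacks are the ones discussed in this thread, and notLeader is also what fires when the connection is SUSPENDED or LOST. Passing the whole quorum in the connection string together with a retry policy is, as far as I know, what lets the client reconnect to another peer when one goes down, which touches on Stephan's question above.)

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchSketch {

    public static void main(String[] args) throws Exception {
        // comma-separated quorum: the client can fail over to another peer if one goes down
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",           // placeholder quorum
                60000,                                  // session timeout (ms)
                15000,                                  // connection timeout (ms)
                new ExponentialBackoffRetry(1000, 10)); // retry policy for reconnect attempts
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/leader-latch-demo"); // placeholder latch path
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // grant leadership to the contender (the JobManager in Flink's case)
            }

            @Override
            public void notLeader() {
                // revoke leadership; with the default behaviour this also fires
                // on SUSPENDED/LOST connections, which is what restarts the jobs
            }
        });
        latch.start();

        latch.await(); // block until this contender becomes leader
    }
}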
>>> >>>>>> > The LeaderLatch will revoke the leadership of a contender in case of a SUSPENDED or LOST connection to the ZooKeeper quorum. The assumption here is that if you cannot talk to ZooKeeper, then we can no longer be sure that you are the leader.
>>> >>>>>> >
>>> >>>>>> > Consequently, if you do a rolling update of your ZooKeeper cluster which causes client connections to be lost or suspended, then it will trigger a restart of the Flink job upon reacquiring the leadership.
>>> >>>>>> >
>>> >>>>>> > Cheers,
>>> >>>>>> > Till
>>> >>>>>> >
>>> >>>>>> > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >>>>>> >
>>> >>>>>> > > We are using 1.3.2
>>> >>>>>> > >
>>> >>>>>> > > Gyula
>>> >>>>>> > >
>>> >>>>>> > > On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>>> >>>>>> > >
>>> >>>>>> > > > Which release are you using?
>>> >>>>>> > > >
>>> >>>>>> > > > Flink 1.3.2 uses Curator 2.12.0, which solves some leader election issues.
>>> >>>>>> > > >
>>> >>>>>> > > > Mind giving 1.3.2 a try?
>>> >>>>>> > > >
>>> >>>>>> > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>> >>>>>> > > >
>>> >>>>>> > > > > Hi all,
>>> >>>>>> > > > >
>>> >>>>>> > > > > We have observed that in case some nodes of the ZK cluster are restarted (for a rolling restart), the Flink Streaming jobs fail (and restart).
>>> >>>>>> > > > >
>>> >>>>>> > > > > Log excerpt:
>>> >>>>>> > > > >
>>> >>>>>> > > > > 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
>>> >>>>>> > > > > 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>> >>>>>> > > > > 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
>>> >>>>>> > > > > 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
>>> >>>>>> > > > > 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
>>> >>>>>> > > > > java.lang.Exception: JobManager is no longer the leader.
>>> >>>>>> > > > >
>>> >>>>>> > > > > Is this the expected behaviour?
>>> >>>>>> > > > >
>>> >>>>>> > > > > Thanks,
>>> >>>>>> > > > > Gyula
>>> >
>>> > --
>>> > Data Artisans GmbH | Stresemannstrasse 121a | 10963 Berlin
>>> >
>>> > i...@data-artisans.com
>>> > phone +493055599146
>>> > mobile +491715521046
>>> >
>>> > Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>> > Managing Directors: Kostas Tzoumas, Stephan Ewen