Hi,

I think Stephan's idea is good (though I am not sure how small a timeout would be enough). If it's easy to add, then we should definitely have this as an optional setting :) Otherwise, if it's too big of an effort, we could just stick with the plans for the proper solution, as this is not super critical.
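Just to make the suggestion a bit more concrete, here is a rough, untested sketch of what such a grace period around the connection state notifications could look like. The class name and the revokeLeadership hook are made up for illustration and are not the actual Flink code; only the Curator ConnectionStateListener/ConnectionState types are real:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

/**
 * Sketch only: instead of revoking leadership the moment the ZK connection
 * is SUSPENDED, wait for a small grace period and only revoke if the
 * connection is not re-established (or is reported LOST) in the meantime.
 * The revokeLeadership hook is hypothetical; in Flink it would correspond
 * to whatever the leader election service does today in notLeader().
 */
public class GracePeriodConnectionListener implements ConnectionStateListener {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long gracePeriodMillis;          // e.g. 5000, per Stephan's suggestion
    private final Runnable revokeLeadership;       // hypothetical hook into the leader election service
    private ScheduledFuture<?> pendingRevocation;  // scheduled revocation, if any

    public GracePeriodConnectionListener(long gracePeriodMillis, Runnable revokeLeadership) {
        this.gracePeriodMillis = gracePeriodMillis;
        this.revokeLeadership = revokeLeadership;
    }

    @Override
    public synchronized void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Do not revoke right away; give the client a chance to reconnect.
                if (pendingRevocation == null) {
                    pendingRevocation = scheduler.schedule(
                            revokeLeadership, gracePeriodMillis, TimeUnit.MILLISECONDS);
                }
                break;
            case RECONNECTED:
                // Connection came back within the grace period: keep leadership.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                    pendingRevocation = null;
                }
                break;
            case LOST:
                // Session is definitely gone; revoke immediately.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                    pendingRevocation = null;
                }
                revokeLeadership.run();
                break;
            default:
                break;
        }
    }
}

On RECONNECTED one would probably also have to re-check (e.g. via the leader path) that no other contender was granted leadership in the meantime, which is exactly the split-brain concern Till describes below.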
Cheers,
Gyula

On Fri, Sep 29, 2017 at 9:48, Till Rohrmann <trohrm...@apache.org> wrote:

> Yes, this sounds like a good compromise for the moment. We could offer it as a special HighAvailabilityServices implementation with loosened split-brain safety guarantees but hardened connection suspension tolerance.
>
> Cheers,
> Till
>
> On Thu, Sep 28, 2017 at 8:00 PM, Stephan Ewen <step...@data-artisans.com> wrote:
>
>> Hi!
>>
>> Good discussion!
>>
>> Seems the right long-term fix is the JM / TM reconciliation without failure, as Till pointed out.
>>
>> Another possibility could be to have a small timeout (say by default 5s or so) in which the Leader Service waits for either a re-connection or a new leader election before notifying the current leader.
>>
>> What do you think?
>>
>> Stephan
>>
>> On Wed, Sep 27, 2017 at 11:17 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> I agree that this is not very nice and can put a lot of stress on your cluster.
>>>
>>> There is actually an open issue for exactly this [1] and also a PR [2]. The problem is that in the general case it will allow for split-brain situations, and therefore it has not been merged yet.
>>>
>>> I'm actually not quite sure whether YARN can give you strict guarantees that at any moment there is at most one AM running. I suspect that this is not the case and, thus, you could risk running into the split-brain problem there as well.
>>>
>>> I think a proper solution for this problem could be the recovery of running jobs [3]. With that, the TMs could continue executing the jobs even if there is no leader anymore. The new leader (which could be the same JM) would then recover the jobs from the TMs without having to restart them. This feature, however, still needs some more work to be finalized.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-6174
>>> [2] https://github.com/apache/flink/pull/3599
>>> [3] https://issues.apache.org/jira/browse/FLINK-5703
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Sep 27, 2017 at 10:58 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>>> Hi Till,
>>>> Thanks for the explanation. Yes, this sounds like a hard problem, but it just seems wrong that whenever the ZK leader is restarted, all the Flink jobs on a cluster fail. This might be within the overall guarantees of the system, but it can lead to cascading failures if every job recovers at the same time in larger deployments.
>>>>
>>>> Maybe this is easier to avoid in certain setups, for instance in YARN, where we only run a single JM at any given time anyways.
>>>>
>>>> Gyula
>>>>
>>>> On Wed, Sep 27, 2017 at 10:49, Till Rohrmann <t...@data-artisans.com> wrote:
>>>>
>>>> > Hi Gyula,
>>>> >
>>>> > if we don't listen to the LeaderLatch#notLeader call but instead wait until we see (via the NodeCache) new leader information being written to the leader path in order to revoke leadership, then we potentially end up running the same job twice. Even though this can theoretically already happen, namely during the gap between the server and the client noticing the lost connection, this gap should be practically non-existent. If we change the behaviour, then this gap could potentially grow quite large, leading to all kinds of undesired side effects. E.g. if the sink operation is not idempotent, then one might easily end up thwarting one's exactly-once processing guarantees.
>>>> >
>>>> > I'm not sure whether we want to sacrifice the guarantee of not having to deal with a split-brain scenario, but I can see the benefits of not immediately revoking the leadership if one can guarantee that there will never be two JMs competing for the leadership. However, in the general case, this should be hard to do.
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Wed, Sep 27, 2017 at 9:22 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >
>>>> >> On a second iteration, the whole problem seems to stem from the fact that we revoke leadership from the JM when the notLeader method is called, before waiting for a new leader to be elected. Ideally we should wait until isLeader is called again to check who the previous leader was, but I can see how this might lead to split-brain scenarios if the previous leader loses its connection to ZK while still maintaining its connections to the TMs.
>>>> >>
>>>> >> Gyula
>>>> >>
>>>> >> On Tue, Sep 26, 2017 at 18:34, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I did some experimenting and found something that is interesting and looks off.
>>>> >>>
>>>> >>> So the only problem is when the ZK leader is restarted; it is not related to any retry/reconnect logic (not affected by the timeout setting). I think the following is happening (based on the logs: https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):
>>>> >>>
>>>> >>> 1. Connection is suspended, the notLeader method is called -> revokes leadership without checking anything, kills jobs
>>>> >>> 2. Reconnects, the isLeader and confirmLeaderSessionID methods are called (before nodeChanged) -> overwrites the old confirmed session id in ZK with the new one before checking (making recovery impossible in nodeChanged)
>>>> >>>
>>>> >>> I am probably not completely aware of the subtleties of this problem, but it seems to me that we should not immediately revoke leadership and fail jobs on suspended, and it would also be nice if nodeChanged were called before confirmLeaderSessionID.
>>>> >>>
>>>> >>> Could someone with more experience please take a look as well?
>>>> >>>
>>>> >>> Thanks!
>>>> >>> Gyula
>>>> >>>
>>>> >>> On Mon, Sep 25, 2017 at 16:43, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>
>>>> >>>> Curator seems to auto reconnect anyways; the problem might be that a new leader is elected before the old JM can reconnect. We will try to experiment with this tomorrow to see if increasing the timeouts does any good.
>>>> >>>>
>>>> >>>> Gyula
>>>> >>>>
>>>> >>>> On Mon, Sep 25, 2017 at 15:39, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> I will try to check what Stephan suggested and get back to you!
>>>> >>>>>
>>>> >>>>> Thanks for the feedback.
>>>> >>>>>
>>>> >>>>> Gyula
>>>> >>>>>
>>>> >>>>> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>>>> >>>>>
>>>> >>>>>> I think the question is whether the connection should be lost in the case of a rolling ZK update.
>>>> >>>>>>
>>>> >>>>>> There should always be a quorum online, so Curator should always be able to connect. So there is no need to revoke leadership.
>>>> >>>>>>
>>>> >>>>>> @gyula - can you check whether there is an option in Curator to reconnect to another quorum peer if one goes down?
>>>> >>>>>>
>>>> >>>>>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>> >>>>>>
>>>> >>>>>> > Hi Gyula,
>>>> >>>>>> >
>>>> >>>>>> > Flink internally uses the Curator LeaderLatch recipe for leader election. The LeaderLatch will revoke the leadership of a contender in case of a SUSPENDED or LOST connection to the ZooKeeper quorum. The assumption here is that if you cannot talk to ZooKeeper, then we can no longer be sure that you are the leader.
>>>> >>>>>> >
>>>> >>>>>> > Consequently, if you do a rolling update of your ZooKeeper cluster which causes client connections to be lost or suspended, then it will trigger a restart of the Flink job upon reacquiring the leadership.
>>>> >>>>>> >
>>>> >>>>>> > Cheers,
>>>> >>>>>> > Till
>>>> >>>>>> >
>>>> >>>>>> > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>>>> >
>>>> >>>>>> > > We are using 1.3.2.
>>>> >>>>>> > >
>>>> >>>>>> > > Gyula
>>>> >>>>>> > >
>>>> >>>>>> > > On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>>>> >>>>>> > >
>>>> >>>>>> > > > Which release are you using?
>>>> >>>>>> > > >
>>>> >>>>>> > > > Flink 1.3.2 uses Curator 2.12.0, which solves some leader election issues.
>>>> >>>>>> > > >
>>>> >>>>>> > > > Mind giving 1.3.2 a try?
>>>> >>>>>> > > >
>>>> >>>>>> > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>>>> > > >
>>>> >>>>>> > > > > Hi all,
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > We have observed that in case some nodes of the ZK cluster are restarted (for a rolling restart), the Flink Streaming jobs fail (and restart).
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > Log excerpt:
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
>>>> >>>>>> > > > > 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
>>>> >>>>>> > > > > 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
>>>> >>>>>> > > > > java.lang.Exception: JobManager is no longer the leader.
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > Is this the expected behaviour?
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > Thanks,
>>>> >>>>>> > > > > Gyula
>>>> >
>>>> > --
>>>> > Data Artisans GmbH | Stresemannstrasse 121a | 10963 Berlin
>>>> >
>>>> > i...@data-artisans.com
>>>> > phone +493055599146
>>>> > mobile +491715521046
>>>> >
>>>> > Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>>> > Managing Directors: Kostas Tzoumas, Stephan Ewen
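
PS: Regarding Stephan's question further down the thread about reconnecting to another quorum peer: as far as I can tell, the ZooKeeper client (and therefore Curator) already tries all servers listed in the connect string, and Curator additionally retries according to its RetryPolicy, which matches what I saw when experimenting ("Curator seems to auto reconnect anyways"). For reference, a minimal standalone snippet along the lines of what I used for testing; the host names and timeout values below are just examples, not our production configuration:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorReconnectTest {

    public static void main(String[] args) throws Exception {
        // All quorum peers are listed; if one goes down, the client tries the others.
        String connectString = "zk1:2181,zk2:2181,zk3:2181"; // example hosts

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString,
                60_000,   // session timeout (ms)
                15_000,   // connection timeout (ms)
                new ExponentialBackoffRetry(1_000, 10)); // base sleep 1s, max 10 retries

        // Log the SUSPENDED / RECONNECTED / LOST transitions during a rolling ZK restart.
        client.getConnectionStateListenable().addListener(
                (CuratorFramework c, ConnectionState newState) ->
                        System.out.println("Connection state: " + newState));

        client.start();
        Thread.sleep(Long.MAX_VALUE); // keep running while restarting ZK nodes one by one
    }
}

With this running, a rolling restart of the ZK nodes should only produce SUSPENDED followed by RECONNECTED (as in the gist above), so the real question is what we do during the SUSPENDED window rather than whether the client reconnects at all.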