Thanks Till for pointing these issues out, you are right, this is exactly the same problem!
Gyula

Till Rohrmann <trohrm...@apache.org> wrote (on Wed, Sep 27, 2017 at 11:18):

> I agree that this is not very nice and can put a lot of stress on your
> cluster.
>
> There is actually an open issue for exactly this [1] and also a PR [2].
> The problem is that in the general case it would allow for split-brain
> situations, and therefore it has not been merged yet.
>
> I'm actually not quite sure whether YARN can give you strict guarantees
> that at any moment there is at most one AM running. I suspect that this
> is not the case and, thus, you risk running into the split-brain problem
> there as well.
>
> I think a proper solution for this problem could be the recovery of
> running jobs [3]. With that, the TMs could continue executing their jobs
> even if there is no leader anymore. The new leader (which could be the
> same JM) would then recover the jobs from the TMs without having to
> restart them. This feature, however, still needs some more work to be
> finalized.
>
> [1] https://issues.apache.org/jira/browse/FLINK-6174
> [2] https://github.com/apache/flink/pull/3599
> [3] https://issues.apache.org/jira/browse/FLINK-5703
>
> Cheers,
> Till
>
> On Wed, Sep 27, 2017 at 10:58 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> > Hi Till,
> > Thanks for the explanation. Yes, this sounds like a hard problem, but
> > it just seems wrong that whenever the ZK leader is restarted, all the
> > Flink jobs on a cluster fail.
> > This might be within the overall guarantees of the system, but it can
> > lead to cascading failures in larger deployments if every job recovers
> > at the same time.
> >
> > Maybe this is easier to avoid in certain setups, for instance in YARN,
> > where we only run a single JM at any given time anyway.
> >
> > Gyula
> >
> > Till Rohrmann <t...@data-artisans.com> wrote (on Wed, Sep 27, 2017 at
> > 10:49):
> >
> > > Hi Gyula,
> > >
> > > If we don't listen to the LeaderLatch#notLeader call but instead
> > > wait until we see (via the NodeCache) new leader information being
> > > written to the leader path in order to revoke leadership, then we
> > > potentially end up running the same job twice. Even though this can
> > > theoretically already happen, namely during the gap between the
> > > server and the client noticing the lost connection, this gap is
> > > practically non-existent. If we changed the behaviour, the gap could
> > > grow quite large, leading to all kinds of undesired side effects.
> > > E.g. if the sink operation is not idempotent, one might easily end
> > > up violating one's exactly-once processing guarantees.
> > >
> > > I'm not sure whether we want to sacrifice the guarantee of never
> > > having to deal with a split-brain scenario, but I can see the
> > > benefits of not immediately revoking the leadership if one can
> > > guarantee that there will never be two JMs competing for the
> > > leadership. In the general case, however, this is hard to guarantee.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Sep 27, 2017 at 9:22 AM, Gyula Fóra <gyula.f...@gmail.com>
> > > wrote:
> > >
> > > > On a second iteration, the whole problem seems to stem from the
> > > > fact that we revoke leadership from the JM when the notLeader
> > > > method is called, before waiting for a new leader to be elected.
> > > > Ideally we should wait until isLeader is called again to check who
> > > > the previous leader was, but I can see how this might lead to
> > > > split-brain scenarios if the previous leader loses its connection
> > > > to ZK while still maintaining its connections to the TMs.
> > > >
> > > > Gyula
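(For reference, the immediate revocation Gyula describes is easy to reproduce with a minimal Curator sketch, assuming a plain Apache Curator 2.12.0 dependency; the ensemble addresses and latch path below are made up. In Curator 2.x the LeaderLatch calls notLeader() as soon as the connection state changes to SUSPENDED, without waiting to see whether another contender actually takes over:)

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderLatchSuspendDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical three-node ensemble.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",
                    60000,  // session timeout (ms)
                    15000,  // connection timeout (ms)
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderLatch latch = new LeaderLatch(client, "/demo/leaderlatch");
            latch.addListener(new LeaderLatchListener() {
                @Override
                public void isLeader() {
                    System.out.println("Acquired leadership");
                }

                @Override
                public void notLeader() {
                    // Fires on SUSPENDED already, i.e. before any new
                    // leader has been elected.
                    System.out.println("Leadership revoked");
                }
            });
            latch.start();

            // Restart one ZK node and watch notLeader() fire.
            Thread.sleep(Long.MAX_VALUE);
        }
    }

(Restarting the ZK server the client is connected to is enough to trigger the notLeader() callback, which on the Flink side translates into the revoked leadership seen in the logs quoted further down.)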
> > > > Gyula Fóra <gyula.f...@gmail.com> wrote (on Tue, Sep 26, 2017 at
> > > > 18:34):
> > > >
> > > > > Hi,
> > > > >
> > > > > I did some experimenting and found something that is interesting
> > > > > and looks off.
> > > > >
> > > > > The only problem occurs when the ZK leader is restarted; it is
> > > > > not related to any retry/reconnect logic (and not affected by the
> > > > > timeout settings). I think the following is happening (based on
> > > > > the logs
> > > > > https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):
> > > > >
> > > > > 1. Connection is suspended, the notLeader method is called ->
> > > > > revokes leadership without checking anything, kills the jobs.
> > > > > 2. Reconnects; the isLeader and confirmLeaderSessionID methods
> > > > > are called (before nodeChanged) -> overwrites the old confirmed
> > > > > session id in ZK with the new one before checking it (making
> > > > > recovery impossible in nodeChanged).
> > > > >
> > > > > I am probably not completely aware of the subtleties of this
> > > > > problem, but it seems to me that we should not immediately revoke
> > > > > leadership and fail jobs on SUSPENDED, and it would also be nice
> > > > > if nodeChanged were called before confirmLeaderSessionID.
> > > > >
> > > > > Could someone with more experience please take a look as well?
> > > > >
> > > > > Thanks!
> > > > > Gyula
> > > > >
> > > > > Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, Sep 25, 2017 at
> > > > > 16:43):
> > > > >
> > > > > > Curator seems to auto-reconnect anyway; the problem might be
> > > > > > that a new leader is elected before the old JM can reconnect.
> > > > > > We will experiment with this tomorrow to see if increasing the
> > > > > > timeouts does any good.
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > > Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, Sep 25, 2017
> > > > > > at 15:39):
> > > > > >
> > > > > > > I will try to check what Stephan suggested and get back to
> > > > > > > you!
> > > > > > >
> > > > > > > Thanks for the feedback
> > > > > > >
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I think the question is whether the connection should be
> > > > > > > > lost at all in the case of a rolling ZK update.
> > > > > > > >
> > > > > > > > There should always be a quorum online, so Curator should
> > > > > > > > always be able to connect. So there is no need to revoke
> > > > > > > > leadership.
> > > > > > > >
> > > > > > > > @gyula - can you check whether there is an option in
> > > > > > > > Curator to reconnect to another quorum peer if one goes
> > > > > > > > down?
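(Regarding Stephan's question: the failover between quorum peers actually happens below Curator, in the ZooKeeper client itself, which iterates over all hosts in the connect string and reconnects to another server when the current one goes down; Curator only adds retry behaviour on top via its RetryPolicy. A minimal sketch for watching this during a rolling restart, with placeholder host names:)

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;
    import org.apache.curator.retry.RetryNTimes;

    public class RollingRestartObserver {
        public static void main(String[] args) throws Exception {
            // All ensemble members go into the connect string; the ZK
            // client picks another one when its current server dies.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new RetryNTimes(29, 1000));
            client.start();

            client.getConnectionStateListenable().addListener(
                    new ConnectionStateListener() {
                        @Override
                        public void stateChanged(CuratorFramework c,
                                                 ConnectionState state) {
                            // A rolling restart should show SUSPENDED
                            // followed by RECONNECTED for each bounced
                            // peer; the session itself can survive if the
                            // reconnect happens within the session timeout.
                            System.out.println("State change: " + state);
                        }
                    });

            Thread.sleep(Long.MAX_VALUE);
        }
    }

(So the reconnect option Stephan asks about exists by default; the issue discussed in this thread is that the SUSPENDED notification emitted during the gap is already enough for Flink to revoke leadership.)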
> > > > > > > > On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann
> > > > > > > > <trohrm...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi Gyula,
> > > > > > > > >
> > > > > > > > > Flink internally uses the Curator LeaderLatch recipe for
> > > > > > > > > leader election. The LeaderLatch will revoke the
> > > > > > > > > leadership of a contender in case of a SUSPENDED or LOST
> > > > > > > > > connection to the ZooKeeper quorum. The assumption here
> > > > > > > > > is that if you cannot talk to ZooKeeper, then we can no
> > > > > > > > > longer be sure that you are the leader.
> > > > > > > > >
> > > > > > > > > Consequently, if you do a rolling update of your
> > > > > > > > > ZooKeeper cluster which causes client connections to be
> > > > > > > > > lost or suspended, it will trigger a restart of the Flink
> > > > > > > > > job upon reacquiring the leadership.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
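(For completeness: the timeouts Gyula mentions experimenting with are the ZooKeeper client settings in flink-conf.yaml. The keys and defaults below are taken from the Flink 1.3 high-availability documentation; the quorum addresses are placeholders. As Gyula notes above, raising them did not help here, since SUSPENDED is triggered by the disconnect itself rather than by a timeout expiring.)

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    # Documented defaults, all in milliseconds except the retry count:
    high-availability.zookeeper.client.session-timeout: 60000
    high-availability.zookeeper.client.connection-timeout: 15000
    high-availability.zookeeper.client.retry-wait: 5000
    high-availability.zookeeper.client.max-retry-attempts: 3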
> > > > > > > > > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra
> > > > > > > > > <gyula.f...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > We are using 1.3.2
> > > > > > > > > >
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Fri, Sep 22, 2017, 17:13 Ted Yu
> > > > > > > > > > <yuzhih...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Which release are you using?
> > > > > > > > > > >
> > > > > > > > > > > Flink 1.3.2 uses Curator 2.12.0, which solves some
> > > > > > > > > > > leader election issues.
> > > > > > > > > > >
> > > > > > > > > > > Mind giving 1.3.2 a try?
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra
> > > > > > > > > > > <gyula.f...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > We have observed that when some nodes of the ZK
> > > > > > > > > > > > cluster are restarted (for a rolling restart), the
> > > > > > > > > > > > Flink streaming jobs fail (and restart).
> > > > > > > > > > > >
> > > > > > > > > > > > Log excerpt:
> > > > > > > > > > > >
> > > > > > > > > > > > 2017-09-22 12:54:41,426 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
> > > > > > > > > > > > 2017-09-22 12:54:41,527 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> > > > > > > > > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
> > > > > > > > > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > > > > > > > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > > > > > > > > > > 2017-09-22 12:54:41,530 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
> > > > > > > > > > > > 2017-09-22 12:54:41,530 INFO org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
> > > > > > > > > > > > 2017-09-22 12:54:41,532 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
> > > > > > > > > > > > java.lang.Exception: JobManager is no longer the leader.
> > > > > > > > > > > >
> > > > > > > > > > > > Is this the expected behaviour?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Gyula
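(As a closing illustration: the NodeCache-based alternative Till describes above, revoking leadership only once different leader information has actually been written to the leader path, could be sketched as follows. This is a hand-written sketch of the idea, not Flink's implementation; the leader path and the session-ID comparison are assumptions.)

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.ChildData;
    import org.apache.curator.framework.recipes.cache.NodeCache;
    import org.apache.curator.framework.recipes.cache.NodeCacheListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderPathWatcher {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",   // placeholder ensemble
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Watch the leader path and react only when new leader
            // information has actually been written, instead of revoking
            // leadership eagerly in notLeader().
            final NodeCache cache = new NodeCache(client, "/flink/leader");
            cache.getListenable().addListener(new NodeCacheListener() {
                @Override
                public void nodeChanged() {
                    ChildData data = cache.getCurrentData();
                    if (data != null) {
                        // A JM could compare the confirmed leader session
                        // ID stored here with its own and step down only
                        // on a mismatch.
                        System.out.println("Leader info changed: "
                                + new String(data.getData()));
                    }
                }
            });
            cache.start();

            Thread.sleep(Long.MAX_VALUE);
        }
    }

(As Till points out, the price of this approach is the potentially large window in which two JMs both believe they are the leader, which is why the corresponding PR [2] has not been merged.)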