Hi,

I think Stephan's idea is good (though I am not sure how small a timeout would be enough). If it's easy to add, then we should definitely have this as an optional setting :) Otherwise, if it's too big of an effort, we could just stick with the plans for the proper solution, as this is not super critical.
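Just to make the suggestion a bit more concrete, here is a rough, untested sketch of what such a grace period around the connection state notifications could look like. The class name and the revokeLeadership hook are made up for illustration and are not the actual Flink code; only the Curator ConnectionStateListener/ConnectionState types are real:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

/**
 * Sketch only: instead of revoking leadership the moment the ZK connection
 * is SUSPENDED, wait for a small grace period and only revoke if the
 * connection is not re-established (or is reported LOST) in the meantime.
 * The revokeLeadership hook is hypothetical; in Flink it would correspond
 * to whatever the leader election service does today in notLeader().
 */
public class GracePeriodConnectionListener implements ConnectionStateListener {

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final long gracePeriodMillis;          // e.g. 5000, per Stephan's suggestion
    private final Runnable revokeLeadership;       // hypothetical hook into the leader election service
    private ScheduledFuture<?> pendingRevocation;  // scheduled revocation, if any

    public GracePeriodConnectionListener(long gracePeriodMillis, Runnable revokeLeadership) {
        this.gracePeriodMillis = gracePeriodMillis;
        this.revokeLeadership = revokeLeadership;
    }

    @Override
    public synchronized void stateChanged(CuratorFramework client, ConnectionState newState) {
        switch (newState) {
            case SUSPENDED:
                // Do not revoke right away; give the client a chance to reconnect.
                if (pendingRevocation == null) {
                    pendingRevocation = scheduler.schedule(
                            revokeLeadership, gracePeriodMillis, TimeUnit.MILLISECONDS);
                }
                break;
            case RECONNECTED:
                // Connection came back within the grace period: keep leadership.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                    pendingRevocation = null;
                }
                break;
            case LOST:
                // Session is definitely gone; revoke immediately.
                if (pendingRevocation != null) {
                    pendingRevocation.cancel(false);
                    pendingRevocation = null;
                }
                revokeLeadership.run();
                break;
            default:
                break;
        }
    }
}

On RECONNECTED one would probably also have to re-check (e.g. via the leader path) that no other contender was granted leadership in the meantime, which is exactly the split-brain concern Till describes below.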
Cheers,
Gyula

On Fri, Sep 29, 2017 at 9:48, Till Rohrmann <trohrm...@apache.org> wrote:

> Yes, this sounds like a good compromise for the moment. We could offer it as a special HighAvailabilityServices implementation with loosened split-brain safety guarantees but hardened connection suspension tolerance.
>
> Cheers,
> Till
>
> On Thu, Sep 28, 2017 at 8:00 PM, Stephan Ewen <step...@data-artisans.com> wrote:
>
>> Hi!
>>
>> Good discussion!
>>
>> Seems the right long-term fix is the JM / TM reconciliation without failure, as Till pointed out.
>>
>> Another possibility could be to have a small timeout (say by default 5s or so) in which the Leader Service waits for either a re-connection or a new leader election before notifying the current leader.
>>
>> What do you think?
>>
>> Stephan
>>
>> On Wed, Sep 27, 2017 at 11:17 AM, Till Rohrmann <trohrm...@apache.org> wrote:
>>
>>> I agree that this is not very nice and can put a lot of stress on your cluster.
>>>
>>> There is actually an open issue for exactly this [1] and also a PR [2]. The problem is that in the general case it will allow for split-brain situations, and therefore it has not been merged yet.
>>>
>>> I'm actually not quite sure whether YARN can give you strict guarantees that at any moment there is at most one AM running. I suspect that this is not the case and, thus, you could risk running into the split-brain problem there as well.
>>>
>>> I think a proper solution for this problem could be the recovery of running jobs [3]. With that, the TMs could continue executing the jobs even if there is no leader anymore. The new leader (which could be the same JM) would then recover the jobs from the TMs without having to restart them. This feature, however, still needs some more work to be finalized.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-6174
>>> [2] https://github.com/apache/flink/pull/3599
>>> [3] https://issues.apache.org/jira/browse/FLINK-5703
>>>
>>> Cheers,
>>> Till
>>>
>>> On Wed, Sep 27, 2017 at 10:58 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>
>>>> Hi Till,
>>>> Thanks for the explanation. Yes, this sounds like a hard problem, but it just seems wrong that whenever the ZK leader is restarted, all the Flink jobs on a cluster fail. This might be within the overall guarantees of the system, but it can lead to cascading failures if every job recovers at the same time in larger deployments.
>>>>
>>>> Maybe this is easier to avoid in certain setups, for instance in YARN, where we only run a single JM at any given time anyways.
>>>>
>>>> Gyula
>>>>
>>>> On Wed, Sep 27, 2017 at 10:49, Till Rohrmann <t...@data-artisans.com> wrote:
>>>>
>>>> > Hi Gyula,
>>>> >
>>>> > if we don't listen to the LeaderLatch#notLeader call but instead wait until we see (via the NodeCache) new leader information being written to the leader path in order to revoke leadership, then we potentially end up running the same job twice. Even though this can theoretically already happen, namely during the gap between the server and the client noticing the lost connection, this gap should be practically non-existent. If we change the behaviour, then this gap could potentially grow quite large, leading to all kinds of undesired side effects. E.g. if the sink operation is not idempotent, then one might easily end up thwarting one's exactly-once processing guarantees.
>>>> >
>>>> > I'm not sure whether we want to sacrifice the guarantee of not having to deal with a split-brain scenario, but I can see the benefits of not immediately revoking the leadership if one can guarantee that there will never be two JMs competing for the leadership. However, in the general case, this should be hard to do.
>>>> >
>>>> > Cheers,
>>>> > Till
>>>> >
>>>> > On Wed, Sep 27, 2017 at 9:22 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >
>>>> >> On a second iteration, the whole problem seems to stem from the fact that we revoke leadership from the JM when the notLeader method is called, before waiting for a new leader to be elected. Ideally we should wait until isLeader is called again to check who the previous leader was, but I can see how this might lead to split-brain scenarios if the previous leader loses its connection to ZK while still maintaining its connections to the TMs.
>>>> >>
>>>> >> Gyula
>>>> >>
>>>> >> On Tue, Sep 26, 2017 at 18:34, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I did some experimenting and found something that is interesting and looks off.
>>>> >>>
>>>> >>> So the only problem is when the ZK leader is restarted; it is not related to any retry/reconnect logic (not affected by the timeout setting). I think the following is happening (based on the logs: https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):
>>>> >>>
>>>> >>> 1. Connection is suspended, the notLeader method is called -> revokes leadership without checking anything, kills jobs
>>>> >>> 2. Reconnects, the isLeader and confirmLeaderSessionID methods are called (before nodeChanged) -> overwrites the old confirmed session id in ZK with the new one before checking (making recovery impossible in nodeChanged)
>>>> >>>
>>>> >>> I am probably not completely aware of the subtleties of this problem, but it seems to me that we should not immediately revoke leadership and fail jobs on suspended, and it would also be nice if nodeChanged were called before confirmLeaderSessionID.
>>>> >>>
>>>> >>> Could someone with more experience please take a look as well?
>>>> >>>
>>>> >>> Thanks!
>>>> >>> Gyula
>>>> >>>
>>>> >>> On Mon, Sep 25, 2017 at 16:43, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>
>>>> >>>> Curator seems to auto reconnect anyways; the problem might be that a new leader is elected before the old JM can reconnect. We will try to experiment with this tomorrow to see if increasing the timeouts does any good.
>>>> >>>>
>>>> >>>> Gyula
>>>> >>>>
>>>> >>>> On Mon, Sep 25, 2017 at 15:39, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>>
>>>> >>>>> I will try to check what Stephan suggested and get back to you!
>>>> >>>>>
>>>> >>>>> Thanks for the feedback.
>>>> >>>>>
>>>> >>>>> Gyula
>>>> >>>>>
>>>> >>>>> On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org> wrote:
>>>> >>>>>
>>>> >>>>>> I think the question is whether the connection should be lost in the case of a rolling ZK update.
>>>> >>>>>>
>>>> >>>>>> There should always be a quorum online, so Curator should always be able to connect. So there is no need to revoke leadership.
>>>> >>>>>>
>>>> >>>>>> @gyula - can you check whether there is an option in Curator to reconnect to another quorum peer if one goes down?
>>>> >>>>>>
>>>> >>>>>> On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann <trohrm...@apache.org> wrote:
>>>> >>>>>>
>>>> >>>>>> > Hi Gyula,
>>>> >>>>>> >
>>>> >>>>>> > Flink internally uses the Curator LeaderLatch recipe for leader election. The LeaderLatch will revoke the leadership of a contender in case of a SUSPENDED or LOST connection to the ZooKeeper quorum. The assumption here is that if you cannot talk to ZooKeeper, then we can no longer be sure that you are the leader.
>>>> >>>>>> >
>>>> >>>>>> > Consequently, if you do a rolling update of your ZooKeeper cluster which causes client connections to be lost or suspended, then it will trigger a restart of the Flink job upon reacquiring the leadership.
>>>> >>>>>> >
>>>> >>>>>> > Cheers,
>>>> >>>>>> > Till
>>>> >>>>>> >
>>>> >>>>>> > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>>>> >
>>>> >>>>>> > > We are using 1.3.2.
>>>> >>>>>> > >
>>>> >>>>>> > > Gyula
>>>> >>>>>> > >
>>>> >>>>>> > > On Fri, Sep 22, 2017, 17:13 Ted Yu <yuzhih...@gmail.com> wrote:
>>>> >>>>>> > >
>>>> >>>>>> > > > Which release are you using?
>>>> >>>>>> > > >
>>>> >>>>>> > > > Flink 1.3.2 uses Curator 2.12.0, which solves some leader election issues.
>>>> >>>>>> > > >
>>>> >>>>>> > > > Mind giving 1.3.2 a try?
>>>> >>>>>> > > >
>>>> >>>>>> > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>>>> >>>>>> > > >
>>>> >>>>>> > > > > Hi all,
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > We have observed that in case some nodes of the ZK cluster are restarted (for a rolling restart), the Flink Streaming jobs fail (and restart).
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > Log excerpt:
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > 2017-09-22 12:54:41,426 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
>>>> >>>>>> > > > > 2017-09-22 12:54:41,527 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,528 WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,530 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
>>>> >>>>>> > > > > 2017-09-22 12:54:41,530 INFO  org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
>>>> >>>>>> > > > > 2017-09-22 12:54:41,532 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
>>>> >>>>>> > > > > java.lang.Exception: JobManager is no longer the leader.
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > Is this the expected behaviour?
>>>> >>>>>> > > > >
>>>> >>>>>> > > > > Thanks,
>>>> >>>>>> > > > > Gyula
>>>> >
>>>> > --
>>>> > Data Artisans GmbH | Stresemannstrasse 121a | 10963 Berlin
>>>> >
>>>> > i...@data-artisans.com
>>>> > phone +493055599146
>>>> > mobile +491715521046
>>>> >
>>>> > Registered at Amtsgericht Charlottenburg - HRB 158244 B
>>>> > Managing Directors: Kostas Tzoumas, Stephan Ewen
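
PS: Regarding Stephan's question further down the thread about reconnecting to another quorum peer: as far as I can tell, the ZooKeeper client (and therefore Curator) already tries all servers listed in the connect string, and Curator additionally retries according to its RetryPolicy, which matches what I saw when experimenting ("Curator seems to auto reconnect anyways"). For reference, a minimal standalone snippet along the lines of what I used for testing; the host names and timeout values below are just examples, not our production configuration:

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorReconnectTest {

    public static void main(String[] args) throws Exception {
        // All quorum peers are listed; if one goes down, the client tries the others.
        String connectString = "zk1:2181,zk2:2181,zk3:2181"; // example hosts

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString,
                60_000,   // session timeout (ms)
                15_000,   // connection timeout (ms)
                new ExponentialBackoffRetry(1_000, 10)); // base sleep 1s, max 10 retries

        // Log the SUSPENDED / RECONNECTED / LOST transitions during a rolling ZK restart.
        client.getConnectionStateListenable().addListener(
                (CuratorFramework c, ConnectionState newState) ->
                        System.out.println("Connection state: " + newState));

        client.start();
        Thread.sleep(Long.MAX_VALUE); // keep running while restarting ZK nodes one by one
    }
}

With this running, a rolling restart of the ZK nodes should only produce SUSPENDED followed by RECONNECTED (as in the gist above), so the real question is what we do during the SUSPENDED window rather than whether the client reconnects at all.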