Thanks Till for pointing these issues out, you are right, this is exactly the same problem!
Gyula

Till Rohrmann <trohrm...@apache.org> wrote (on Wed, Sep 27, 2017 at 11:18):

> I agree that this is not very nice and can put a lot of stress on your
> cluster.
>
> There is actually an open issue for exactly this [1] and also a PR [2].
> The problem is that in the general case it would allow for split-brain
> situations, and therefore it has not been merged yet.
>
> I'm actually not quite sure whether YARN can give you strict guarantees
> that at any moment there is at most one AM running. I suspect that this
> is not the case and, thus, you risk running into the split-brain problem
> there as well.
>
> I think a proper solution for this problem could be the recovery of
> running jobs [3]. With that, the TMs could continue executing their jobs
> even if there is no leader anymore. The new leader (which could be the
> same JM) would then recover the jobs from the TMs without having to
> restart them. This feature, however, still needs some more work to be
> finalized.
>
> [1] https://issues.apache.org/jira/browse/FLINK-6174
> [2] https://github.com/apache/flink/pull/3599
> [3] https://issues.apache.org/jira/browse/FLINK-5703
>
> Cheers,
> Till
>
> On Wed, Sep 27, 2017 at 10:58 AM, Gyula Fóra <gyula.f...@gmail.com> wrote:
>
> > Hi Till,
> > Thanks for the explanation. Yes, this sounds like a hard problem, but
> > it just seems wrong that whenever the ZK leader is restarted, all the
> > Flink jobs on a cluster fail.
> > This might be within the overall guarantees of the system, but it can
> > lead to cascading failures in larger deployments if every job recovers
> > at the same time.
> >
> > Maybe this is easier to avoid in certain setups, for instance in YARN,
> > where we only run a single JM at any given time anyway.
> >
> > Gyula
> >
> > Till Rohrmann <t...@data-artisans.com> wrote (on Wed, Sep 27, 2017 at
> > 10:49):
> >
> > > Hi Gyula,
> > >
> > > If we don't listen to the LeaderLatch#notLeader call but instead
> > > wait until we see (via the NodeCache) new leader information being
> > > written to the leader path in order to revoke leadership, then we
> > > potentially end up running the same job twice. Even though this can
> > > theoretically already happen, namely during the gap between the
> > > server and the client noticing the lost connection, this gap is
> > > practically non-existent. If we changed the behaviour, the gap could
> > > grow quite large, leading to all kinds of undesired side effects.
> > > E.g. if the sink operation is not idempotent, one might easily end
> > > up violating one's exactly-once processing guarantees.
> > >
> > > I'm not sure whether we want to sacrifice the guarantee of never
> > > having to deal with a split-brain scenario, but I can see the
> > > benefits of not immediately revoking the leadership if one can
> > > guarantee that there will never be two JMs competing for the
> > > leadership. In the general case, however, this is hard to guarantee.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Wed, Sep 27, 2017 at 9:22 AM, Gyula Fóra <gyula.f...@gmail.com>
> > > wrote:
> > >
> > > > On a second iteration, the whole problem seems to stem from the
> > > > fact that we revoke leadership from the JM when the notLeader
> > > > method is called, before waiting for a new leader to be elected.
> > > > Ideally we should wait until isLeader is called again to check who
> > > > the previous leader was, but I can see how this might lead to
> > > > split-brain scenarios if the previous leader loses its connection
> > > > to ZK while still maintaining its connections to the TMs.
> > > >
> > > > Gyula
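(For reference, the immediate revocation Gyula describes is easy to reproduce with a minimal Curator sketch, assuming a plain Apache Curator 2.12.0 dependency; the ensemble addresses and latch path below are made up. In Curator 2.x the LeaderLatch calls notLeader() as soon as the connection state changes to SUSPENDED, without waiting to see whether another contender actually takes over:)

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderLatchSuspendDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical three-node ensemble.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",
                    60000,  // session timeout (ms)
                    15000,  // connection timeout (ms)
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderLatch latch = new LeaderLatch(client, "/demo/leaderlatch");
            latch.addListener(new LeaderLatchListener() {
                @Override
                public void isLeader() {
                    System.out.println("Acquired leadership");
                }

                @Override
                public void notLeader() {
                    // Fires on SUSPENDED already, i.e. before any new
                    // leader has been elected.
                    System.out.println("Leadership revoked");
                }
            });
            latch.start();

            // Restart one ZK node and watch notLeader() fire.
            Thread.sleep(Long.MAX_VALUE);
        }
    }

(Restarting the ZK server the client is connected to is enough to trigger the notLeader() callback, which on the Flink side translates into the revoked leadership seen in the logs quoted further down.)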
> > > > Gyula Fóra <gyula.f...@gmail.com> wrote (on Tue, Sep 26, 2017 at
> > > > 18:34):
> > > >
> > > > > Hi,
> > > > >
> > > > > I did some experimenting and found something that is interesting
> > > > > and looks off.
> > > > >
> > > > > The only problem occurs when the ZK leader is restarted; it is
> > > > > not related to any retry/reconnect logic (and not affected by the
> > > > > timeout settings). I think the following is happening (based on
> > > > > the logs
> > > > > https://gist.github.com/gyfora/acb55e380d932ac10593fc1fd37930ab):
> > > > >
> > > > > 1. Connection is suspended, the notLeader method is called ->
> > > > > revokes leadership without checking anything, kills the jobs.
> > > > > 2. Reconnects; the isLeader and confirmLeaderSessionID methods
> > > > > are called (before nodeChanged) -> overwrites the old confirmed
> > > > > session id in ZK with the new one before checking it (making
> > > > > recovery impossible in nodeChanged).
> > > > >
> > > > > I am probably not completely aware of the subtleties of this
> > > > > problem, but it seems to me that we should not immediately revoke
> > > > > leadership and fail jobs on SUSPENDED, and it would also be nice
> > > > > if nodeChanged were called before confirmLeaderSessionID.
> > > > >
> > > > > Could someone with more experience please take a look as well?
> > > > >
> > > > > Thanks!
> > > > > Gyula
> > > > >
> > > > > Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, Sep 25, 2017 at
> > > > > 16:43):
> > > > >
> > > > > > Curator seems to auto-reconnect anyway; the problem might be
> > > > > > that a new leader is elected before the old JM can reconnect.
> > > > > > We will experiment with this tomorrow to see if increasing the
> > > > > > timeouts does any good.
> > > > > >
> > > > > > Gyula
> > > > > >
> > > > > > Gyula Fóra <gyula.f...@gmail.com> wrote (on Mon, Sep 25, 2017
> > > > > > at 15:39):
> > > > > >
> > > > > > > I will try to check what Stephan suggested and get back to
> > > > > > > you!
> > > > > > >
> > > > > > > Thanks for the feedback
> > > > > > >
> > > > > > > Gyula
> > > > > > >
> > > > > > > On Mon, Sep 25, 2017, 15:33 Stephan Ewen <se...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I think the question is whether the connection should be
> > > > > > > > lost at all in the case of a rolling ZK update.
> > > > > > > >
> > > > > > > > There should always be a quorum online, so Curator should
> > > > > > > > always be able to connect. So there is no need to revoke
> > > > > > > > leadership.
> > > > > > > >
> > > > > > > > @gyula - can you check whether there is an option in
> > > > > > > > Curator to reconnect to another quorum peer if one goes
> > > > > > > > down?
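(Regarding Stephan's question: the failover between quorum peers actually happens below Curator, in the ZooKeeper client itself, which iterates over all hosts in the connect string and reconnects to another server when the current one goes down; Curator only adds retry behaviour on top via its RetryPolicy. A minimal sketch for watching this during a rolling restart, with placeholder host names:)

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.state.ConnectionState;
    import org.apache.curator.framework.state.ConnectionStateListener;
    import org.apache.curator.retry.RetryNTimes;

    public class RollingRestartObserver {
        public static void main(String[] args) throws Exception {
            // All ensemble members go into the connect string; the ZK
            // client picks another one when its current server dies.
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181", new RetryNTimes(29, 1000));
            client.start();

            client.getConnectionStateListenable().addListener(
                    new ConnectionStateListener() {
                        @Override
                        public void stateChanged(CuratorFramework c,
                                                 ConnectionState state) {
                            // A rolling restart should show SUSPENDED
                            // followed by RECONNECTED for each bounced
                            // peer; the session itself can survive if the
                            // reconnect happens within the session timeout.
                            System.out.println("State change: " + state);
                        }
                    });

            Thread.sleep(Long.MAX_VALUE);
        }
    }

(So the reconnect option Stephan asks about exists by default; the issue discussed in this thread is that the SUSPENDED notification emitted during the gap is already enough for Flink to revoke leadership.)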
> > > > > > > > On Mon, Sep 25, 2017 at 2:10 PM, Till Rohrmann
> > > > > > > > <trohrm...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Hi Gyula,
> > > > > > > > >
> > > > > > > > > Flink internally uses the Curator LeaderLatch recipe for
> > > > > > > > > leader election. The LeaderLatch will revoke the
> > > > > > > > > leadership of a contender in case of a SUSPENDED or LOST
> > > > > > > > > connection to the ZooKeeper quorum. The assumption here
> > > > > > > > > is that if you cannot talk to ZooKeeper, then we can no
> > > > > > > > > longer be sure that you are the leader.
> > > > > > > > >
> > > > > > > > > Consequently, if you do a rolling update of your
> > > > > > > > > ZooKeeper cluster which causes client connections to be
> > > > > > > > > lost or suspended, it will trigger a restart of the Flink
> > > > > > > > > job upon reacquiring the leadership.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
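(For completeness: the timeouts Gyula mentions experimenting with are the ZooKeeper client settings in flink-conf.yaml. The keys and defaults below are taken from the Flink 1.3 high-availability documentation; the quorum addresses are placeholders. As Gyula notes above, raising them did not help here, since SUSPENDED is triggered by the disconnect itself rather than by a timeout expiring.)

    high-availability: zookeeper
    high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
    # Documented defaults, all in milliseconds except the retry count:
    high-availability.zookeeper.client.session-timeout: 60000
    high-availability.zookeeper.client.connection-timeout: 15000
    high-availability.zookeeper.client.retry-wait: 5000
    high-availability.zookeeper.client.max-retry-attempts: 3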
> > > > > > > > > On Fri, Sep 22, 2017 at 6:41 PM, Gyula Fóra
> > > > > > > > > <gyula.f...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > We are using 1.3.2
> > > > > > > > > >
> > > > > > > > > > Gyula
> > > > > > > > > >
> > > > > > > > > > On Fri, Sep 22, 2017, 17:13 Ted Yu
> > > > > > > > > > <yuzhih...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Which release are you using?
> > > > > > > > > > >
> > > > > > > > > > > Flink 1.3.2 uses Curator 2.12.0, which solves some
> > > > > > > > > > > leader election issues.
> > > > > > > > > > >
> > > > > > > > > > > Mind giving 1.3.2 a try?
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Sep 22, 2017 at 4:54 AM, Gyula Fóra
> > > > > > > > > > > <gyula.f...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi all,
> > > > > > > > > > > >
> > > > > > > > > > > > We have observed that when some nodes of the ZK
> > > > > > > > > > > > cluster are restarted (for a rolling restart), the
> > > > > > > > > > > > Flink streaming jobs fail (and restart).
> > > > > > > > > > > >
> > > > > > > > > > > > Log excerpt:
> > > > > > > > > > > >
> > > > > > > > > > > > 2017-09-22 12:54:41,426 INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x15cba6e1a239774, likely server has closed socket, closing socket connection and attempting reconnect
> > > > > > > > > > > > 2017-09-22 12:54:41,527 INFO org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> > > > > > > > > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://fl...@splat.sto.midasplayer.com:42118/user/jobmanager no longer participates in the leader election.
> > > > > > > > > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > > > > > > > > > > 2017-09-22 12:54:41,528 WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
> > > > > > > > > > > > 2017-09-22 12:54:41,530 WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
> > > > > > > > > > > > 2017-09-22 12:54:41,530 INFO org.apache.flink.yarn.YarnJobManager - JobManager akka://flink/user/jobmanager#-317276879 was revoked leadership.
> > > > > > > > > > > > 2017-09-22 12:54:41,532 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Job event.game.log (2ad7bbcc476bbe3735954fc414ffcb97) switched from state RUNNING to SUSPENDED.
> > > > > > > > > > > > java.lang.Exception: JobManager is no longer the leader.
> > > > > > > > > > > >
> > > > > > > > > > > > Is this the expected behaviour?
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Gyula
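(As a closing illustration: the NodeCache-based alternative Till describes above, revoking leadership only once different leader information has actually been written to the leader path, could be sketched as follows. This is a hand-written sketch of the idea, not Flink's implementation; the leader path and the session-ID comparison are assumptions.)

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.cache.ChildData;
    import org.apache.curator.framework.recipes.cache.NodeCache;
    import org.apache.curator.framework.recipes.cache.NodeCacheListener;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderPathWatcher {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181,zk2:2181,zk3:2181",   // placeholder ensemble
                    new ExponentialBackoffRetry(1000, 3));
            client.start();

            // Watch the leader path and react only when new leader
            // information has actually been written, instead of revoking
            // leadership eagerly in notLeader().
            final NodeCache cache = new NodeCache(client, "/flink/leader");
            cache.getListenable().addListener(new NodeCacheListener() {
                @Override
                public void nodeChanged() {
                    ChildData data = cache.getCurrentData();
                    if (data != null) {
                        // A JM could compare the confirmed leader session
                        // ID stored here with its own and step down only
                        // on a mismatch.
                        System.out.println("Leader info changed: "
                                + new String(data.getData()));
                    }
                }
            });
            cache.start();

            Thread.sleep(Long.MAX_VALUE);
        }
    }

(As Till points out, the price of this approach is the potentially large window in which two JMs both believe they are the leader, which is why the corresponding PR [2] has not been merged.)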