Hi Guys, sorry for the late reply. we found out the issue is not related to flink, there is a connection issue with zookeeper. we deploy our whole infra on k8s and using aws spot ec2, once the pod get restarted or lost spot instances we lost the log files... so sorry for not being able to share the log files.
Sharing some of our experiences: Job leader lost leadership issues can be caused due to different reasons, most properly due to zookeeper and this failure does not cause job failure as far as we have seen. And the Checkpoint timeout issue can also be due to zookeeper issue, because the last successful CK meta info is stored in ZK, if ZK has issue flink will not be able to restore from last CK.. On Tue, 7 Sept 2021 at 10:00, Matthias Pohl <matth...@ververica.com> wrote: > Hi Xiangyu, > thanks for reaching out to the community. Could you share the entire > TaskManager and JobManager logs with us? That might help investigating > what's going on. > > Best, > Matthias > > On Fri, Sep 3, 2021 at 10:07 AM Xiangyu Su <xian...@smaato.com> wrote: > >> Hi Yun, >> Thanks alot. >> I am running a test, and facing the "Job Leader lost leadership..." >> issue, and also the checkpointing timeout at the same time,, not sure >> whether those 2 things related to each other. >> regarding your question: >> 1. GC looks ok. >> 2. seems like once the "Job Leader lost leadership..." happens flink job >> can not successfully get restarted. >> and e.g here is some logs from one job failure: >> --------------- >> 2021-09-02 20:41:11,345 WARN org.apache.flink.runtime.taskmanager.Task >> [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18 >> (9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with >> failure cause: org.apache.flink.util.FlinkException: Disconnect from >> JobManager responsible for ec6fd88643747aafac06ee906e421a96. >> at >> org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660) >> at >> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181) >> at >> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189) >> at java.util.Optional.ifPresent(Optional.java:159) >> at >> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187) >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440) >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208) >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> at akka.actor.Actor$class.aroundReceive(Actor.scala:517) >> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) >> at akka.actor.ActorCell.invoke(ActorCell.scala:561) >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) >> at akka.dispatch.Mailbox.run(Mailbox.scala:225) >> at akka.dispatch.Mailbox.exec(Mailbox.scala:235) >> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> at >> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> at >> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> Caused by: java.lang.Exception: Job leader for job id >> ec6fd88643747aafac06ee906e421a96 lost leadership. >> ... 24 more >> >> ----------- >> 2021-09-02 20:47:22,388 ERROR >> org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] - >> Authentication failed >> 2021-09-02 20:47:22,388 INFO >> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - >> Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/ >> 10.168.175.10:2181 >> 2021-09-02 20:47:22,388 WARN >> org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - >> SASL configuration failed: javax.security.auth.login.LoginException: No >> JAAS configuration section named 'Client' was found in specified JAAS >> configuration file: '/tmp/jaas-4480663428736118963.conf'. Will continue >> connection to Zookeeper server without SASL authentication, if Zookeeper >> server allows it. >> at >> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.Mailbox.exec(Mailbox.scala:235) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.Mailbox.run(Mailbox.scala:225) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.ActorCell.invoke(ActorCell.scala:561) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.Actor$class.aroundReceive(Actor.scala:517) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302] >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> ~[?:1.8.0_302] >> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?] >> at >> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution >> attempt deb6e9dd535069eb66e2139fde5b77cd was not found. >> 2021-09-02 20:47:21,870 INFO >> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Cannot >> find task to fail for execution deb6e9dd535069eb66e2139fde5b77cd with >> exception: >> at >> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.Mailbox.exec(Mailbox.scala:235) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.Mailbox.run(Mailbox.scala:225) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.ActorCell.invoke(ActorCell.scala:561) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.actor.Actor$class.aroundReceive(Actor.scala:517) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) >> [flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302] >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> ~[?:1.8.0_302] >> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?] >> at >> org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441) >> ~[flink-dist_2.11-1.13.2.jar:1.13.2] >> >> Thanks for your support. >> Best Regards, >> >> On Thu, 2 Sept 2021 at 16:43, Yun Gao <yungao...@aliyun.com> wrote: >> >>> Hi Xiangyu, >>> >>> There might be different reasons for the "Job Leader... lost leadership" >>> problem. Do you see the erros >>> in the TM log ? If so, the root cause might be that the connection >>> between the TM and ZK is lost or >>> timeout. Have you checked the GC status of the TM side ? If the GC is >>> ok, could you provide more detailed >>> exception stack ? >>> >>> Best, >>> Yun >>> >>> >>> ------------------Original Mail ------------------ >>> *Sender:*Xiangyu Su <xian...@smaato.com> >>> *Send Date:*Wed Sep 1 15:31:03 2021 >>> *Recipients:*user <user@flink.apache.org> >>> *Subject:*FLINK-14316 happens on version 1.13.2 >>> >>>> Hello Everyone, >>>> We upgrade flink to 1.13.2, and we were facing randomly the "Job leader >>>> ... lost leadership" error, the job keep restarting and failing... >>>> It behaviours like this ticket >>>> https://issues.apache.org/jira/browse/FLINK-14316 >>>> >>>> Did anybody had same issue or any suggestions? >>>> >>>> Best Regards, >>>> >>>> >>>> -- >>>> Xiangyu Su >>>> Java Developer >>>> xian...@smaato.com >>>> >>>> Smaato Inc. >>>> San Francisco - New York - Hamburg - Singapore >>>> www.smaato.com >>>> >>>> Germany: >>>> >>>> Barcastraße 5 >>>> >>>> 22087 Hamburg >>>> >>>> Germany >>>> M 0049(176)43330282 >>>> >>>> The information contained in this communication may be CONFIDENTIAL and >>>> is intended only for the use of the recipient(s) named above. If you are >>>> not the intended recipient, you are hereby notified that any dissemination, >>>> distribution, or copying of this communication, or any of its contents, is >>>> strictly prohibited. If you have received this communication in error, >>>> please notify the sender and delete/destroy the original message and any >>>> copy of it from your computer or paper files. >>>> >>> >> >> -- >> Xiangyu Su >> Java Developer >> xian...@smaato.com >> >> Smaato Inc. >> San Francisco - New York - Hamburg - Singapore >> www.smaato.com >> >> Germany: >> >> Barcastraße 5 >> >> 22087 Hamburg >> >> Germany >> M 0049(176)43330282 >> >> The information contained in this communication may be CONFIDENTIAL and >> is intended only for the use of the recipient(s) named above. If you are >> not the intended recipient, you are hereby notified that any dissemination, >> distribution, or copying of this communication, or any of its contents, is >> strictly prohibited. If you have received this communication in error, >> please notify the sender and delete/destroy the original message and any >> copy of it from your computer or paper files. >> > > -- Xiangyu Su Java Developer xian...@smaato.com Smaato Inc. San Francisco - New York - Hamburg - Singapore www.smaato.com Germany: Barcastraße 5 22087 Hamburg Germany M 0049(176)43330282 The information contained in this communication may be CONFIDENTIAL and is intended only for the use of the recipient(s) named above. If you are not the intended recipient, you are hereby notified that any dissemination, distribution, or copying of this communication, or any of its contents, is strictly prohibited. If you have received this communication in error, please notify the sender and delete/destroy the original message and any copy of it from your computer or paper files.