Hi Yun,

Thanks a lot. I am running a test and hitting the "Job Leader lost leadership..." issue together with checkpoint timeouts. I am not sure whether the two are related to each other.

Regarding your questions:
1. GC looks OK.
2. It seems that once "Job Leader lost leadership..." happens, the Flink job cannot restart successfully.

Here are some logs from one job failure:

---------------
2021-09-02 20:41:11,345 WARN org.apache.flink.runtime.taskmanager.Task [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18 (9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for ec6fd88643747aafac06ee906e421a96.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189)
    at java.util.Optional.ifPresent(Optional.java:159)
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
    at akka.actor.ActorCell.invoke(ActorCell.scala:561)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
    at akka.dispatch.Mailbox.run(Mailbox.scala:225)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Job leader for job id ec6fd88643747aafac06ee906e421a96 lost leadership.
    ... 24 more
-----------
2021-09-02 20:47:22,388 ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] - Authentication failed
2021-09-02 20:47:22,388 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/10.168.175.10:2181
2021-09-02 20:47:22,388 WARN org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - SASL configuration failed: javax.security.auth.login.LoginException: No JAAS configuration section named 'Client' was found in specified JAAS configuration file: '/tmp/jaas-4480663428736118963.conf'. Will continue connection to Zookeeper server without SASL authentication, if Zookeeper server allows it.
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_302]
    at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
    at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution attempt deb6e9dd535069eb66e2139fde5b77cd was not found.
2021-09-02 20:47:21,870 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Cannot find task to fail for execution deb6e9dd535069eb66e2139fde5b77cd with exception:
    at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:235) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.Mailbox.run(Mailbox.scala:225) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.ActorCell.invoke(ActorCell.scala:561) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:517) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305) ~[flink-dist_2.11-1.13.2.jar:1.13.2]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_302]
    at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
    at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441) ~[flink-dist_2.11-1.13.2.jar:1.13.2]

Thanks for your support.

Best Regards,

On Thu, 2 Sept 2021 at 16:43, Yun Gao <yungao...@aliyun.com> wrote:

> Hi Xiangyu,
>
> There might be different reasons for the "Job Leader... lost leadership"
> problem. Do you see the errors in the TM log? If so, the root cause might
> be that the connection between the TM and ZK is lost or timed out. Have
> you checked the GC status on the TM side? If the GC is OK, could you
> provide a more detailed exception stack?
>
> Best,
> Yun
>
>
> ------------------Original Mail ------------------
> *Sender:*Xiangyu Su <xian...@smaato.com>
> *Send Date:*Wed Sep 1 15:31:03 2021
> *Recipients:*user <user@flink.apache.org>
> *Subject:*FLINK-14316 happens on version 1.13.2
>
>> Hello Everyone,
>> We upgraded Flink to 1.13.2 and are randomly hitting the "Job leader
>> ... lost leadership" error; the job keeps restarting and failing...
>> It behaves like this ticket:
>> https://issues.apache.org/jira/browse/FLINK-14316
>>
>> Has anybody had the same issue, or does anyone have any suggestions?
>>
>> Best Regards,
>>
>> --
>> Xiangyu Su
>> Java Developer
>> xian...@smaato.com
>>
>> Smaato Inc.
>> San Francisco - New York - Hamburg - Singapore
>> www.smaato.com
>>
>> Germany:
>> Barcastraße 5
>> 22087 Hamburg
>> Germany
>> M 0049(176)43330282

--
Xiangyu Su
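
P.S. To check whether the JobManager disconnects and the ZooKeeper connection problems line up in time, something like the following could be run over a TM log. This is only a rough sketch: the marker strings are copied from the log lines quoted in this mail, while the function names (`extract_events`, `zk_events_within`) and the 10-minute window are made up for illustration and are not part of Flink.

```python
import re
from datetime import datetime, timedelta

# Timestamp prefix as it appears in the quoted logs, e.g. "2021-09-02 20:47:22,388 WARN ...".
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+(.*)$")

# Substrings copied from the log excerpts in this mail (assumption: they are
# good enough to classify lines; extend the table as needed).
MARKERS = {
    "jm_disconnect": "Disconnect from JobManager",
    "zk_auth_failed": "Authentication failed",
    "zk_sasl_failed": "SASL configuration failed",
    "zk_reconnect": "Opening socket connection to server",
}


def extract_events(lines):
    """Return sorted (timestamp, level, kind) tuples for lines matching a marker."""
    events = []
    for line in lines:
        m = TS_RE.match(line)
        if not m:
            continue  # stack frames and continuation lines carry no timestamp
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
        for kind, needle in MARKERS.items():
            if needle in m.group(3):
                events.append((ts, m.group(2), kind))
                break
    return sorted(events)


def zk_events_within(events, anchor_kind="jm_disconnect", window=timedelta(minutes=10)):
    """Map each anchor event's timestamp to the ZooKeeper events at most `window` away."""
    anchors = [e for e in events if e[2] == anchor_kind]
    zk = [e for e in events if e[2].startswith("zk_")]
    return {a[0]: [z for z in zk if abs(z[0] - a[0]) <= window] for a in anchors}


if __name__ == "__main__":
    # Shortened copies of lines from this mail, used as sample input.
    sample = [
        "2021-09-02 20:41:11,345 WARN org.apache.flink.runtime.taskmanager.Task [] - "
        "switched from RUNNING to FAILED with failure cause: "
        "org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for ...",
        "    at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)",
        "2021-09-02 20:47:22,388 ERROR org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] - Authentication failed",
    ]
    for ts, nearby in zk_events_within(extract_events(sample)).items():
        print(ts, [kind for _, _, kind in nearby])
```

If the ZooKeeper events consistently show up shortly before or after each disconnect, that would point towards the TM-to-ZK connection rather than GC; if they do not, the two symptoms are probably unrelated.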