Hi Yun,
Thanks alot.
I am running a test, and facing the "Job Leader lost leadership..." issue,
and also the checkpointing timeout at the same time,, not sure whether
those 2 things related to each other.
regarding your question:
1. GC looks ok.
2. seems like once the "Job Leader lost leadership..." happens flink job
can not successfully get restarted.
and e.g here is some logs from one job failure:
---------------
2021-09-02 20:41:11,345 WARN  org.apache.flink.runtime.taskmanager.Task
               [] - KeyedProcess -> Sink: StatsdMetricsSink (40/48)#18
(9ab62cc148569e449fdb31b521ec976c) switched from RUNNING to FAILED with
failure cause: org.apache.flink.util.FlinkException: Disconnect from
JobManager responsible for ec6fd88643747aafac06ee906e421a96.
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1660)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1500(TaskExecutor.java:181)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:2189)
at java.util.Optional.ifPresent(Optional.java:159)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:2187)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: java.lang.Exception: Job leader for job id
ec6fd88643747aafac06ee906e421a96 lost leadership.
... 24 more

-----------
2021-09-02 20:47:22,388 ERROR
org.apache.flink.shaded.curator4.org.apache.curator.ConnectionState [] -
Authentication failed
2021-09-02 20:47:22,388 INFO
 org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
Opening socket connection to server dpl-zookeeper-0.dpl-zookeeper/
10.168.175.10:2181
2021-09-02 20:47:22,388 WARN
 org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] -
SASL configuration failed: javax.security.auth.login.LoginException: No
JAAS configuration section named 'Client' was found in specified JAAS
configuration file: '/tmp/jaas-4480663428736118963.conf'. Will continue
connection to Zookeeper server without SASL authentication, if Zookeeper
server allows it.
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_302]
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
org.apache.flink.runtime.jobmaster.ExecutionGraphException: The execution
attempt deb6e9dd535069eb66e2139fde5b77cd was not found.
2021-09-02 20:47:21,870 INFO
 org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Cannot
find task to fail for execution deb6e9dd535069eb66e2139fde5b77cd with
exception:
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.13.2.jar:1.13.2]
    at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.invoke(ActorCell.scala:561)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:212)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:305)
~[flink-dist_2.11-1.13.2.jar:1.13.2]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_302]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_302]
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source) ~[?:?]
at
org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:441)
~[flink-dist_2.11-1.13.2.jar:1.13.2]

Thanks for your support.
Best Regards,

On Thu, 2 Sept 2021 at 16:43, Yun Gao <yungao...@aliyun.com> wrote:

> Hi Xiangyu,
>
> There might be different reasons for the "Job Leader... lost leadership"
> problem. Do you see the erros
> in the TM log ? If so, the root cause might be that the connection between
> the TM and ZK is lost or
> timeout. Have you checked the GC status of the TM side ? If the GC is ok,
> could you provide more detailed
> exception stack ?
>
> Best,
> Yun
>
>
> ------------------Original Mail ------------------
> *Sender:*Xiangyu Su <xian...@smaato.com>
> *Send Date:*Wed Sep 1 15:31:03 2021
> *Recipients:*user <user@flink.apache.org>
> *Subject:*FLINK-14316 happens on version 1.13.2
>
>> Hello Everyone,
>> We upgrade flink to 1.13.2, and we were facing randomly the "Job leader
>> ... lost leadership" error, the job keep restarting and failing...
>> It behaviours like this ticket
>> https://issues.apache.org/jira/browse/FLINK-14316
>>
>> Did anybody had same issue or any suggestions?
>>
>> Best Regards,
>>
>>
>> --
>> Xiangyu Su
>> Java Developer
>> xian...@smaato.com
>>
>> Smaato Inc.
>> San Francisco - New York - Hamburg - Singapore
>> www.smaato.com
>>
>> Germany:
>>
>> Barcastraße 5
>>
>> 22087 Hamburg
>>
>> Germany
>> M 0049(176)43330282
>>
>> The information contained in this communication may be CONFIDENTIAL and
>> is intended only for the use of the recipient(s) named above. If you are
>> not the intended recipient, you are hereby notified that any dissemination,
>> distribution, or copying of this communication, or any of its contents, is
>> strictly prohibited. If you have received this communication in error,
>> please notify the sender and delete/destroy the original message and any
>> copy of it from your computer or paper files.
>>
>

-- 
Xiangyu Su
Java Developer
xian...@smaato.com

Smaato Inc.
San Francisco - New York - Hamburg - Singapore
www.smaato.com

Germany:

Barcastraße 5

22087 Hamburg

Germany
M 0049(176)43330282

The information contained in this communication may be CONFIDENTIAL and is
intended only for the use of the recipient(s) named above. If you are not
the intended recipient, you are hereby notified that any dissemination,
distribution, or copying of this communication, or any of its contents, is
strictly prohibited. If you have received this communication in error,
please notify the sender and delete/destroy the original message and any
copy of it from your computer or paper files.

Reply via email to