Hi Flink users,

We recently migrated to Flink 1.11 and now see exceptions like the following:
```
2020-12-15 12:41:01,199 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source: USER_MATERIALIZED_EVENT_SIGNAL-user_context-event -> USER_MATERIALIZED_EVENT_SIGNAL-user_context-event-as_nrtgtuple (21/60) (711d1d319691a4b80e30fe6ab7dfab5b) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@50abf386.
java.lang.Exception: Job leader for job id 47b1531f79ffe3b86bc5910f6071e40c lost leadership.
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1852) ~[nrtg-1.11_deploy.jar:?]
    at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_212-ga]
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1851) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[nrtg-1.11_deploy.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [nrtg-1.11_deploy.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [nrtg-1.11_deploy.jar:?]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [nrtg-1.11_deploy.jar:?]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:539) [nrtg-1.11_deploy.jar:?]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227) [nrtg-1.11_deploy.jar:?]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612) [nrtg-1.11_deploy.jar:?]
    at akka.actor.ActorCell.invoke(ActorCell.scala:581) [nrtg-1.11_deploy.jar:?]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) [nrtg-1.11_deploy.jar:?]
    at akka.dispatch.Mailbox.run(Mailbox.scala:229) [nrtg-1.11_deploy.jar:?]
```

```
2020-12-15 01:01:39,531 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - USER_MATERIALIZED_EVENT_SIGNAL.user_context.SINK-stream_joiner -> USER_MATERIALIZED_EVENT_SIGNAL-user_context-SINK-SINKS.realpin (260/360) (0c1f4495088ec9452c597f46a88a2c8e) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@2362b2fd.
org.apache.flink.util.FlinkException: ResourceManager leader changed to new address null
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195) ~[nrtg-1.11_deploy.jar:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[nrtg-1.11_deploy.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) [nrtg-1.11_deploy.jar:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [nrtg-1.11_deploy.jar:?]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [nrtg-1.11_deploy.jar:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [nrtg-1.11_deploy.jar:?]
    at akka.actor.Actor$class.aroundReceive(Actor.scala:539) [nrtg-1.11_deploy.jar:?]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227) [nrtg-1.11_deploy.jar:?]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612) [nrtg-1.11_deploy.jar:?]
    at akka.actor.ActorCell.invoke(ActorCell.scala:581) [nrtg-1.11_deploy.jar:?]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) [nrtg-1.11_deploy.jar:?]
    at akka.dispatch.Mailbox.run(Mailbox.scala:229) [nrtg-1.11_deploy.jar:?]
```

This happens a few times per week. It looks like a single TaskManager wrongly
concludes that the JobManager is lost, which triggers a full restart of the
whole job. Does anyone know how to resolve these errors? Thanks!

Best,
Lu
