Hi, Xintong

Thanks for replying and your suggestion. I did check the ZK side but there
is nothing interesting. The error message actually shows that only one TM
thought JM lost leadership while others ran fine. Also, this happened only
after we migrated from 1.9 to 1.11. Is it possible this is introduced by
1.11?

Best
Lu

On Wed, Dec 16, 2020 at 5:56 PM Xintong Song <tonysong...@gmail.com> wrote:

> Hi Lu,
>
> I assume you are using ZooKeeper as the HA service?
>
> A common cause of unexpected leadership lost is the instability of HA
> service. E.g., if ZK does not receive heartbeat from Flink RM for a
> certain period of time, it will revoke the leadership and notify
> other components. You can look into the ZooKeeper logs checking why RM's
> leadership is revoked.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Thu, Dec 17, 2020 at 8:42 AM Lu Niu <qqib...@gmail.com> wrote:
>
>> Hi, Flink users
>>
>> Recently we migrated to flink 1.11 and see exceptions like:
>> ```
>> 2020-12-15 12:41:01,199 INFO
>>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Source:
>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-event ->
>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-event-as_nrtgtuple (21/60)
>> (711d1d319691a4b80e30fe6ab7dfab5b) switched from RUNNING to FAILED on
>> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@50abf386.
>> java.lang.Exception: Job leader for job id
>> 47b1531f79ffe3b86bc5910f6071e40c lost leadership.
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$null$2(TaskExecutor.java:1852)
>> ~[nrtg-1.11_deploy.jar:?]
>> at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_212-ga]
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobLeaderListenerImpl.lambda$jobManagerLostLeadership$3(TaskExecutor.java:1851)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>> ~[nrtg-1.11_deploy.jar:?]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:539)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.ActorCell.invoke(ActorCell.scala:581)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.dispatch.Mailbox.run(Mailbox.scala:229) [nrtg-1.11_deploy.jar:?]
>> ```
>>
>> ```
>> 2020-12-15 01:01:39,531 INFO
>>  org.apache.flink.runtime.executiongraph.ExecutionGraph       [] -
>> USER_MATERIALIZED_EVENT_SIGNAL.user_context.SINK-stream_joiner ->
>> USER_MATERIALIZED_EVENT_SIGNAL-user_context-SINK-SINKS.realpin (260/360)
>> (0c1f4495088ec9452c597f46a88a2c8e) switched from RUNNING to FAILED on
>> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@2362b2fd.
>> org.apache.flink.util.FlinkException: ResourceManager leader changed to
>> new address null
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.notifyOfNewResourceManagerLeader(TaskExecutor.java:1093)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$800(TaskExecutor.java:173)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor$ResourceManagerLeaderListener.lambda$notifyLeaderAddress$0(TaskExecutor.java:1816)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
>> ~[nrtg-1.11_deploy.jar:?]
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>> ~[nrtg-1.11_deploy.jar:?]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [nrtg-1.11_deploy.jar:?]
>> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:539)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:227)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.actor.ActorCell.invoke(ActorCell.scala:581)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268)
>> [nrtg-1.11_deploy.jar:?]
>> at akka.dispatch.Mailbox.run(Mailbox.scala:229) [nrtg-1.11_deploy.jar:?]
>> ```
>>
>> This happens a few times per week. It seems like one Task Manager wrongly
>> thought JobMananger is lost and triggers a full restart of the whole job.
>> Does anyone know how to resolve such errors? Thanks!
>>
>> Best
>> Lu
>>
>

Reply via email to