Hi Amran,

Did you monitor or have a look at your memory metrics(e.g. full GC) of your
TM.

There is a similar thread that a user reported the same question due to
full GC, the link is here[1].

Best,
Vino

[1]:
http://mail-archives.apache.org/mod_mbox/flink-user/201709.mbox/%3cfa4068d3-2696-4f29-857d-4a7741c3e...@greghogan.com%3E

amran dean <adfs54545...@gmail.com> 于2019年11月22日周五 上午7:15写道:

> Hello,
> I am frequently seeing this error in my jobmanager logs:
>
> 2019-11-18 09:07:08,863 INFO
>  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source:
> kdi_g2 -> (Sink: s_id_metadata, Sink: t_id_metadata) (23/24)
> (5e10e88814fe4ab0be5f554ec59bd93d) switched from RUNNING to FAILED.
> java.lang.Exception: Container released on a *lost* node
> at
> org.apache.flink.yarn.YarnResourceManager.lambda$onContainersCompleted$0(YarnResourceManager.java:370)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at akka.actor.Actor.aroundReceive(Actor.scala:517)
> at akka.actor.Actor.aroundReceive$(Actor.scala:515)
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> at akka.actor.ActorCell.invoke(ActorCell.scala:561)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Yarn nodemanager logs don't show anything out of the ordinary when such
> exceptions occur, so I am suspecting it is a timeout of some sort due to
> network problems. (YARN is not killing the container ). It's difficult to
> diagnose because Flink doesn't give any reason for losing the node. Is it a
> timeout? OOM?
>
>  Is there a corresponding configuration I should tune? What is the
> recommended course of action?
>

Reply via email to