Hi Arnaud,
Could you share the log of the task executor that shut down?
Also, could you check the GC log of that task executor?
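In case GC logging is not enabled yet, here is a minimal sketch of how it might be turned on in flink-conf.yaml. It assumes a Java 8 JVM (the flags differ on Java 9 and later), and the log path is only a placeholder to adapt to your setup:

```yaml
# flink-conf.yaml (sketch only; assumes Java 8 GC logging flags)
# env.java.opts.taskmanager passes extra JVM options to the TaskManager processes.
# /tmp/taskmanager-gc.log is a hypothetical path; choose a location that is
# still available after the YARN container has exited.
env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/taskmanager-gc.log
```

Any long GC pauses logged shortly before the task executor shut down would be worth reporting back.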
Best,
Guowei


On Mon, Nov 16, 2020 at 8:57 PM LINZ, Arnaud <al...@bouyguestelecom.fr>
wrote:

> (reposted with proper subject line -- sorry for the copy/paste)
> -----Original message-----
> Hello,
>
> I'm running Flink 1.10 on a YARN cluster. I have a streaming application
> that, under heavy load, fails from time to time. This is the only error
> message in the entire YARN log:
>
> (...)
> 2020-11-15 16:18:42,202 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 63 from task 4cbc940112a596db54568b24f9209aac of job 1e1717d19bd8ea296314077e42e1c7e5 at container_e38_1604477334666_0960_01_000004 @ xxx (dataPort=33099).
> 2020-11-15 16:18:55,043 INFO  org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e38_1604477334666_0960_01_000004 because: The TaskExecutor is shutting down.
> 2020-11-15 16:18:55,087 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Map (7/15) (c8e92cacddcd4e41f51a2433d07d2153) switched from RUNNING to FAILED.
> org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.
>         at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
>         at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
>         at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>         at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>         at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>         at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>         at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>         at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>         at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>         at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>         at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>         at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>         at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>         at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>         at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>         at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-11-15 16:18:55,092 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 2f6467d98899e64a4721f0a7b6a059a8_6.
> 2020-11-15 16:18:55,101 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 230 tasks should be restarted to recover the failed task 2f6467d98899e64a4721f0a7b6a059a8_6.
> (...)
>
> What could be the cause of this failure? Why is there no other error
> message?
>
> I've tried increasing the value of heartbeat.timeout, thinking the failure
> might be due to a slow-responding mapper, but that did not solve the issue.
>
> Best regards,
> Arnaud
>
