Hi Arnaud,

It seems that the TaskExecutor terminated exceptionally. I think you need to check the logs of container_e38_1604477334666_0960_01_000004 to figure out why it crashed or shut down.
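As a minimal sketch of how those container logs might be pulled, the snippet below derives the application ID from the container ID and builds a `yarn logs` command line. It assumes that YARN log aggregation is enabled and that the container ID follows the usual `container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>` layout. The helper names `application_id` and `yarn_logs_command` are hypothetical, and depending on the Hadoop version `-nodeAddress` may also be needed alongside `-containerId`.

```python
import re
import shlex

# Assumed layout: container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$")


def application_id(container_id: str) -> str:
    """Derive the YARN application ID that owns the given container (hypothetical helper)."""
    match = CONTAINER_ID.match(container_id)
    if match is None:
        raise ValueError(f"unrecognized container ID: {container_id!r}")
    cluster_ts, app_seq = match.groups()
    return f"application_{cluster_ts}_{app_seq}"


def yarn_logs_command(container_id: str) -> str:
    """Build a `yarn logs` command line restricted to one container (hypothetical helper)."""
    args = [
        "yarn", "logs",
        "-applicationId", application_id(container_id),
        "-containerId", container_id,
    ]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(yarn_logs_command("container_e38_1604477334666_0960_01_000004"))
    # prints: yarn logs -applicationId application_1604477334666_0960 -containerId container_e38_1604477334666_0960_01_000004
```

If the TaskExecutor log itself ends without an error, the NodeManager log on the host that ran the container may be the next place to look, since it records why YARN stopped a container.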
Best,
Yang

LINZ, Arnaud <al...@bouyguestelecom.fr> wrote on Mon, Nov 16, 2020 at 7:11 PM:

> Hello,
>
> I'm running Flink 1.10 on a YARN cluster. I have a streaming application
> that, when under heavy load, fails from time to time with only this error
> message in the entire YARN log:
>
> (...)
> 2020-11-15 16:18:42,202 WARN  org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Received late message for now expired checkpoint attempt 63 from task 4cbc940112a596db54568b24f9209aac of job 1e1717d19bd8ea296314077e42e1c7e5 at container_e38_1604477334666_0960_01_000004 @ xxx (dataPort=33099).
> 2020-11-15 16:18:55,043 INFO  org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e38_1604477334666_0960_01_000004 because: The TaskExecutor is shutting down.
> 2020-11-15 16:18:55,087 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Map (7/15) (c8e92cacddcd4e41f51a2433d07d2153) switched from RUNNING to FAILED.
> org.apache.flink.util.FlinkException: The TaskExecutor is shutting down.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:359)
>     at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:218)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:509)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:175)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-11-15 16:18:55,092 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - Calculating tasks to restart to recover the failed task 2f6467d98899e64a4721f0a7b6a059a8_6.
> 2020-11-15 16:18:55,101 INFO  org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionStrategy - 230 tasks should be restarted to recover the failed task 2f6467d98899e64a4721f0a7b6a059a8_6.
> (...)
>
> What could be the cause of this failure? Why is there no other error
> message?
>
> I've tried to increase the value of heartbeat.timeout, thinking that maybe
> it was due to a slow responding mapper, but it did not solve the issue.
>
> Best regards,
> Arnaud