[ https://issues.apache.org/jira/browse/FLINK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926291#comment-16926291 ]
Zhu Zhu commented on FLINK-14038:
---------------------------------

Hi [~liupengcheng], could you check the TM GC log to see whether the TM was stuck in GC when this error happened? GC pauses are the most common cause of late responses from a TM.
You can also increase the config "akka.ask.timeout" (10 s by default) to make the job more robust against late responses.

> ExecutionGraph deploy failed due to akka timeout
> ------------------------------------------------
>
>                 Key: FLINK-14038
>                 URL: https://issues.apache.org/jira/browse/FLINK-14038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.9.0
>         Environment: Flink on yarn
> Flink 1.9.0
>            Reporter: liupengcheng
>            Priority: Major
>
> When launching the Flink application, the following error was reported. I
> downloaded the operator logs but still have no clue. The operator logs
> provided no useful information, and the task was cancelled directly.
> JobManager logs:
> {code:java}
> java.lang.IllegalStateException: Update task on TaskManager container_e860_1567429198842_571077_01_000006 @ zjy-hadoop-prc-st320.bj (dataPort=50990) failed due to:
>     at org.apache.flink.runtime.executiongraph.Execution.lambda$sendUpdatePartitionInfoRpcCall$14(Execution.java:1395)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
>     at akka.dispatch.OnComplete.internal(Future.scala:263)
>     at akka.dispatch.OnComplete.internal(Future.scala:261)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>     ... 9 more
> {code}
> operator logs:
> {code:java}
> 2019-09-09 18:34:06,867 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task Partition (4/5).
> 2019-09-09 18:34:06,868 INFO org.apache.flink.runtime.taskmanager.Task - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from CREATED to DEPLOYING.
> 2019-09-09 18:34:06,870 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING]
> 2019-09-09 18:34:06,870 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:06,871 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:07,075 INFO org.apache.flink.runtime.taskmanager.Task - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from DEPLOYING to RUNNING.
> 2019-09-09 18:34:07,255 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task Sort-Partition (4/5).
> 2019-09-09 18:34:07,258 INFO org.apache.flink.runtime.taskmanager.Task - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from CREATED to DEPLOYING.
> 2019-09-09 18:34:07,261 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING]
> 2019-09-09 18:34:07,261 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,263 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,303 INFO org.apache.flink.runtime.taskmanager.Task - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from DEPLOYING to RUNNING.
> 2019-09-09 18:34:54,625 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
> 2019-09-09 18:34:54,806 INFO org.apache.flink.runtime.taskmanager.Task - DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68) switched from RUNNING to CANCELING.
> 2019-09-09 18:34:54,806 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
> {code}
> I checked the network and it is fine, so maybe there is a problem with
> the TaskManager?
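
For reference, the two suggestions in the comment (raising "akka.ask.timeout" and checking the TM GC log) might be expressed in flink-conf.yaml roughly as follows. This is only a sketch under assumptions: the 60 s value is an arbitrary example, the GC flags assume JDK 8 TaskManagers, and the log path is hypothetical. None of these values come from the issue itself.

{code}
# Raise the RPC ask timeout from its 10 s default (60 s is an arbitrary example value).
akka.ask.timeout: 60 s

# Assumption: TaskManagers run on JDK 8; the GC log path below is hypothetical.
env.java.opts.taskmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/flink-tm-gc.log
{code}

Long pauses in the resulting GC log around the time of the AskTimeoutException would support the GC explanation. If no such pauses show up, raising the timeout only hides whatever is actually delaying the TM's reply.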