[jira] [Commented] (FLINK-14038) ExecutionGraph deploy failed due to akka timeout

Till Rohrmann (Jira) Thu, 19 Sep 2019 01:18:19 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933159#comment-16933159
 ]


Till Rohrmann commented on FLINK-14038:
---------------------------------------

I see your point [~liupengcheng] and I think you are right that it makes it 
easier to debug Flink problems. It looks as if GC logging is not prohibitively 
expensive and, hence, could be activated. On the other hand we also have the 
{{MemoryLogger}} which can be enabled via {{taskmanager.debug.memory.log}}. It 
uses the {{MemoryMXBean}} to report the current memory statistics. Do you know 
what the JVM GC logger logs additionally?
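
For comparison, here is a minimal sketch of the kind of information the standard {{java.lang.management}} beans expose. It is not Flink's {{MemoryLogger}} implementation; the class name {{MemoryStatsSketch}} and its output format are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical illustration only; this is NOT Flink's MemoryLogger.
public class MemoryStatsSketch {

    // Formats the current heap and non-heap usage reported by the MemoryMXBean.
    public static String memoryUsage() {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();
        MemoryUsage nonHeap = memoryBean.getNonHeapMemoryUsage();
        return String.format(
                "Heap: used=%d committed=%d max=%d; NonHeap: used=%d committed=%d",
                heap.getUsed(), heap.getCommitted(), heap.getMax(),
                nonHeap.getUsed(), nonHeap.getCommitted());
    }

    // Formats the cumulative collection count and time of every registered collector.
    public static String gcStats() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(String.format("[%s: count=%d, timeMs=%d] ",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(memoryUsage());
        System.out.println(gcStats());
    }
}
```

These standard beans report point-in-time usage plus cumulative collection counts and times, so a periodic logger built on them only sees aggregate values between two samples.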

For {{-XX:+HeapDumpOnOutOfMemoryError}}, it should not add any additional 
overhead as far as I can tell. So apart from the disk space occupied by heap 
dumps there should not be a big problem with it.

What about the following proposal: We make both things configurable, enable 
the heap dump by default, and leave GC logging disabled by default, as we do 
with the {{MemoryLogger}}. If we do this, then we should also make sure that 
heap dumping and GC logging work for all deployment options (Yarn, Mesos, 
Standalone).
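
One possible way to verify that the JVM option actually reaches the process in each deployment mode is to ask the running JVM itself. The sketch below assumes a HotSpot-based JVM, since {{com.sun.management.HotSpotDiagnosticMXBean}} is not available elsewhere. The class name {{HeapDumpFlagCheck}} is hypothetical and not part of Flink.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, assuming a HotSpot JVM; not part of Flink.
public class HeapDumpFlagCheck {

    // Returns the current value of a HotSpot VM option, e.g. "true" or "false".
    // Throws IllegalArgumentException if the option does not exist.
    public static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        return bean.getVMOption(name).getValue();
    }

    public static void main(String[] args) {
        // Prints "true" if the JVM was started with -XX:+HeapDumpOnOutOfMemoryError.
        System.out.println("HeapDumpOnOutOfMemoryError=" + vmOption("HeapDumpOnOutOfMemoryError"));
        // An empty value means the dump goes to the JVM's working directory.
        System.out.println("HeapDumpPath=" + vmOption("HeapDumpPath"));
    }
}
```

Running such a check inside a container started via Yarn, Mesos, or a standalone script would show whether the configured options were passed through to the JVM.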

> ExecutionGraph deploy failed due to akka timeout
> ------------------------------------------------
>
>                 Key: FLINK-14038
>                 URL: https://issues.apache.org/jira/browse/FLINK-14038
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.9.0
>         Environment: Flink on yarn
> Flink 1.9.0
>            Reporter: liupengcheng
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When launching the flink application, the following error was reported, I 
> downloaded the operator logs, but still have no clue. The operator logs 
> provided no useful information and was cancelled directly.
> JobManager logs:
> {code:java}
> java.lang.IllegalStateException: Update task on TaskManager 
> container_e860_1567429198842_571077_01_000006 @ zjy-hadoop-prc-st320.bj 
> (dataPort=50990) failed due to:
>       at 
> org.apache.flink.runtime.executiongraph.Execution.lambda$sendUpdatePartitionInfoRpcCall$14(Execution.java:1395)
>       at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>       at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>       at 
> java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>       at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: 
> akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]]
>  after [10000 ms]. Message of type 
> [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason 
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
>       at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>       at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>       at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>       at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>       at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>       at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>       at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
>       at akka.dispatch.OnComplete.internal(Future.scala:263)
>       at akka.dispatch.OnComplete.internal(Future.scala:261)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>       at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
>       at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>       at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>       at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
>       at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>       at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>       at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>       at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>       at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>       at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]]
>  after [10000 ms]. Message of type 
> [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason 
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
>       at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>       at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>       at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>       ... 9 more
> {code}
> operator logs:
> {code:java}
> 2019-09-09 18:34:06,867 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Received task 
> Partition (4/5).
> 2019-09-09 18:34:06,868 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched 
> from CREATED to DEPLOYING.
> 2019-09-09 18:34:06,870 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Creating FileSystem stream leak safety net for task Partition 
> (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING]
> 2019-09-09 18:34:06,870 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Loading JAR files for task Partition (4/5) 
> (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:06,871 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Registering task at network: Partition (4/5) 
> (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:07,075 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched 
> from DEPLOYING to RUNNING.
> 2019-09-09 18:34:07,255 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Received task 
> Sort-Partition (4/5).
> 2019-09-09 18:34:07,258 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) 
> switched from CREATED to DEPLOYING.
> 2019-09-09 18:34:07,261 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Creating FileSystem stream leak safety net for task 
> Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING]
> 2019-09-09 18:34:07,261 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Loading JAR files for task Sort-Partition (4/5) 
> (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,263 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Registering task at network: Sort-Partition (4/5) 
> (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,303 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) 
> switched from DEPLOYING to RUNNING.
> 2019-09-09 18:34:54,625 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Attempting to cancel task DataSource (at 
> org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390)
>  (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) 
> (8c6262b3f802f82d60a1999f2e040a68).
> 2019-09-09 18:34:54,806 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - DataSource (at 
> org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390)
>  (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) 
> (8c6262b3f802f82d60a1999f2e040a68) switched from RUNNING to CANCELING.
> 2019-09-09 18:34:54,806 INFO  org.apache.flink.runtime.taskmanager.Task       
>               - Triggering cancellation of task code DataSource (at 
> org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390)
>  (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) 
> (8c6262b3f802f82d60a1999f2e040a68).
> {code}
> I checked the network and it's good. so maybe there are some problems with 
> the taskManager? 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
