[ https://issues.apache.org/jira/browse/FLINK-14038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933289#comment-16933289 ]
liupengcheng commented on FLINK-14038:
--------------------------------------

[~trohrmann] This is an example of the JVM GC logger output:
{code:java}
2019-09-18T13:41:37.503+0800: 119.744: [Full GC (Ergonomics) [PSYoungGen: 40960K->39766K(81920K)] [ParOldGen: 245683K->245683K(245760K)] 286643K->285450K(327680K), [Metaspace: 56311K->56311K(1099776K)], 0.0963178 secs] [Times: user=0.22 sys=0.00, real=0.10 secs]
2019-09-18T13:41:37.599+0800: 119.841: Total time for which application threads were stopped: 0.0967768 seconds, Stopping threads took: 0.0001084 seconds
{code}
Compared to MemoryLogger, it provides some extra information:
# the size change of each memory area on every GC
# the GC cause
# finer granularity: one entry per GC

I agree with your proposal. I will try to work on this and make it work for all deployment options.

> ExecutionGraph deploy failed due to akka timeout
> ------------------------------------------------
>
> Key: FLINK-14038
> URL: https://issues.apache.org/jira/browse/FLINK-14038
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Task
> Affects Versions: 1.9.0
> Environment: Flink on yarn
> Flink 1.9.0
> Reporter: liupengcheng
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When launching the Flink application, the following error was reported. I
> downloaded the operator logs, but still have no clue. The operator logs
> provided no useful information and was cancelled directly.
> JobManager logs:
> {code:java}
> java.lang.IllegalStateException: Update task on TaskManager container_e860_1567429198842_571077_01_000006 @ zjy-hadoop-prc-st320.bj (dataPort=50990) failed due to:
>     at org.apache.flink.runtime.executiongraph.Execution.lambda$sendUpdatePartitionInfoRpcCall$14(Execution.java:1395)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>     at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>     at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:871)
>     at akka.dispatch.OnComplete.internal(Future.scala:263)
>     at akka.dispatch.OnComplete.internal(Future.scala:261)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:644)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://fl...@zjy-hadoop-prc-st320.bj:62051/user/taskmanager_0#-171547157]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
>     ... 9 more
> {code}
> operator logs:
> {code:java}
> 2019-09-09 18:34:06,867 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task Partition (4/5).
> 2019-09-09 18:34:06,868 INFO org.apache.flink.runtime.taskmanager.Task - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from CREATED to DEPLOYING.
> 2019-09-09 18:34:06,870 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING]
> 2019-09-09 18:34:06,870 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:06,871 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) [DEPLOYING].
> 2019-09-09 18:34:07,075 INFO org.apache.flink.runtime.taskmanager.Task - Partition (4/5) (97d7df744b93f4ee46750bbd6a0113e8) switched from DEPLOYING to RUNNING.
> 2019-09-09 18:34:07,255 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task Sort-Partition (4/5).
> 2019-09-09 18:34:07,258 INFO org.apache.flink.runtime.taskmanager.Task - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from CREATED to DEPLOYING.
> 2019-09-09 18:34:07,261 INFO org.apache.flink.runtime.taskmanager.Task - Creating FileSystem stream leak safety net for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING]
> 2019-09-09 18:34:07,261 INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,263 INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) [DEPLOYING].
> 2019-09-09 18:34:07,303 INFO org.apache.flink.runtime.taskmanager.Task - Sort-Partition (4/5) (a721ca202bc8bf2e2aa4b41b1e4a1091) switched from DEPLOYING to RUNNING.
> 2019-09-09 18:34:54,625 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
> 2019-09-09 18:34:54,806 INFO org.apache.flink.runtime.taskmanager.Task - DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68) switched from RUNNING to CANCELING.
> 2019-09-09 18:34:54,806 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code DataSource (at org.apache.flink.api.scala.ExecutionEnvironment.createInput(ExecutionEnvironment.scala:390) (org.apache.flink.api.scala.hadoop.mapreduce.HadoopInpu) (5/5) (8c6262b3f802f82d60a1999f2e040a68).
> {code}
> I checked the network and it's good, so maybe there are some problems with
> the TaskManager?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
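
To make the three points in the comment concrete (per-area size changes, the GC cause, one entry per GC), here is a minimal sketch of pulling those fields out of a GC log line. It is not part of Flink or of the proposed change; the function names parse_gc_line and occupancy and the regular expressions are hypothetical, written only against the single "Full GC" line quoted in the comment, and other collectors or JVM versions may log a different layout.

```python
import re

# Sample line copied from the GC log excerpt quoted in the comment.
SAMPLE = (
    "2019-09-18T13:41:37.503+0800: 119.744: [Full GC (Ergonomics) "
    "[PSYoungGen: 40960K->39766K(81920K)] "
    "[ParOldGen: 245683K->245683K(245760K)] 286643K->285450K(327680K), "
    "[Metaspace: 56311K->56311K(1099776K)], 0.0963178 secs] "
    "[Times: user=0.22 sys=0.00, real=0.10 secs]"
)

# Hypothetical patterns, assumed to fit only the line format shown above.
CAUSE_RE = re.compile(r"\[(Full GC|GC) \(([^)]+)\)")
AREA_RE = re.compile(r"\[(\w+): (\d+)K->(\d+)K\((\d+)K\)\]")
PAUSE_RE = re.compile(r", ([\d.]+) secs\]")


def parse_gc_line(line):
    """Return GC kind, cause, pause and per-area sizes, or None if no match."""
    cause = CAUSE_RE.search(line)
    pause = PAUSE_RE.search(line)
    if cause is None or pause is None:
        return None
    areas = {
        name: {
            "before_kb": int(before),
            "after_kb": int(after),
            "capacity_kb": int(capacity),
        }
        for name, before, after, capacity in AREA_RE.findall(line)
    }
    return {
        "kind": cause.group(1),
        "cause": cause.group(2),
        "pause_secs": float(pause.group(1)),
        "areas": areas,
    }


def occupancy(event, area):
    """Fraction of the area's capacity still in use after the collection."""
    sizes = event["areas"][area]
    return sizes["after_kb"] / sizes["capacity_kb"]


if __name__ == "__main__":
    event = parse_gc_line(SAMPLE)
    print(event["kind"], event["cause"], event["pause_secs"])
    for name in event["areas"]:
        print(name, round(occupancy(event, name), 4))
```

On the quoted line, ParOldGen stays at 245683K of 245760K after the Full GC, which is the kind of per-area detail an aggregated memory summary would not show for an individual collection.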