Correction: this is the Akka timeout exception:
java.util.concurrent.CompletionException:
org.apache.flink.client.deployment.application.ApplicationExecutionException:
Could not execute application.
at
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
~[?:1.8.0_282]
at
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
~[?:1.8.0_282]
at
java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957)
~[?:1.8.0_282]
at
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940)
~[?:1.8.0_282]
at
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
~[?:1.8.0_282]
at
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
~[?:1.8.0_282]
at
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:257)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$1(ApplicationDispatcherBootstrap.java:212)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[?:1.8.0_282]
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[?:1.8.0_282]
at
org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:159)
[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[flink-dist_2.11-1.12.2.jar:1.12.2]
Caused by:
org.apache.flink.client.deployment.application.ApplicationExecutionException:
Could not execute application.
... 11 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: The main
method caused an error: java.util.concurrent.TimeoutException: Invocation of
public default java.util.concurrent.CompletableFuture
org.apache.flink.runtime.webmonitor.RestfulGateway.requestJobStatus(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time)
timed out.
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:366)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:219)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:242)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
... 10 more
Caused by: java.util.concurrent.ExecutionException:
java.util.concurrent.TimeoutException: Invocation of public default
java.util.concurrent.CompletableFuture
org.apache.flink.runtime.webmonitor.RestfulGateway.requestJobStatus(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time)
timed out.
at
java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
~[?:1.8.0_282]
at
java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
~[?:1.8.0_282]
at
org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:123)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:80)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.playgrounds.ops.clickcount.ClickEventCount.main(ClickEventCount.java:112)
~[?:?]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
~[?:1.8.0_282]
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
~[?:1.8.0_282]
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
~[?:1.8.0_282]
at java.lang.reflect.Method.invoke(Method.java:498)
~[?:1.8.0_282]
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:349)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:219)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:242)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
... 10 more
Caused by: java.util.concurrent.TimeoutException: Invocation of public default
java.util.concurrent.CompletableFuture
org.apache.flink.runtime.webmonitor.RestfulGateway.requestJobStatus(org.apache.flink.api.common.JobID,org.apache.flink.api.common.time.Time)
timed out.
at
org.apache.flink.runtime.rpc.akka.$Proxy36.requestJobStatus(Unknown Source)
~[?:1.12.2]
at
org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$getJobResult$0(JobStatusPollingUtils.java:57)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.deployment.application.JobStatusPollingUtils.pollJobResultAsync(JobStatusPollingUtils.java:87)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
org.apache.flink.client.deployment.application.JobStatusPollingUtils.lambda$null$3(JobStatusPollingUtils.java:107)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
... 9 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka://flink/user/rpc/dispatcher_1#1531007562]] after [60000 ms].
Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A
typical reason for `AskTimeoutException` is that the recipient actor didn't
send a reply.
at
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
~[flink-dist_2.11-1.12.2.jar:1.12.2]
at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_282]
From: Chenyu Zheng <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, August 3, 2021 at 2:04 PM
To: "[email protected]" <[email protected]>
Subject: Several timeout issues with Flink 1.12.2
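Hello developers,
I am trying to deploy Flink 1.12.2 on Kubernetes using the native
application deployment mode. During testing, however, I found that once the job parallelism is increased, various timeouts occur from time to time. According to our monitoring, neither the CPU nor the memory usage of the JM and TM containers reaches the amount allocated by k8s.
After increasing akka.ask.timeout to 1 minute and heartbeat.timeout to 2 minutes, the timeouts became less frequent.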
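For reference, the two changes described above would look roughly like this in flink-conf.yaml. This is only a sketch: the values mirror the ones mentioned in this message, and the heartbeat.interval line is shown commented out with its default purely for context.

```yaml
# Timeout for Akka "ask" RPC calls (Flink 1.12 default: 10 s).
akka.ask.timeout: 60 s

# Heartbeat timeout in milliseconds (Flink 1.12 default: 50000).
heartbeat.timeout: 120000

# Heartbeat interval in milliseconds; left at its default here.
# heartbeat.interval: 10000
```

In native Kubernetes application mode the same keys can also be passed as -Dkey=value options on the flink run-application command line.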
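My questions are: with a larger parallelism (for example 128), are these Akka timeouts and heartbeat timeouts normal? If not, how should I troubleshoot the root cause? Also, what negative side effects does simply increasing the timeouts of the various components have?
The attachment contains the JobManager log for the Akka timeout. I will send the TaskManager heartbeat timeout log later.
Thanks!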