[ 
https://issues.apache.org/jira/browse/FLINK-11143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17156363#comment-17156363
 ] 

Steven Zhen Wu edited comment on FLINK-11143 at 7/12/20, 6:39 PM:
------------------------------------------------------------------

[~trohrmann] I am seeing a similar problem *when trying unaligned checkpoint 
with 1.11.0*. The Flink job actually started fine. We didn't see this 
AskTimeoutException thrown during job submission in without unaligned 
checkpoint (1.10 or 1.11).

Some more context about the app
 * a large-state stream join app (a few TBs)
 * parallelism 1,440
 * number of containers: 180
 * Cores per container: 12
 * TM_TASK_SLOTS: 8
 * akka.ask.timeout: 120 s
 * heartbeat.timeout: 120000
 * web.timeout: 60000 (also tried larger values like 300,000 or 600,000 without 
any difference)

I will send you the log files (with DEBUG level) in an email offline. Thanks a 
lot for your help in advance!
{code:java}
\"errors\":[\"Internal server error.\",\"<Exception on server 
side:\\norg.apache.flink.util.FlinkRuntimeException: Could not execute 
application.\\n\\tat 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\\n\\tat
 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\\n\\tat
 
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)\\n\\tat
 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\\n\\tat
 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\\n\\tat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)\\n\\tat 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\\n\\tat
 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\\n\\tat
 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat
 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat
 java.lang.Thread.run(Thread.java:748)\\nCaused by: 
org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: Failed to execute job 
'personalization-streaming-impressions-alt'.\\n\\tat 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\\n\\tat
 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\\n\\tat
 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\\n\\tat
 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\\n\\t...
 10 more\\nCaused by: org.apache.flink.util.FlinkException: Failed to execute 
job 'personalization-streaming-impressions-alt'.\\n\\tat 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)\\n\\tat
 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\\n\\tat
 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\\n\\tat
 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)\\n\\tat
 
com.netflix.spaas.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)\\n\\tat
 
com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)\\n\\tat
 
com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)\\n\\tat
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\\n\\tat 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\\n\\tat
 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat
 java.lang.reflect.Method.invoke(Method.java:498)\\n\\tat 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)\\n\\t...
 13 more\\nCaused by: java.util.concurrent.TimeoutException: Invocation of 
public abstract java.util.concurrent.CompletableFuture 
org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time)
 timed out.\\n\\tat com.sun.proxy.$Proxy113.submitJob(Unknown Source)\\n\\tat 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)\\n\\tat
 
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)\\n\\tat
 
java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)\\n\\tat
 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)\\n\\tat
 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)\\n\\tat
 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)\\n\\tat
 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)\\n\\t...
 24 more\\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms]. 
Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A 
typical reason for `AskTimeoutException` is that the recipient actor didn't 
send a reply.\\n\\tat 
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat 
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)\\n\\tat
 akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)\\n\\tat 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)\\n\\tat
 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)\\n\\tat
 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\\n\\t...
 1 more\\n\
 {code}
 


was (Author: stevenz3wu):
[~trohrmann] I am seeing a similar problem when trying unaligned checkpoint 
with 1.11.0. The Flink job actually started fine. We didn't see this 
AskTimeoutException thrown during job submission in without unaligned 
checkpoint (1.10 or 1.11).

Some more context about the app
 * a large-state stream join app (a few TBs)
 * parallelism 1,440
 * number of containers: 180
 * Cores per container: 12
 * TM_TASK_SLOTS: 8
 * akka.ask.timeout: 120 s
 * heartbeat.timeout: 120000
 * web.timeout: 60000 (also tried larger values like 300,000 or 600,000 without 
any difference)

I will send you the log files (with DEBUG level) in an email offline. Thanks a 
lot for your help in advance!
{code:java}
\"errors\":[\"Internal server error.\",\"<Exception on server 
side:\\norg.apache.flink.util.FlinkRuntimeException: Could not execute 
application.\\n\\tat 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:81)\\n\\tat
 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.run(DetachedApplicationRunner.java:67)\\n\\tat
 
org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$0(JarRunHandler.java:99)\\n\\tat
 
java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)\\n\\tat
 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)\\n\\tat 
java.util.concurrent.FutureTask.run(FutureTask.java:266)\\n\\tat 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)\\n\\tat
 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)\\n\\tat
 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\\n\\tat
 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\\n\\tat
 java.lang.Thread.run(Thread.java:748)\\nCaused by: 
org.apache.flink.client.program.ProgramInvocationException: The main method 
caused an error: Failed to execute job 
'personalization-streaming-impressions-alt'.\\n\\tat 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:302)\\n\\tat
 
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:198)\\n\\tat
 
org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:149)\\n\\tat
 
org.apache.flink.client.deployment.application.DetachedApplicationRunner.tryExecuteJobs(DetachedApplicationRunner.java:78)\\n\\t...
 10 more\\nCaused by: org.apache.flink.util.FlinkException: Failed to execute 
job 'personalization-streaming-impressions-alt'.\\n\\tat 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1823)\\n\\tat
 
org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:128)\\n\\tat
 
org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)\\n\\tat
 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1699)\\n\\tat
 
com.netflix.spaas.application.SpaasBaseApplication.execute(SpaasBaseApplication.java:54)\\n\\tat
 
com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain$.main(ImpressionsJobMain.scala:12)\\n\\tat
 
com.netflix.dea.paa.streaming.impressions.ImpressionsJobMain.main(ImpressionsJobMain.scala)\\n\\tat
 sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\\n\\tat 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\\n\\tat
 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\\n\\tat
 java.lang.reflect.Method.invoke(Method.java:498)\\n\\tat 
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:288)\\n\\t...
 13 more\\nCaused by: java.util.concurrent.TimeoutException: Invocation of 
public abstract java.util.concurrent.CompletableFuture 
org.apache.flink.runtime.dispatcher.DispatcherGateway.submitJob(org.apache.flink.runtime.jobgraph.JobGraph,org.apache.flink.api.common.time.Time)
 timed out.\\n\\tat com.sun.proxy.$Proxy113.submitJob(Unknown Source)\\n\\tat 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.lambda$submitJob$4(EmbeddedExecutor.java:158)\\n\\tat
 
java.util.concurrent.CompletableFuture.uniComposeStage(CompletableFuture.java:995)\\n\\tat
 
java.util.concurrent.CompletableFuture.thenCompose(CompletableFuture.java:2137)\\n\\tat
 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitJob(EmbeddedExecutor.java:158)\\n\\tat
 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.submitAndGetJobClientFuture(EmbeddedExecutor.java:119)\\n\\tat
 
org.apache.flink.client.deployment.application.executors.EmbeddedExecutor.execute(EmbeddedExecutor.java:98)\\n\\tat
 
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1812)\\n\\t...
 24 more\\nCaused by: akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka://flink/user/rpc/dispatcher_1#-283770831]] after [60000 ms]. 
Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A 
typical reason for `AskTimeoutException` is that the recipient actor didn't 
send a reply.\\n\\tat 
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat 
akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)\\n\\tat 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)\\n\\tat
 akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)\\n\\tat 
scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)\\n\\tat
 
scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)\\n\\tat
 
scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)\\n\\tat
 
akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)\\n\\t...
 1 more\\n\
 {code}
 

> AskTimeoutException is thrown during job submission and completion
> ------------------------------------------------------------------
>
>                 Key: FLINK-11143
>                 URL: https://issues.apache.org/jira/browse/FLINK-11143
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.6.2, 1.10.0
>            Reporter: Alex Vinnik
>            Priority: Critical
>         Attachments: flink-job-timeline.PNG
>
>
> For more details please see the thread
> [http://mail-archives.apache.org/mod_mbox/flink-user/201812.mbox/%3cc2fb26f9-1410-4333-80f4-34807481b...@gmail.com%3E]
> On submission 
> 2018-12-12 02:28:31 ERROR JobsOverviewHandler:92 - Implementation error: 
> Unhandled exception.
>  akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka://flink/user/dispatcher#225683351|#225683351]] after [10000 ms]. 
> Sender[null] sent message of type 
> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>  at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>  at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>  at java.lang.Thread.run(Thread.java:748)
>  
> On completion
>  
> {"errors":["Internal server error.","<Exception on server 
> side:\njava.util.concurrent.CompletionException: 
> akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms]. 
> Sender[null] sent message of type 
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)\nCaused by: 
> akka.pattern.AskTimeoutException: Ask timed out on 
> [Actor[akka://flink/user/dispatcher#105638574]] after [10000 ms]. 
> Sender[null] sent message of type 
> \"org.apache.flink.runtime.rpc.messages.LocalFencedMessage\".
> at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)\n\t...
>  9 more\n\nEnd of exception on server side>"]}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to