Gary, thanks a lot. Increasing web.timeout seems to help. I have now run into a different issue with loading the checkpoint; I will follow up on that separately.
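For anyone who finds this thread with the same AskTimeoutException, here is a minimal sketch of the change, assuming a Flink 1.7 setup in which web.timeout is given in milliseconds and defaults to 10000 (which would match the "after [10000 ms]" in the trace quoted below). The value 60000 is purely illustrative and not a recommendation; a job restoring several TB of state may need considerably more.

```yaml
# flink-conf.yaml
# Assumption: the value is interpreted as milliseconds in this Flink version.
# 60000 (one minute) is an illustrative value only; tune it to how long
# job submission with the external checkpoint actually takes.
web.timeout: 60000
```

The file is read when the cluster processes start, so the JobManager most likely needs a restart before the new timeout applies.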
On Thu, Jan 10, 2019 at 12:25 PM Gary Yao <g...@da-platform.com> wrote:

> Hi all,
>
> I think increasing the default value of the config option web.timeout [1] is
> what you are looking for.
>
> Best,
> Gary
>
> [1] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/RestHandlerConfiguration.java#L76
> [2] https://github.com/apache/flink/blob/a07ce7f6c88dc7d0c0d2ba55a0ab3f2283bf247c/flink-core/src/main/java/org/apache/flink/configuration/WebOptions.java#L177
>
> On Thu, Jan 10, 2019 at 9:19 PM Aaron Levin <aaronle...@stripe.com> wrote:
>
>> We are also experiencing this! Thanks for speaking up! It's relieving to
>> know we're not alone :)
>>
>> We tried adding `akka.ask.timeout: 1 min` to our `flink-conf.yaml`, which
>> did not seem to have any effect. I tried adding every other related akka,
>> rpc, etc. timeout and still continue to encounter these errors. I believe
>> they may also impact our ability to deploy (as we get a timeout when
>> submitting the job programmatically). I'd love to see a solution to this if
>> one exists!
>>
>> Best,
>>
>> Aaron Levin
>>
>> On Thu, Jan 10, 2019 at 2:58 PM Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> We are trying out Flink 1.7.0. We always get this exception when
>>> submitting a job with an external checkpoint via REST. Job parallelism is
>>> 1,600. State size is probably in the range of 1-5 TB. The job is actually
>>> started; only the REST API returns this failure.
>>>
>>> If we submit the job without an external checkpoint, everything works
>>> fine.
>>>
>>> Has anyone else seen this problem with 1.7? Appreciate your help!
>>>
>>> Thanks,
>>> Steven
>>>
>>> org.apache.flink.runtime.rest.handler.RestHandlerException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-641142843]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>     at org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.lambda$handleRequest$4(JarRunHandler.java:114)
>>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
>>>     at akka.dispatch.OnComplete.internal(Future.scala:258)
>>>     at akka.dispatch.OnComplete.internal(Future.scala:256)
>>>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>>>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>>>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>>     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>>>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>>>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>     at java.lang.Thread.run(Thread.java:748)
>>> Caused by: java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-641142843]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>     at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
>>>     at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
>>>     at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
>>>     at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
>>>     ... 21 more
>>> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-641142843]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>>>     ... 9 more