Hi Gary,

It turns out the configuration warning you mentioned was the key. akka.ask.timeout requires a duration with a unit, but web.timeout expects a plain long (milliseconds). So the change I made earlier was never applied, because `300s` could not be read as a long. Since changing it to `web.timeout: 300000`, I have not been able to reproduce the error; everything starts successfully every time. I have left debug logging turned on for now. If it happens again in the next couple of days, I will send details along with the debug logs.
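
For reference, the two settings discussed in this thread would look roughly like this in flink-conf.yaml. This is a sketch based only on the values mentioned in this thread, not a complete configuration:

```yaml
# akka.ask.timeout is read as a duration, so the value carries a unit.
akka.ask.timeout: 300000 ms

# web.timeout is read as a long (milliseconds), so it must be a bare number.
# Writing 300s here triggers the "cannot evaluate value 300s as a long
# integer number" warning and, judging by the 10000 ms in the error,
# leaves the default in effect.
web.timeout: 300000
```
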
Thanks again for your help!

Greg

On Fri, Aug 31, 2018 at 3:21 PM Gary Yao <g...@data-artisans.com> wrote:

> Hi Greg,
>
> Unfortunately the environment information [1] is not logged. Can you set the
> log level for all Flink packages to DEBUG?
>
> Do you install Flink yourself on EMR, or do you use the pre-installed one?
> Can you show us the command with which you start the cluster/submit the job?
>
> I do not know if it is related but I found these warnings in your second
> log file:
>
> 2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number
> 2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number
>
> Best,
> Gary
>
> [1] https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281
>
> On Fri, Aug 31, 2018 at 9:18 PM, Greg Finch <finchgreg...@gmail.com> wrote:
>
>> Well ... that didn't take long. The next time I tried, I got the Akka
>> timeout again. Attached are the logs from the last attempt. They're very
>> similar to the other logs I sent.
>>
>> On Fri, Aug 31, 2018 at 2:04 PM Greg Finch <finchgreg...@gmail.com> wrote:
>>
>>> Thanks Gary. Attached is the jobmanager log. You are correct that this
>>> is running on YARN. I changed web.timeout as you suggested - that seems to
>>> be working the few times I tested it. This problem comes and goes though -
>>> sometimes it starts before it times out. I'll keep the web.timeout setting
>>> and reply again if the problem comes up again. Thanks again for your quick
>>> response!
>>>
>>> On Fri, Aug 31, 2018 at 1:38 PM Gary Yao <g...@data-artisans.com> wrote:
>>>
>>>> Hi Greg,
>>>>
>>>> Can you describe the steps to reproduce the problem, or can you attach the
>>>> full jobmanager logs?
>>>> Because JobExecutionResultHandler appears in your log, I
>>>> assume that you are starting a job cluster on YARN. Without seeing the
>>>> complete logs, I cannot be sure what exactly happens. For now, you can try
>>>> setting the config option web.timeout to a higher value.
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> On Fri, Aug 31, 2018 at 8:01 PM, Greg Finch <finchgreg...@gmail.com> wrote:
>>>>
>>>>> I'm having a problem with akka timeout when starting my cluster. The
>>>>> error is "Ask timed out after 10000 ms.". I have changed the
>>>>> akka.ask.timeout config setting to be 300000 ms, but it still times out and
>>>>> fails after 10 seconds. I confirmed that the config is properly set by
>>>>> both checking the Job Manager configuration tab (it shows 300000 ms) as
>>>>> well as logging the output of AkkaUtils.getTimeout(configuration) which also
>>>>> shows 300000ms. It seems something is not honoring that configuration
>>>>> value.
>>>>>
>>>>> I did find a different thread that discussed the fact that the
>>>>> LocalStreamEnvironment will not honor this setting, but that is not my
>>>>> case. I am running on a cluster (AWS EMR) using the regular
>>>>> StreamExecutionEnvironment. This is Flink 1.5.2.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> ~~~~~
>>>>>
>>>>> 2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-139-66.ec2.internal:8041
>>>>> 2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-136-25.ec2.internal:8041
>>>>> 2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler - Implementation error: Unhandled exception.
>>>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>>>>>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>> 2018-08-31 17:38:41 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Waiting for application to be successfully unregistered.
>>>>> 2018-08-31 17:38:41 INFO o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Interrupted while waiting for queue
>>>>> java.lang.InterruptedException: null
>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>>>>>     at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>>>>>     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
>>>>> 2018-08-31 17:38:42 WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]