Hi Greg,

Unfortunately the environment information [1] is not logged. Can you set the log level for all Flink packages to DEBUG?
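In case it helps, here is a minimal sketch of that change. It assumes the cluster uses Flink's default log4j 1.x setup and reads conf/log4j.properties; if you log through logback instead, the equivalent logger entry would go into logback.xml.

```properties
# Assumption: Flink's default log4j 1.x configuration (conf/log4j.properties).
# Raise only the Flink packages to DEBUG; the root logger keeps its current level.
log4j.logger.org.apache.flink=DEBUG
```

The change should be in place before the cluster is started, so that the jobmanager and taskmanagers pick it up.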
Do you install Flink yourself on EMR, or do you use the pre-installed one? Can you show us the command with which you start the cluster/submit the job?

I do not know if it is related but I found these warnings in your second log file:

2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number
2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number

Best,
Gary

[1] https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281

On Fri, Aug 31, 2018 at 9:18 PM, Greg Finch <finchgreg...@gmail.com> wrote:

> Well ... that didn't take long. The next time I tried, I got the Akka
> timeout again. Attached are the logs from the last attempt. They're very
> similar to the other logs I sent.
>
> On Fri, Aug 31, 2018 at 2:04 PM Greg Finch <finchgreg...@gmail.com> wrote:
>
>> Thanks Gary. Attached is the jobmanager log. You are correct that this
>> is running on YARN. I changed web.timeout as you suggested - that seems to
>> be working the few times I tested it. This problem comes and goes though -
>> sometimes it starts before it times out. I'll keep the web.timeout setting
>> and reply again if the problem comes up again. Thanks again for your quick
>> response!
>>
>> On Fri, Aug 31, 2018 at 1:38 PM Gary Yao <g...@data-artisans.com> wrote:
>>
>>> Hi Greg,
>>>
>>> Can you describe the steps to reproduce the problem, or can you attach the
>>> full jobmanager logs? Because JobExecutionResultHandler appears in your log, I
>>> assume that you are starting a job cluster on YARN. Without seeing the
>>> complete logs, I cannot be sure what exactly happens. For now, you can try
>>> setting the config option web.timeout to a higher value.
>>>
>>> Best,
>>> Gary
>>>
>>> On Fri, Aug 31, 2018 at 8:01 PM, Greg Finch <finchgreg...@gmail.com> wrote:
>>>
>>>> I'm having a problem with akka timeout when starting my cluster. The
>>>> error is "Ask timed out after 10000 ms.". I have changed the
>>>> akka.ask.timeout config setting to be 300000 ms, but it still times out and
>>>> fails after 10 seconds. I confirmed that the config is properly set by
>>>> both checking the Job Manager configuration tab (it shows 300000 ms) as
>>>> well as logging the output of AkkaUtils.getTimeout(configuration) which
>>>> also shows 300000ms. It seems something is not honoring that configuration
>>>> value.
>>>>
>>>> I did find a different thread that discussed the fact that the
>>>> LocalStreamEnvironment will not honor this setting, but that is not my
>>>> case. I am running on a cluster (AWS EMR) using the regular
>>>> StreamExecutionEnvironment. This is Flink 1.5.2.
>>>>
>>>> Any ideas?
>>>>
>>>> ~~~~~
>>>>
>>>> 2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-139-66.ec2.internal:8041
>>>> 2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-136-25.ec2.internal:8041
>>>> 2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler - Implementation error: Unhandled exception.
>>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>>>>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>>     at java.lang.Thread.run(Thread.java:748)
>>>> 2018-08-31 17:38:41 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Waiting for application to be successfully unregistered.
>>>> 2018-08-31 17:38:41 INFO o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Interrupted while waiting for queue
>>>> java.lang.InterruptedException: null
>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>>>>     at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>>>>     at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>>>>     at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
>>>> 2018-08-31 17:38:42 WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]