I'm having a problem with an Akka timeout when starting my cluster. The error is "Ask timed out after 10000 ms." I have changed the akka.ask.timeout config setting to 300000 ms, but it still times out and fails after 10 seconds. I confirmed that the config is properly set both by checking the Job Manager configuration tab (it shows 300000 ms) and by logging the output of AkkaUtils.getTimeout(configuration), which also shows 300000 ms. It seems something is not honoring that configuration value.
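For context, here is a minimal sketch of how this setting would look if it is applied through flink-conf.yaml. That is an assumption: on EMR it may instead be supplied through a configuration classification. The value mirrors the one described above.

```yaml
# flink-conf.yaml (assumed location of the setting; not copied from the cluster)
# Raise the Akka ask timeout to 300000 ms, as described above.
akka.ask.timeout: 300000 ms
```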
I did find a different thread that discussed the fact that the LocalStreamEnvironment will not honor this setting, but that is not my case. I am running on a cluster (AWS EMR) using the regular StreamExecutionEnvironment. This is Flink 1.5.2. Any ideas?

~~~~~
2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-139-66.ec2.internal:8041
2018-08-31 17:37:55 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Received new token for : ip-10-213-136-25.ec2.internal:8041
2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler - Implementation error: Unhandled exception.
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
    at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
    at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
    at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
    at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
    at java.lang.Thread.run(Thread.java:748)
2018-08-31 17:38:41 INFO org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl - Waiting for application to be successfully unregistered.
2018-08-31 17:38:41 INFO o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl - Interrupted while waiting for queue
java.lang.InterruptedException: null
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
    at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
    at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
2018-08-31 17:38:42 WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]