Hi Gary,

Turns out the configuration warning you mentioned was the key.  The
akka.ask.timeout setting requires a duration with a unit, but the web.timeout
setting expects a plain long (in milliseconds).  So the change I made earlier
would not have been applied, since `300s` could not be read as a long.  Since
changing it to `web.timeout: 300000`, I have not been able to reproduce the
error - everything starts successfully every time.  I am leaving debug logging
turned on for now.  If it happens again in the next couple of days, I will
send details along with the debug logs.
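
For reference, a minimal sketch of how the two settings might look side by
side.  Only `web.timeout: 300000` is quoted from this thread; the file name
flink-conf.yaml and the exact `300 s` spelling of the duration are
assumptions.

```yaml
# flink-conf.yaml (sketch; file name assumed)

# Akka ask timeout: a duration string, so it needs a unit.
# `300 s` is an assumed spelling of the 300000 ms mentioned in this thread.
akka.ask.timeout: 300 s

# Web timeout: a plain long, in milliseconds.
# A value such as `300s` cannot be read as a long and is not applied.
web.timeout: 300000
```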

Thanks again for your help!
Greg

On Fri, Aug 31, 2018 at 3:21 PM Gary Yao <g...@data-artisans.com> wrote:

> Hi Greg,
>
> Unfortunately the environment information [1] is not logged. Can you set the
> log level for all Flink packages to DEBUG?
>
> Do you install Flink yourself on EMR, or do you use the pre-installed one?
> Can you show us the command with which you start the cluster/submit the
> job?
>
> I do not know if it is related but I found these warnings in your second
> log file:
>
>     2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration  - Configuration cannot evaluate value 300s as a long integer number
>     2018-08-31 19:14:32 WARN org.apache.flink.configuration.Configuration  - Configuration cannot evaluate value 300s as a long integer number
>
> Best,
> Gary
>
> [1]
> https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281
>
> On Fri, Aug 31, 2018 at 9:18 PM, Greg Finch <finchgreg...@gmail.com>
> wrote:
>
>> Well ... that didn't take long.  The next time I tried, I got the Akka
>> timeout again.  Attached are the logs from the last attempt.  They're very
>> similar to the other logs I sent.
>>
>> On Fri, Aug 31, 2018 at 2:04 PM Greg Finch <finchgreg...@gmail.com>
>> wrote:
>>
>>> Thanks Gary.  Attached is the jobmanager log.  You are correct that this
>>> is running on YARN.  I changed web.timeout as you suggested - that seems to
>>> be working the few times I tested it.  This problem comes and goes though -
>>> sometimes it starts before it times out.  I'll keep the web.timeout setting
>>> and reply again if the problem comes up again.  Thanks again for your quick
>>> response!
>>>
>>> On Fri, Aug 31, 2018 at 1:38 PM Gary Yao <g...@data-artisans.com> wrote:
>>>
>>>> Hi Greg,
>>>>
>>>> Can you describe the steps to reproduce the problem, or can you attach the
>>>> full jobmanager logs? Because JobExecutionResultHandler appears in your log,
>>>> I assume that you are starting a job cluster on YARN. Without seeing the
>>>> complete logs, I cannot be sure what exactly happens. For now, you can try
>>>> setting the config option web.timeout to a higher value.
>>>>
>>>> Best,
>>>> Gary
>>>>
>>>> On Fri, Aug 31, 2018 at 8:01 PM, Greg Finch <finchgreg...@gmail.com>
>>>> wrote:
>>>>
>>>>> I'm having a problem with an Akka timeout when starting my cluster.  The
>>>>> error is "Ask timed out after 10000 ms.".  I have changed the
>>>>> akka.ask.timeout config setting to 300000 ms, but it still times out
>>>>> and fails after 10 seconds.  I confirmed that the config is properly set
>>>>> by both checking the Job Manager configuration tab (it shows 300000 ms)
>>>>> and logging the output of AkkaUtils.getTimeout(configuration), which also
>>>>> shows 300000 ms.  It seems something is not honoring that configuration
>>>>> value.
>>>>>
>>>>> I did find a different thread that discussed the fact that the
>>>>> LocalStreamEnvironment will not honor this setting, but that is not my
>>>>> case.  I am running on a cluster (AWS EMR) using the regular
>>>>> StreamExecutionEnvironment.  This is Flink 1.5.2.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> ~~~~~
>>>>>
>>>>> 2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-139-66.ec2.internal:8041
>>>>> 2018-08-31 17:37:55 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new token for : ip-10-213-136-25.ec2.internal:8041
>>>>> 2018-08-31 17:38:34 ERROR o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler  - Implementation error: Unhandled exception.
>>>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>>>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>>>>>   at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>>>   at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>>>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>>>   at java.lang.Thread.run(Thread.java:748)
>>>>> 2018-08-31 17:38:41 INFO  org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Waiting for application to be successfully unregistered.
>>>>> 2018-08-31 17:38:41 INFO  o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Interrupted while waiting for queue
>>>>> java.lang.InterruptedException: null
>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>>>>>   at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>>>>>   at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>>>>>   at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
>>>>> 2018-08-31 17:38:42 WARN  akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-81 - Association with remote system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, address is now gated for [50] ms. Reason: [Disassociated]
>>>>>
>>>>>
>>>>>
>>>>
>
