Hi Greg,

Unfortunately the environment information [1] is not logged. Can you set the
log level for all Flink packages to DEBUG?
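
As a rough sketch, assuming your installation uses the log4j.properties file
in Flink's conf directory (this is an assumption on my part; if you are on
logback, the equivalent would be a
<logger name="org.apache.flink" level="DEBUG"/> element in logback.xml),
a line like this should enable it:

```properties
# Assumed file: conf/log4j.properties (log4j 1.x syntax)
# Raise the level of every logger under org.apache.flink to DEBUG
log4j.logger.org.apache.flink=DEBUG
```

The cluster has to be restarted for the change to show up in the jobmanager
log.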

Do you install Flink yourself on EMR, or do you use the pre-installed one?
Can you show us the command with which you start the cluster/submit the job?

I do not know if it is related, but I found these warnings in your second
log file:

    2018-08-31 19:14:32 WARN  org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number
    2018-08-31 19:14:32 WARN  org.apache.flink.configuration.Configuration - Configuration cannot evaluate value 300s as a long integer number
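
To illustrate what the warning suggests: some option that is read as a plain
number seems to have been set to 300s, and a value with a unit suffix does not
parse as a long. The snippet below is only a sketch using plain JDK parsing.
It is not Flink's actual Configuration code, and the class and method names
(TimeoutValueCheck, parsesAsLong) are made up for the example.

```java
// Illustration only: plain JDK parsing, not Flink's actual Configuration code.
public class TimeoutValueCheck {

    /** Returns true if the raw config value parses as a plain long. */
    static boolean parsesAsLong(String rawValue) {
        try {
            Long.parseLong(rawValue.trim());
            return true;
        } catch (NumberFormatException e) {
            // A unit suffix such as "s" makes the value non-numeric.
            return false;
        }
    }

    public static void main(String[] args) {
        // Value taken from the warning in the log
        System.out.println("300s   -> " + parsesAsLong("300s"));
        // The same duration written as plain milliseconds
        System.out.println("300000 -> " + parsesAsLong("300000"));
    }
}
```

If one of your options expects a number of milliseconds, writing the value
without the unit (for example 300000) may make the warning go away. I cannot
tell from the log alone which option is affected.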

Best,
Gary

[1]
https://github.com/apache/flink/blob/9ae5009b6a82248bfae99dac088c1f6e285aa70f/flink-runtime/src/main/java/org/apache/flink/runtime/util/EnvironmentInformation.java#L281

On Fri, Aug 31, 2018 at 9:18 PM, Greg Finch <finchgreg...@gmail.com> wrote:

> Well ... that didn't take long.  The next time I tried, I got the Akka
> timeout again.  Attached are the logs from the last attempt.  They're very
> similar to the other logs I sent.
>
> On Fri, Aug 31, 2018 at 2:04 PM Greg Finch <finchgreg...@gmail.com> wrote:
>
>> Thanks Gary.  Attached is the jobmanager log.  You are correct that this
>> is running on YARN.  I changed web.timeout as you suggested - that seems to
>> be working the few times I tested it.  This problem comes and goes though -
>> sometimes it starts before it times out.  I'll keep the web.timeout setting
>> and reply again if the problem comes up again.  Thanks again for your quick
>> response!
>>
>> On Fri, Aug 31, 2018 at 1:38 PM Gary Yao <g...@data-artisans.com> wrote:
>>
>>> Hi Greg,
>>>
>>> Can you describe the steps to reproduce the problem, or can you attach the
>>> full jobmanager logs? Because JobExecutionResultHandler appears in your log, I
>>> assume that you are starting a job cluster on YARN. Without seeing the
>>> complete logs, I cannot be sure what exactly happens. For now, you can try
>>> setting the config option web.timeout to a higher value.
>>>
>>> Best,
>>> Gary
>>>
>>> On Fri, Aug 31, 2018 at 8:01 PM, Greg Finch <finchgreg...@gmail.com>
>>> wrote:
>>>
>>>> I'm having a problem with an Akka timeout when starting my cluster.  The
>>>> error is "Ask timed out after 10000 ms.".  I have changed the
>>>> akka.ask.timeout config setting to be 300000 ms, but it still times out and
>>>> fails after 10 seconds.  I confirmed that the config is properly set by
>>>> both checking the Job Manager configuration tab (it shows 300000 ms) as
>>>> well as logging the output of AkkaUtils.getTimeout(configuration), which
>>>> also shows 300000 ms.  It seems something is not honoring that configuration
>>>> value.
>>>>
>>>> I did find a different thread that discussed the fact that the
>>>> LocalStreamEnvironment will not honor this setting, but that is not my
>>>> case.  I am running on a cluster (AWS EMR) using the regular
>>>> StreamExecutionEnvironment.  This is Flink 1.5.2.
>>>>
>>>> Any ideas?
>>>>
>>>> ~~~~~
>>>>
>>>> 2018-08-31 17:37:55 INFO  
>>>> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new 
>>>> token for : ip-10-213-139-66.ec2.internal:8041
>>>> 2018-08-31 17:37:55 INFO  
>>>> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Received new 
>>>> token for : ip-10-213-136-25.ec2.internal:8041
>>>> 2018-08-31 17:38:34 ERROR 
>>>> o.a.flink.runtime.rest.handler.job.JobExecutionResultHandler  - 
>>>> Implementation error: Unhandled exception.
>>>> akka.pattern.AskTimeoutException: Ask timed out on 
>>>> [Actor[akka://flink/user/dispatcher#-219618710]] after [10000 ms]. 
>>>> Sender[null] sent message of type 
>>>> "org.apache.flink.runtime.rpc.messages.LocalFencedMessage".
>>>>    at 
>>>> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>>>>    at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>>    at 
>>>> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>>    at 
>>>> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>    at 
>>>> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>>    at 
>>>> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>>    at 
>>>> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>>    at 
>>>> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>>    at 
>>>> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>>    at java.lang.Thread.run(Thread.java:748)
>>>> 2018-08-31 17:38:41 INFO  
>>>> org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl  - Waiting for 
>>>> application to be successfully unregistered.
>>>> 2018-08-31 17:38:41 INFO  
>>>> o.a.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl  - Interrupted 
>>>> while waiting for queue
>>>> java.lang.InterruptedException: null
>>>>    at 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>>>>    at 
>>>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
>>>>    at 
>>>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>>>>    at 
>>>> org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:323)
>>>> 2018-08-31 17:38:42 WARN  akka.remote.ReliableDeliverySupervisor 
>>>> flink-akka.remote.default-remote-dispatcher-81 - Association with remote 
>>>> system [akka.tcp://flink@ip-10-213-142-102.ec2.internal:42027] has failed, 
>>>> address is now gated for [50] ms. Reason: [Disassociated]
>>>>
>>>>
>>>>
>>>
