Till,

Once our job was restarted for some reason (e.g. a taskmanager container got
killed), it can get stuck in a continuous restart loop for hours. Right now, I
suspect it is caused by GC pauses during restart; our job has very high
memory allocation in steady state. A long GC pause then causes akka timeouts,
which lead the jobmanager to think the taskmanager containers are
unhealthy/dead and kill them. And the cycle repeats...
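One way I could confirm or rule out the GC theory is to turn on GC logging on the taskmanagers and line the pause timestamps up against the akka timeout errors. A sketch of what that might look like in flink-conf.yaml, assuming a Java 8 JVM (the log path is illustrative):

```yaml
# flink-conf.yaml (sketch): enable GC logging so long stop-the-world
# pauses can be correlated with the akka timeout timestamps.
# The log file path below is just an example.
env.java.opts: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/taskmanager-gc.log
```

If the "Total time for which application threads were stopped" entries approach the akka timeout, that would point at GC as the trigger.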

But I haven't been able to prove or disprove it yet. When I asked the
question, I was still sifting through metrics and error logs.

Thanks,
Steven


On Tue, Aug 22, 2017 at 1:21 AM, Till Rohrmann <till.rohrm...@gmail.com>
wrote:

> Hi Steven,
>
> quick correction for Flink 1.2. Indeed the MetricFetcher does not pick up
> the right timeout value from the configuration. Instead it uses a hardcoded
> 10s timeout. This has only been changed recently and is already committed
> in the master. So with the next release 1.4 it will properly pick up the
> right timeout settings.
>
> Just out of curiosity, what's the instability issue you're observing?
>
> Cheers,
> Till
>
> On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu <stevenz...@gmail.com> wrote:
>
> >> Till/Chesnay, thanks for the answers. Looks like this is a result/symptom
> >> of an underlying stability issue that I am trying to track down.
>>
>> It is Flink 1.2.
>>
>> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler <ches...@apache.org>
>> wrote:
>>
> >>> The MetricFetcher always uses the default akka timeout value.
>>>
>>>
>>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>>
>>> Hi Steven,
>>>
>>> I thought that the MetricFetcher picks up the right timeout from the
>>> configuration. Which version of Flink are you using?
>>>
>>> The timeout is not a critical problem for the job health.
>>>
>>> Cheers,
>>> Till
>>>
>>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>>
> >>>> We have set akka.ask.timeout to 60 s in the yaml file. I also confirmed the
> >>>> setting in the Flink UI. But I saw an akka timeout of 10 s for the metric query
> >>>> service. Two questions:
> >>>> 1) why doesn't the metric query use the 60 s value configured in the yaml
> >>>> file? does it always use the default 10 s value?
> >>>> 2) could this cause heartbeat failure between task manager and job
> >>>> manager? or is this just a non-critical failure that won't affect job health?
>>>>
>>>> Thanks,
>>>> Steven
>>>>
> >>>> 2017-08-17 23:34:33,421 WARN org.apache.flink.runtime.webmonitor.metrics.MetricFetcher - Fetching metrics failed.
> >>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]] after [10000 ms]
> >>>>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
> >>>>   at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
> >>>>   at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
> >>>>   at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> >>>>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
> >>>>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
> >>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
> >>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
> >>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
> >>>>   at java.lang.Thread.run(Thread.java:748)
>>>>
>>>
>>>
>>>
>>
>
