Hi Steven,
    Yes, GC is a big overhead; it may cause your CPU utilization to reach
100%, at which point every process stops making progress. We ran into this a
while back too.

    How much memory did you assign to the TaskManager? And what was your CPU
utilization when your TaskManager was considered 'killed'?
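
    In case it helps, here is a minimal sketch of how we sized the heap and
turned on GC logging to confirm the theory. The values and the log path are
only examples, not your actual settings:

    # flink-conf.yaml (illustrative values)
    taskmanager.heap.mb: 4096
    # GC logging for the Flink JVMs (JDK 8 flags)
    env.java.opts: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/flink-gc.log

    If the GC log shows long full-GC pauses that line up with the akka
timeouts, that usually confirms GC as the culprit.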

Bowen



On Wed, Aug 23, 2017 at 10:01 AM, Steven Wu <stevenz...@gmail.com> wrote:

> Till,
>
> Once our job gets restarted for some reason (e.g. a taskmanager container got
> killed), it can get stuck in a continuous restart loop for hours. Right now, I
> suspect it is caused by GC pauses during restart, since our job has very high
> memory allocation in steady state. Long GC pauses then cause akka timeouts,
> which in turn cause the jobmanager to think the taskmanager containers are
> unhealthy/dead and kill them. And the cycle repeats...
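>
> If GC pauses during recovery turn out to be the cause, one thing we are
> considering is relaxing the akka/death-watch timeouts so that a long pause
> does not immediately mark a taskmanager as dead. A rough sketch of the
> relevant flink-conf.yaml knobs (the values below are just placeholders we
> have not validated yet):
>
>     akka.ask.timeout: 60 s
>     akka.watch.heartbeat.interval: 10 s
>     akka.watch.heartbeat.pause: 120 s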
>
> But I haven't been able to prove or disprove it yet. When I was asking the
> question, I was still sifting through metrics and error logs.
>
> Thanks,
> Steven
>
>
> On Tue, Aug 22, 2017 at 1:21 AM, Till Rohrmann <till.rohrm...@gmail.com>
> wrote:
>
>> Hi Steven,
>>
>> a quick correction for Flink 1.2: the MetricFetcher indeed does not pick up
>> the right timeout value from the configuration. Instead it uses a hardcoded
>> 10 s timeout. This has only been changed recently and is already committed
>> to master, so with the next release, 1.4, it will properly pick up the
>> configured timeout settings.
>>
>> Just out of curiosity, what's the instability issue you're observing?
>>
>> Cheers,
>> Till
>>
>> On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> Till/Chesnay, thanks for the answers. Looks like this is a result/symptom
>>> of an underlying stability issue that I am trying to track down.
>>>
>>> It is Flink 1.2.
>>>
>>> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler <ches...@apache.org>
>>> wrote:
>>>
>>>> The MetricFetcher always uses the default akka timeout value.
>>>>
>>>>
>>>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>>>
>>>> Hi Steven,
>>>>
>>>> I thought that the MetricFetcher picks up the right timeout from the
>>>> configuration. Which version of Flink are you using?
>>>>
>>>> The timeout is not a critical problem for job health.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu <stevenz...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> We have set akka.ask.timeout to 60 s in the yaml file. I also confirmed
>>>>> the setting in the Flink UI. But I saw an akka timeout of 10 s for the
>>>>> metric query service. Two questions:
>>>>> 1) Why doesn't the metric query use the 60 s value configured in the yaml
>>>>> file? Does it always use the default 10 s value?
>>>>> 2) Could this cause heartbeat failures between the task manager and the
>>>>> job manager? Or is this just a non-critical failure that won't affect job
>>>>> health?
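>>>>>
>>>>> For reference, this is roughly the relevant line in our flink-conf.yaml
>>>>> (a sketch, showing only that one setting):
>>>>>
>>>>>     akka.ask.timeout: 60 s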
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> 2017-08-17 23:34:33,421 WARN  org.apache.flink.runtime.webmonitor.metrics.MetricFetcher - Fetching metrics failed.
>>>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]] after [10000 ms]
>>>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>>>>>     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>>>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>>>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>>>>>     at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
