Hi Steven,

Yes, GC is a big overhead; it can drive your CPU utilization to 100% and bring every process to a halt. We ran into this a while back too.
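If you haven't done so yet, it may help to turn on GC logging on the TaskManagers so you can see whether long pauses line up with the restarts. A minimal sketch for flink-conf.yaml (the log path below is just an example, adjust it for your setup):

    # pass GC-logging flags to the Flink JVMs (Java 8 style flags; log path is an example)
    env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/tmp/flink-gc.log"

With that in place you can correlate full-GC pause times with the akka timeouts in the logs.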
How much memory did you assign to the TaskManager? What was your CPU utilization when your TaskManager was considered 'killed'?

Bowen

On Wed, Aug 23, 2017 at 10:01 AM, Steven Wu <stevenz...@gmail.com> wrote:
> Till,
>
> Once our job is restarted for some reason (e.g. a TaskManager container got
> killed), it can get stuck in a continuous restart loop for hours. Right now, I
> suspect it is caused by GC pauses during restart; our job has very high
> memory allocation in steady state. A high GC pause then causes akka timeouts,
> which then cause the JobManager to think the TaskManager containers are
> unhealthy/dead and kill them. And the cycle repeats...
>
> But I haven't been able to prove or disprove it yet. When I was asking the
> question, I was still sifting through metrics and error logs.
>
> Thanks,
> Steven
>
>
> On Tue, Aug 22, 2017 at 1:21 AM, Till Rohrmann <till.rohrm...@gmail.com>
> wrote:
>
>> Hi Steven,
>>
>> A quick correction for Flink 1.2: indeed, the MetricFetcher does not pick up
>> the right timeout value from the configuration. Instead it uses a hardcoded
>> 10 s timeout. This has only been changed recently and is already committed
>> to master, so the next release (1.4) will properly pick up the configured
>> timeout.
>>
>> Just out of curiosity, what's the instability issue you're observing?
>>
>> Cheers,
>> Till
>>
>> On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> Till/Chesnay, thanks for the answers. Looks like this is a result/symptom
>>> of an underlying stability issue that I am trying to track down.
>>>
>>> It is Flink 1.2.
>>>
>>> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler <ches...@apache.org>
>>> wrote:
>>>
>>>> The MetricFetcher always uses the default akka timeout value.
>>>>
>>>>
>>>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>>>
>>>> Hi Steven,
>>>>
>>>> I thought that the MetricFetcher picks up the right timeout from the
>>>> configuration. Which version of Flink are you using?
>>>>
>>>> The timeout is not a critical problem for the job's health.
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu <stevenz...@gmail.com>
>>>> wrote:
>>>>
>>>>> We have set akka.ask.timeout to 60 s in the yaml file, and I also
>>>>> confirmed the setting in the Flink UI. But I saw an akka timeout of 10 s
>>>>> for the metric query service. Two questions:
>>>>> 1) Why doesn't the metric query use the 60 s value configured in the
>>>>> yaml file? Does it always use the default 10 s value?
>>>>> 2) Could this cause heartbeat failures between the TaskManager and the
>>>>> JobManager? Or is this just a non-critical failure that won't affect job
>>>>> health?
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> 2017-08-17 23:34:33,421 WARN  org.apache.flink.runtime.webmonitor.metrics.MetricFetcher
>>>>>   - Fetching metrics failed.
>>>>> akka.pattern.AskTimeoutException: Ask timed out on
>>>>> [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]]
>>>>> after [10000 ms]
>>>>>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>>>>>   at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>>>>>   at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>>>>>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>>>>>   at java.lang.Thread.run(Thread.java:748)
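For reference, a rough flink-conf.yaml sketch of the timeout knobs discussed in this thread (the values are illustrative only, not recommendations, and in Flink 1.2 the MetricFetcher ignores them and keeps its hardcoded 10 s timeout):

    # ask timeout used for JobManager <-> TaskManager RPC, read from flink-conf.yaml
    akka.ask.timeout: 60 s
    # death-watch settings that decide when a TaskManager is declared dead;
    # a larger pause gives long GC stops more headroom before the container is killed
    akka.watch.heartbeat.interval: 10 s
    akka.watch.heartbeat.pause: 100 s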