Great to hear that you could figure things out, Steven. You are right: the death watch is no longer linked to the akka ask timeout, because of FLINK-6495. Thanks for the feedback, I will correct the documentation.

Cheers,
Till
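For reference, the two mechanisms are controlled by separate keys. A minimal flink-conf.yaml sketch of the settings discussed in this thread (key names as in the 1.x configuration docs; exact defaults vary between versions, and the values here are only examples):

    # RPC ask timeout (the value set to 60 s later in this thread)
    akka.ask.timeout: 60 s

    # death watch, tuned independently of akka.ask.timeout
    akka.watch.heartbeat.interval: 10 s
    akka.watch.heartbeat.pause: 60 s
    akka.watch.threshold: 12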
On Sat, Sep 23, 2017 at 10:24 AM, Steven Wu <stevenz...@gmail.com> wrote:

> just to close the thread: the akka death watch was triggered by a high GC
> pause, which was caused by a memory leak in our code during Flink job
> restarts.
>
> also noted that akka.ask.timeout wasn't related to the akka death watch,
> even though the Flink documentation links the two.
>
> On Sat, Aug 26, 2017 at 10:58 AM, Steven Wu <stevenz...@gmail.com> wrote:
>
>> this is a stateless job, so we don't use RocksDB.
>>
>> yeah, the network can also be a possibility. will keep it on the radar.
>> unfortunately, our metrics system doesn't have the TCP metrics when
>> running inside containers.
>>
>> On Fri, Aug 25, 2017 at 2:09 PM, Robert Metzger <rmetz...@apache.org> wrote:
>>
>>> Hi,
>>> are you using the RocksDB state backend already?
>>> Maybe writing the state to disk would actually reduce the pressure on
>>> the GC (but of course it'll also reduce throughput a bit).
>>>
>>> Are there any known issues with the network? Maybe the network bursts
>>> on restart cause the timeouts?
>>>
>>> On Fri, Aug 25, 2017 at 6:17 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>>
>>>> Bowen,
>>>>
>>>> Heap size is ~50G. CPU was actually pretty low (< 20%) when the high
>>>> GC pauses and akka timeouts were happening, so maybe memory allocation
>>>> and GC weren't really the issue. I also recently learned that the JVM
>>>> can pause while writing to the GC log because of disk I/O; that is
>>>> another lead I am pursuing.
>>>>
>>>> Thanks,
>>>> Steven
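One way to chase down the GC-log lead mentioned above is to turn on detailed GC and safepoint-pause logging. A minimal flink-conf.yaml sketch, assuming JDK 8 HotSpot flags; the log path is a placeholder, and keeping it on fast local storage (e.g. tmpfs) avoids the very write stalls being investigated:

    # hypothetical example: detailed, rotated GC logging for the Flink JVMs (JDK 8 flags)
    env.java.opts: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/flink-gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M

-XX:+PrintGCApplicationStoppedTime reports the total time the application was stopped at safepoints, so it also surfaces pauses that are not caused by GC work itself.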
>>>> On Wed, Aug 23, 2017 at 10:58 AM, Bowen Li <bowen...@offerupnow.com> wrote:
>>>>
>>>>> Hi Steven,
>>>>> Yes, GC is a big overhead; it may cause your CPU utilization to
>>>>> reach 100%, and every process stops working. We ran into this a while
>>>>> ago too.
>>>>>
>>>>> How much memory did you assign to the TaskManager? How high was your
>>>>> CPU utilization when your TaskManager was considered 'killed'?
>>>>>
>>>>> Bowen
>>>>>
>>>>> On Wed, Aug 23, 2017 at 10:01 AM, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>
>>>>>> Till,
>>>>>>
>>>>>> Once our job gets restarted for some reason (e.g. a taskmanager
>>>>>> container got killed), it can get stuck in a continuous restart loop
>>>>>> for hours. Right now, I suspect it is caused by GC pauses during the
>>>>>> restart; our job has very high memory allocation in steady state.
>>>>>> High GC pauses then cause akka timeouts, which then cause the
>>>>>> jobmanager to think the taskmanager containers are unhealthy/dead
>>>>>> and kill them. And the cycle repeats...
>>>>>>
>>>>>> But I haven't been able to prove or disprove it yet. When I was
>>>>>> asking the question, I was still sifting through metrics and error
>>>>>> logs.
>>>>>>
>>>>>> Thanks,
>>>>>> Steven
>>>>>>
>>>>>> On Tue, Aug 22, 2017 at 1:21 AM, Till Rohrmann <till.rohrm...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Steven,
>>>>>>>
>>>>>>> quick correction for Flink 1.2: indeed the MetricFetcher does not
>>>>>>> pick up the right timeout value from the configuration. Instead it
>>>>>>> uses a hardcoded 10 s timeout. This has only been changed recently
>>>>>>> and is already committed on master, so with the next release, 1.4,
>>>>>>> it will properly pick up the configured timeout.
>>>>>>>
>>>>>>> Just out of curiosity, what's the instability issue you're observing?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Till
>>>>>>>
>>>>>>> On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Till/Chesnay, thanks for the answers. Looks like this is a
>>>>>>>> result/symptom of an underlying stability issue that I am trying
>>>>>>>> to track down.
>>>>>>>>
>>>>>>>> It is Flink 1.2.
>>>>>>>>
>>>>>>>> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler <ches...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> The MetricFetcher always uses the default akka timeout value.
>>>>>>>>>
>>>>>>>>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>>>>>>>>
>>>>>>>>> Hi Steven,
>>>>>>>>>
>>>>>>>>> I thought that the MetricFetcher picks up the right timeout from
>>>>>>>>> the configuration. Which version of Flink are you using?
>>>>>>>>>
>>>>>>>>> The timeout is not a critical problem for the job's health.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Till
>>>>>>>>>
>>>>>>>>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu <stevenz...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> We have set akka.ask.timeout to 60 s in the yaml file, and I also
>>>>>>>>>> confirmed the setting in the Flink UI. But I saw an akka timeout
>>>>>>>>>> of 10 s for the metric query service. Two questions:
>>>>>>>>>> 1) why doesn't the metric query use the 60 s value configured in
>>>>>>>>>> the yaml file? does it always use the default 10 s value?
>>>>>>>>>> 2) could this cause heartbeat failures between the task manager
>>>>>>>>>> and the job manager? or is this just a non-critical failure that
>>>>>>>>>> won't affect job health?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Steven
>>>>>>>>>>
>>>>>>>>>> 2017-08-17 23:34:33,421 WARN  org.apache.flink.runtime.webmonitor.metrics.MetricFetcher  - Fetching metrics failed.
>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]] after [10000 ms]
>>>>>>>>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>>>>>>>>>>     at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>>>>>>>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>>>>>>>>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>>>>>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>>>>>>>>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>>>>>>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>>>>>>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>>>>>>>>>>     at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>>>>>>>>>>     at java.lang.Thread.run(Thread.java:748)