Quick question, Steven. Where did you find the documentation stating that
the death watch interval is linked to the akka ask timeout? It was included
in the past, but I couldn't find it anymore.

Cheers,
Till

On Mon, Sep 25, 2017 at 9:47 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> Great to hear that you could figure things out Steven.
>
> You are right. The death watch is no longer linked to the akka ask
> timeout, because of FLINK-6495. Thanks for the feedback. I will correct the
> documentation.
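>
> For reference, a minimal sketch of how these settings look in
> flink-conf.yaml (the values below are only illustrative, so please
> double-check them against the configuration docs for your version):
>
>   akka.ask.timeout: 10 s                # timeout for akka asks / RPC calls
>   akka.watch.heartbeat.interval: 10 s   # death watch heartbeat interval
>   akka.watch.heartbeat.pause: 60 s      # acceptable heartbeat pause before a peer is suspected dead
>   akka.watch.threshold: 12              # death watch failure detector threshold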
>
> Cheers,
> Till
>
> On Sat, Sep 23, 2017 at 10:24 AM, Steven Wu <stevenz...@gmail.com> wrote:
>
>> Just to close the thread: the akka death watch was triggered by high GC
>> pause, which was caused by a memory leak in our code during Flink job
>> restart.
>>
>> Also noted that akka.ask.timeout wasn't related to the akka death watch,
>> even though the Flink documentation links the two.
>>
>> On Sat, Aug 26, 2017 at 10:58 AM, Steven Wu <stevenz...@gmail.com> wrote:
>>
>>> this is a stateless job, so we don't use RocksDB.
>>>
>>> yeah, the network can also be a possibility; will keep it on the radar.
>>> unfortunately, our metrics system doesn't have the TCP metrics when
>>> running inside containers.
>>>
>>> On Fri, Aug 25, 2017 at 2:09 PM, Robert Metzger <rmetz...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>> are you using the RocksDB state backend already?
>>>> Maybe writing the state to disk would actually reduce the pressure on
>>>> the GC (but of course it'll also reduce throughput a bit).
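>>>>
>>>> If you want to try it, a rough flink-conf.yaml sketch (the checkpoint
>>>> URI below is just a placeholder for your own durable storage):
>>>>
>>>>   state.backend: rocksdb
>>>>   state.backend.fs.checkpointdir: hdfs:///flink/checkpoints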
>>>>
>>>> Are there any known issues with the network? Maybe the network bursts
>>>> on restart cause the timeouts?
>>>>
>>>>
>>>> On Fri, Aug 25, 2017 at 6:17 PM, Steven Wu <stevenz...@gmail.com>
>>>> wrote:
>>>>
>>>>> Bowen,
>>>>>
>>>>> Heap size is ~50G. CPU was actually pretty low (like <20%) when the
>>>>> high GC pauses and akka timeouts were happening. So maybe memory
>>>>> allocation and GC weren't really the issue. I also recently learned
>>>>> that the JVM can pause while writing to the GC log due to disk I/O;
>>>>> that is another lead I am pursuing.
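>>>>>
>>>>> One way to chase that lead would be to turn on GC pause logging and
>>>>> keep the GC log on a fast local disk, e.g. something like the
>>>>> following in flink-conf.yaml (Java 8 flags; the log path is just a
>>>>> placeholder):
>>>>>
>>>>>   env.java.opts: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/flink-gc.log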
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Wed, Aug 23, 2017 at 10:58 AM, Bowen Li <bowen...@offerupnow.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Steven,
>>>>>>     Yes, GC is a big overhead; it may cause your CPU utilization to
>>>>>> reach 100% and every process to stop working. We ran into this a
>>>>>> while back too.
>>>>>>
>>>>>>     How much memory did you assign to the TaskManager? What was your
>>>>>> CPU utilization when your TaskManager was considered 'killed'?
>>>>>>
>>>>>> Bowen
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Aug 23, 2017 at 10:01 AM, Steven Wu <stevenz...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Till,
>>>>>>>
>>>>>>> Once our job was restarted for some reason (e.g. a taskmanager
>>>>>>> container got killed), it could get stuck in a continuous restart
>>>>>>> loop for hours. Right now, I suspect it is caused by GC pauses
>>>>>>> during restart, since our job has very high memory allocation in
>>>>>>> steady state. High GC pauses then caused akka timeouts, which then
>>>>>>> caused the jobmanager to think the taskmanager containers are
>>>>>>> unhealthy/dead and kill them. And the cycle repeats...
>>>>>>>
>>>>>>> But I haven't been able to prove or disprove it yet. When I was
>>>>>>> asking the question, I was still sifting through metrics and error
>>>>>>> logs.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Steven
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Aug 22, 2017 at 1:21 AM, Till Rohrmann <
>>>>>>> till.rohrm...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Steven,
>>>>>>>>
>>>>>>>> quick correction for Flink 1.2. Indeed the MetricFetcher does not
>>>>>>>> pick up the right timeout value from the configuration. Instead it 
>>>>>>>> uses a
>>>>>>>> hardcoded 10s timeout. This has only been changed recently and is 
>>>>>>>> already
>>>>>>>> committed in the master. So with the next release 1.4 it will properly 
>>>>>>>> pick
>>>>>>>> up the right timeout settings.
>>>>>>>>
>>>>>>>> Just out of curiosity, what's the instability issue you're
>>>>>>>> observing?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Till
>>>>>>>>
>>>>>>>> On Fri, Aug 18, 2017 at 7:07 PM, Steven Wu <stevenz...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Till/Chesnay, thanks for the answers. Looks like this is a
>>>>>>>>> result/symptom of an underlying stability issue that I am trying
>>>>>>>>> to track down.
>>>>>>>>>
>>>>>>>>> It is Flink 1.2.
>>>>>>>>>
>>>>>>>>> On Fri, Aug 18, 2017 at 12:24 AM, Chesnay Schepler <
>>>>>>>>> ches...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> The MetricFetcher always uses the default akka timeout value.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 18.08.2017 09:07, Till Rohrmann wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Steven,
>>>>>>>>>>
>>>>>>>>>> I thought that the MetricFetcher picks up the right timeout from
>>>>>>>>>> the configuration. Which version of Flink are you using?
>>>>>>>>>>
>>>>>>>>>> The timeout is not a critical problem for the job health.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Till
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 18, 2017 at 7:22 AM, Steven Wu <stevenz...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We have set akka.ask.timeout to 60 s in the yaml file. I also
>>>>>>>>>>> confirmed the setting in the Flink UI. But I saw an akka timeout
>>>>>>>>>>> of 10 s for the metric query service. Two questions:
>>>>>>>>>>> 1) Why doesn't the metric query use the 60 s value configured in
>>>>>>>>>>> the yaml file? Does it always use the default 10 s value?
>>>>>>>>>>> 2) Could this cause heartbeat failures between the task manager
>>>>>>>>>>> and the job manager? Or is this just a non-critical failure that
>>>>>>>>>>> won't affect job health?
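>>>>>>>>>>>
>>>>>>>>>>> For reference, the corresponding line in our flink-conf.yaml is
>>>>>>>>>>> roughly:
>>>>>>>>>>>
>>>>>>>>>>>   akka.ask.timeout: 60 s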
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Steven
>>>>>>>>>>>
>>>>>>>>>>> 2017-08-17 23:34:33,421 WARN  org.apache.flink.runtime.webmonitor.metrics.MetricFetcher - Fetching metrics failed.
>>>>>>>>>>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@1.2.3.4:39139/user/MetricQueryService_23cd9db754bb7d123d80e6b1c0be21d6]] after [10000 ms]
>>>>>>>>>>>   at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:334)
>>>>>>>>>>>   at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
>>>>>>>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
>>>>>>>>>>>   at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>>>>>>>>>   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
>>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:474)
>>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:425)
>>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:429)
>>>>>>>>>>>   at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:381)
>>>>>>>>>>>   at java.lang.Thread.run(Thread.java:748)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
