Hi,

Based on your GC metrics it seems fine, but you could double-check by enabling the GC log — it is more direct. I'm not sure what is happening in your JobManager, but I'm pretty sure Flink can support a far larger cluster than yours.
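For example, assuming you are on Java 8 and set the JobManager JVM options through flink-conf.yaml, something like the following should produce a GC log (the log path is just an example, adjust it to your image):

  # GC logging for the JobManager (Java 8 flags; log path is an example)
  env.java.opts.jobmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/opt/flink/log/jobmanager-gc.log

Long stop-the-world pauses would then show up in that file directly, which is easier to reason about than aggregated metrics.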
Have you checked the JobManager log file? Is there any suspicious warning or error in it? Have you tried any analysis tools to inspect the internal state of the JobManager, such as jstack (see the sketch at the bottom of this mail)? It's hard to do a deeper analysis based on the current information; it would help if you could provide more details.

Prakhar Mathur <prakha...@go-jek.com> wrote on Thu, Jul 18, 2019 at 2:12 PM:
> Hi,
>
> We are using v1.6.2, and the number of TaskManagers is currently 70. We
> also have the GC metrics on a dashboard. The sum of
> Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is
> somewhere between 75 and 125,
> and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.
>
> On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <mmyy1...@gmail.com> wrote:
>
>> Hi Prakhar,
>>
>> Have you checked the garbage collection of the master?
>> Which version of Flink are you using? How many TaskManagers are in your
>> cluster?
>>
>> Prakhar Mathur <prakha...@go-jek.com> wrote on Thu, Jul 18, 2019 at 1:54 PM:
>>
>>> Hello,
>>>
>>> We have deployed multiple Flink clusters on Kubernetes, each with 1
>>> replica of the JobManager and as many TaskManagers as required. Recently
>>> we have observed that when we increase the number of TaskManagers for a
>>> cluster, the JobManager becomes unresponsive. It stops sending statsd
>>> metrics at irregular intervals, and the JobManager pod keeps restarting
>>> because it stops responding to the liveness probe, which results in
>>> Kubernetes killing the pod. We tried increasing the resources given
>>> (CPU, RAM), but it didn't help.
>>>
>>> Regards
>>> Prakhar Mathur
>>> Product Engineer
>>> GO-JEK
>>>
>>
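P.S. In case it helps, here is a rough sketch of grabbing a thread dump from the JobManager pod on Kubernetes. The pod name, the PID, and the presence of JDK tools inside the image are assumptions, so adjust them to your deployment:

  # find the JobManager pod (the grep pattern is an assumption)
  kubectl get pods | grep jobmanager

  # list Java processes inside the pod to find the JobManager PID
  # (requires a full JDK in the image; jps/jstack are missing from JRE-only images)
  kubectl exec <jobmanager-pod> -- jps

  # dump all JVM threads to a local file for analysis
  kubectl exec <jobmanager-pod> -- jstack <pid> > jobmanager-threads.txt

A few dumps taken while the JobManager is unresponsive should show whether its dispatcher threads are blocked or busy.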