Hi,

Based on your GC metrics it seems fine, but you could double-check by enabling the GC log — it is more direct. I'm not sure what is happening in your JobManager, but I'm pretty sure Flink can support a far larger cluster than yours.
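For example, assuming you are on Java 8 and set the JobManager JVM options through flink-conf.yaml, something like the following should produce a GC log (the log path is just an example, adjust it to your image):

  # GC logging for the JobManager (Java 8 flags; log path is an example)
  env.java.opts.jobmanager: -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/opt/flink/log/jobmanager-gc.log

Long stop-the-world pauses would then show up in that file directly, which is easier to reason about than aggregated metrics.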
Have you checked the JobManager log file? Is there any suspicious warning or error in it? Have you tried any analysis tools to inspect the internal state of the JobManager, such as jstack (see the sketch at the bottom of this mail)? It's hard to do a deeper analysis based on the current information; it would help if you could provide more details.

Prakhar Mathur <prakha...@go-jek.com> wrote on Thu, Jul 18, 2019 at 2:12 PM:
> Hi,
>
> We are using v1.6.2, and the number of TaskManagers is currently 70. We
> also have the GC metrics on a dashboard. The sum of
> Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is
> somewhere between 75 and 125,
> and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.
>
> On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <mmyy1...@gmail.com> wrote:
>
>> Hi Prakhar,
>>
>> Have you checked the garbage collection of the master?
>> Which version of Flink are you using? How many TaskManagers are in your
>> cluster?
>>
>> Prakhar Mathur <prakha...@go-jek.com> wrote on Thu, Jul 18, 2019 at 1:54 PM:
>>
>>> Hello,
>>>
>>> We have deployed multiple Flink clusters on Kubernetes, each with 1
>>> replica of the JobManager and as many TaskManagers as required. Recently
>>> we have observed that when we increase the number of TaskManagers for a
>>> cluster, the JobManager becomes unresponsive. It stops sending statsd
>>> metrics at irregular intervals, and the JobManager pod keeps restarting
>>> because it stops responding to the liveness probe, which results in
>>> Kubernetes killing the pod. We tried increasing the resources given
>>> (CPU, RAM), but it didn't help.
>>>
>>> Regards
>>> Prakhar Mathur
>>> Product Engineer
>>> GO-JEK
>>>
>>
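P.S. In case it helps, here is a rough sketch of grabbing a thread dump from the JobManager pod on Kubernetes. The pod name, the PID, and the presence of JDK tools inside the image are assumptions, so adjust them to your deployment:

  # find the JobManager pod (the grep pattern is an assumption)
  kubectl get pods | grep jobmanager

  # list Java processes inside the pod to find the JobManager PID
  # (requires a full JDK in the image; jps/jstack are missing from JRE-only images)
  kubectl exec <jobmanager-pod> -- jps

  # dump all JVM threads to a local file for analysis
  kubectl exec <jobmanager-pod> -- jstack <pid> > jobmanager-threads.txt

A few dumps taken while the JobManager is unresponsive should show whether its dispatcher threads are blocked or busy.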