Hello,

We ran into the same problem and have gone through most of the same
steps/observations:

   - Increased memory
   - Increased CPU
   - No noticeable increase in GC activity
   - Little network I/O
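On the GC point: rather than eyeballing the log, it may be worth scanning it for long stop-the-world pauses programmatically (for what it's worth, the GC log quoted further down this thread does show a couple of ~1.2 s pauses). A minimal sketch, assuming the standard HotSpot 8 `[Times: ... real=X.XX secs]` line format produced by `-XX:+PrintGCDetails`:

```python
import re

def gc_pauses(gc_log_text, threshold_secs=1.0):
    """Extract wall-clock ('real') GC pause times and return those at or above the threshold."""
    pauses = [float(p) for p in re.findall(r"real=(\d+\.\d+) secs", gc_log_text)]
    return [p for p in pauses if p >= threshold_secs]

# Example against one line in the format quoted below in this thread:
line = ("[GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)] "
        "7092110K->812325K(20858880K), 1.1685237 secs] "
        "[Times: user=3.53 sys=0.71, real=1.17 secs]")
print(gc_pauses(line))  # -> [1.17]
```

Pipe the whole GC log through this and anything over a second shows up immediately, which is easier than scanning hundreds of minor-collection lines by hand.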

Our current setup has the liveness probe disabled and we've increased the
(Akka) timeouts. This seems to help a bit, but the cluster still feels very
unresponsive.
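For reference, the timeout knobs we raised live in flink-conf.yaml. The values below are only illustrative, not a recommendation; tune them for your own cluster:

```yaml
# flink-conf.yaml (Flink 1.6.x) -- illustrative values
akka.ask.timeout: 60 s     # RPC ask timeout (default: 10 s)
web.timeout: 60000         # REST/web handler timeout in ms (default: 10000)
heartbeat.timeout: 120000  # JM/TM heartbeat timeout in ms (default: 50000)
```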


On Fri, Jul 26, 2019 at 12:41 PM Biao Liu <mmyy1...@gmail.com> wrote:

> Hi Prakhar,
>
> Sorry, I don't have much experience with k8s. Maybe someone else could
> help.
>
> On Fri, Jul 26, 2019 at 6:20 PM Prakhar Mathur <prakha...@go-jek.com>
> wrote:
>
>> Hi,
>>
>> We were deploying our Flink clusters on YARN earlier, but then we moved
>> to Kubernetes; our clusters were not this big back then. Have you seen
>> issues with the JobManager REST server becoming unresponsive on
>> Kubernetes before?
>>
>> On Fri, Jul 26, 2019, 14:28 Biao Liu <mmyy1...@gmail.com> wrote:
>>
>>> Hi Prakhar,
>>>
>>> Sorry I could not find any abnormal message from your GC log and stack
>>> trace.
>>> Have you tried deploying the cluster another way, e.g. on YARN or
>>> standalone rather than Kubernetes? Just to narrow down the scope.
>>>
>>>
>>> On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur <prakha...@go-jek.com>
>>> wrote:
>>>
>>>>
>>>> On Mon, Jul 22, 2019, 16:08 Prakhar Mathur <prakha...@go-jek.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We enabled GC logging, here are the logs
>>>>>
>>>>> [GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)]
>>>>> 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)]
>>>>> 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)]
>>>>> 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00,
>>>>> real=0.09 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)]
>>>>> 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)]
>>>>> 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)]
>>>>> 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)]
>>>>> 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)]
>>>>> 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00,
>>>>> real=0.10 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)]
>>>>> 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)]
>>>>> 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)]
>>>>> 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)]
>>>>> 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)]
>>>>> 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)]
>>>>> 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71,
>>>>> real=1.17 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)]
>>>>> 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00,
>>>>> real=0.09 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)]
>>>>> 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)]
>>>>> 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00,
>>>>> real=0.08 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)]
>>>>> 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00,
>>>>> real=0.09 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)]
>>>>> 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00,
>>>>> real=0.09 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)]
>>>>> 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00,
>>>>> real=0.08 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)]
>>>>> 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)]
>>>>> 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)]
>>>>> 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)]
>>>>> 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)]
>>>>> 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22,
>>>>> real=1.18 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)]
>>>>> 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01,
>>>>> real=0.08 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)]
>>>>> 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)]
>>>>> 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00,
>>>>> real=0.08 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)]
>>>>> 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01,
>>>>> real=0.08 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)]
>>>>> 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)]
>>>>> 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)]
>>>>> 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00,
>>>>> real=0.09 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)]
>>>>> 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.06 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)]
>>>>> 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00,
>>>>> real=0.10 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)]
>>>>> 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00,
>>>>> real=0.07 secs]
>>>>> [GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)]
>>>>> 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00,
>>>>> real=0.08 secs]
>>>>>
>>>>>
>>>>> We went through the thread dump but couldn't find a reason for
>>>>> such behaviour:
>>>>>
>>>>> "flink-rest-server-netty-worker-thread-8" - Thread t@126233
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>         at
>>>>> sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>>         at
>>>>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>>>>         at
>>>>> sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>>>>         - locked <2654bbbf> (a
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>>>>         - locked <45bd4356> (a java.util.Collections$UnmodifiableSet)
>>>>>         - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl)
>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>>    Locked ownable synchronizers:
>>>>>         - None
>>>>>
>>>>> "flink-rest-server-netty-worker-thread-7" - Thread t@123834
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>         at
>>>>> sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>>         at
>>>>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>>>>         at
>>>>> sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>>>>         - locked <2a9b97d7> (a
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>>>>         - locked <29115878> (a java.util.Collections$UnmodifiableSet)
>>>>>         - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl)
>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>>    Locked ownable synchronizers:
>>>>>         - None
>>>>>
>>>>> "flink-rest-server-netty-worker-thread-6" - Thread t@121404
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>         at
>>>>> sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>>         at
>>>>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>>>>         at
>>>>> sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>>>>         - locked <6ee94bd6> (a
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>>>>         - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet)
>>>>>         - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl)
>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>>    Locked ownable synchronizers:
>>>>>         - None
>>>>>
>>>>> "flink-rest-server-netty-worker-thread-5" - Thread t@118945
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>         at
>>>>> sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>>         at
>>>>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>>>>         at
>>>>> sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>>>>         - locked <7553b573> (a
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>>>>         - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet)
>>>>>         - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl)
>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>>    Locked ownable synchronizers:
>>>>>         - None
>>>>>
>>>>> "flink-rest-server-netty-worker-thread-4" - Thread t@117073
>>>>>    java.lang.Thread.State: RUNNABLE
>>>>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>>>>         at
>>>>> sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>>>>         at
>>>>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>>>>         at
>>>>> sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>>>>         - locked <78572130> (a
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>>>>         - locked <1047fb64> (a java.util.Collections$UnmodifiableSet)
>>>>>         - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl)
>>>>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>>>>         at
>>>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>>    Locked ownable synchronizers:
>>>>>         - None
>>>>>
>>>>> "gcs-async-channel-pool-19" - Thread t@116997
>>>>>    java.lang.Thread.State: TIMED_WAITING
>>>>>         at sun.misc.Unsafe.park(Native Method)
>>>>>         - parking to wait for <1be52a4a> (a
>>>>> java.util.concurrent.SynchronousQueue$TransferStack)
>>>>>         at
>>>>> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>>>>>         at
>>>>> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>>>>>         at
>>>>> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>>>>>         at
>>>>> java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>>>>>         at
>>>>> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
>>>>>         at
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>>>>>         at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>>    Locked ownable synchronizers:
>>>>>         - None
>>>>>
>>>>> "RMI TCP Connection(36)-127.0.0.1" - Thread t@114070
>>>>>    java.lang.Thread.State: TIMED_WAITING
>>>>>         at java.lang.Object.wait(Native Method)
>>>>>         - waiting on <22b14b0b> (a
>>>>> com.sun.jmx.remote.internal.ArrayNotificationBuffer)
>>>>>         at
>>>>> com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449)
>>>>>         at
>>>>> com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227)
>>>>>         at
>>>>> com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274)
>>>>>         at
>>>>> javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270)
>>>>>         at
>>>>> javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268)
>>>>>         at
>>>>> javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274)
>>>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>         at
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>>         at
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>>>>         at
>>>>> sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
>>>>>         at sun.rmi.transport.Transport$1.run(Transport.java:200)
>>>>>         at sun.rmi.transport.Transport$1.run(Transport.java:197)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
>>>>>         at
>>>>> sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
>>>>>         at
>>>>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
>>>>>         at
>>>>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
>>>>>         at
>>>>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown
>>>>> Source)
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>>         at
>>>>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
>>>>>         at
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>>         at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>>         at java.lang.Thread.run(Thread.java:748)
>>>>>
>>>>> On Thu, Jul 18, 2019 at 12:25 PM Biao Liu <mmyy1...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Your GC metrics seem fine. You could double-check against the GC log
>>>>>> if you have it enabled; the GC log is more direct.
>>>>>> I'm not sure what's happening in your JobManager, but I'm pretty sure
>>>>>> Flink can support a far larger cluster than yours.
>>>>>>
>>>>>> Have you checked the JobManager log file? Is there any suspicious
>>>>>> warning or error in it?
>>>>>> Have you tried an analysis tool such as jstack to inspect the
>>>>>> internal state of the JobManager?
>>>>>>
>>>>>> It's hard to do a deeper analysis based on the current information.
>>>>>> It would help if you could provide more details.
>>>>>>
>>>>>>
>>>>>> Prakhar Mathur <prakha...@go-jek.com> 于2019年7月18日周四 下午2:12写道:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We are using v1.6.2; currently the number of TaskManagers is 70.
>>>>>>> We also have the GC metrics on a dashboard. The sum of
>>>>>>> Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is
>>>>>>> somewhere between 75 and 125,
>>>>>>> and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.
>>>>>>>
>>>>>>> On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <mmyy1...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Prakhar,
>>>>>>>>
>>>>>>>> Have you checked the garbage collection of the master?
>>>>>>>> Which version of Flink are you using? How many TaskManagers are in
>>>>>>>> your cluster?
>>>>>>>>
>>>>>>>>
>>>>>>>> Prakhar Mathur <prakha...@go-jek.com> 于2019年7月18日周四 下午1:54写道:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We have deployed multiple Flink clusters on Kubernetes, each with 1
>>>>>>>>> JobManager replica and as many TaskManager replicas as required.
>>>>>>>>> Recently we have observed that on increasing the number of
>>>>>>>>> TaskManagers for a cluster, the JobManager becomes unresponsive. It
>>>>>>>>> stops sending statsd metrics for irregular intervals. The JobManager
>>>>>>>>> pod also keeps restarting because it stops responding to the liveness
>>>>>>>>> probe, which results in Kubernetes killing the pod. We tried
>>>>>>>>> increasing the resources given (CPU, RAM), but it didn't help.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Prakhar Mathur
>>>>>>>>> Product Engineer
>>>>>>>>> GO-JEK
>>>>>>>>>
>>>>>>>>