Hi Prakhar,

Sorry I could not find any abnormal message from your GC log and stack
trace.
Have you ever tried deploying the cluster in other ways? Not on Kubernetes.
Like on YARN or standalone. Just for narrowing down the scope.


On Tue, Jul 23, 2019 at 12:34 PM Prakhar Mathur <prakha...@go-jek.com>
wrote:

>
> On Mon, Jul 22, 2019, 16:08 Prakhar Mathur <prakha...@go-jek.com> wrote:
>
>> Hi,
>>
>> We enabled GC logging, here are the logs
>>
>> [GC (Allocation Failure) [PSYoungGen: 6482015K->70303K(6776832K)]
>> 6955827K->544194K(20823552K), 0.0591479 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6587039K->38642K(6763008K)]
>> 7060930K->512614K(20809728K), 0.0740978 secs] [Times: user=0.08 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6502858K->192077K(6734848K)]
>> 6976829K->666144K(20781568K), 0.0841759 secs] [Times: user=0.17 sys=0.00,
>> real=0.09 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6647378K->50108K(6759424K)]
>> 7121446K->524248K(20806144K), 0.0622997 secs] [Times: user=0.08 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6501890K->60606K(6779904K)]
>> 6976029K->534961K(20826624K), 0.0637955 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6586046K->40411K(6768640K)]
>> 7060401K->514839K(20815360K), 0.0729137 secs] [Times: user=0.08 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6543919K->51886K(6797824K)]
>> 7018346K->526385K(20844544K), 0.0649143 secs] [Times: user=0.09 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6601690K->191832K(6754304K)]
>> 7076190K->666427K(20801024K), 0.1029686 secs] [Times: user=0.18 sys=0.00,
>> real=0.10 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6742947K->62693K(6781952K)]
>> 7217543K->537361K(20828672K), 0.0639272 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6589986K->66299K(6770688K)]
>> 7064653K->541039K(20817408K), 0.0701853 secs] [Times: user=0.08 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6590742K->42995K(6800896K)]
>> 7065481K->517798K(20847616K), 0.0595729 secs] [Times: user=0.08 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6608678K->66127K(6793728K)]
>> 7083482K->541011K(20840448K), 0.0608270 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6608886K->62063K(6822400K)]
>> 7083769K->537027K(20869120K), 0.0675917 secs] [Times: user=0.10 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6617146K->200674K(6812160K)]
>> 7092110K->812325K(20858880K), 1.1685237 secs] [Times: user=3.53 sys=0.71,
>> real=1.17 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6773610K->194848K(6633984K)]
>> 7385261K->806700K(20680704K), 0.0858601 secs] [Times: user=0.19 sys=0.00,
>> real=0.09 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6617888K->44002K(6723072K)]
>> 7229740K->655854K(20769792K), 0.0647387 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6420043K->194672K(6702080K)]
>> 7031895K->806604K(20748800K), 0.0833792 secs] [Times: user=0.18 sys=0.00,
>> real=0.08 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6603376K->187059K(6596096K)]
>> 7215308K->799063K(20642816K), 0.0906529 secs] [Times: user=0.17 sys=0.00,
>> real=0.09 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6572690K->51850K(6715904K)]
>> 7184694K->663990K(20762624K), 0.0837285 secs] [Times: user=0.11 sys=0.00,
>> real=0.09 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6452380K->44766K(6708224K)]
>> 7064519K->656993K(20754944K), 0.0809864 secs] [Times: user=0.10 sys=0.00,
>> real=0.08 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6445790K->42654K(6730752K)]
>> 7058017K->654961K(20777472K), 0.0686401 secs] [Times: user=0.08 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6473374K->41410K(6723072K)]
>> 7085681K->654085K(20769792K), 0.0714228 secs] [Times: user=0.08 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6448507K->59031K(6752256K)]
>> 7061182K->671786K(20798976K), 0.0599829 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6489819K->68484K(6739968K)]
>> 7102574K->681619K(20786688K), 0.0674541 secs] [Times: user=0.09 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6537251K->257001K(6625792K)]
>> 7150386K->938268K(20672512K), 1.1775771 secs] [Times: user=4.11 sys=0.22,
>> real=1.18 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6575609K->76090K(6695936K)]
>> 7256876K->757765K(20742656K), 0.0805662 secs] [Times: user=0.11 sys=0.01,
>> real=0.08 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6444858K->42242K(6719488K)]
>> 7126533K->723917K(20766208K), 0.0611949 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6399332K->143985K(6710784K)]
>> 7081006K->826084K(20757504K), 0.0771257 secs] [Times: user=0.16 sys=0.00,
>> real=0.08 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6503905K->67207K(6736896K)]
>> 7186004K->750758K(20783616K), 0.0808856 secs] [Times: user=0.11 sys=0.01,
>> real=0.08 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6468318K->59200K(6724608K)]
>> 7151869K->743099K(20771328K), 0.0709461 secs] [Times: user=0.09 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6484294K->51134K(6757376K)]
>> 7168193K->735120K(20804096K), 0.0772947 secs] [Times: user=0.11 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6503020K->77337K(6747648K)]
>> 7187007K->761395K(20794368K), 0.0835827 secs] [Times: user=0.12 sys=0.00,
>> real=0.09 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6505125K->66972K(6783488K)]
>> 7189184K->751110K(20830208K), 0.0671168 secs] [Times: user=0.09 sys=0.00,
>> real=0.06 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6544467K->201750K(6734336K)]
>> 7228606K->885969K(20781056K), 0.0931485 secs] [Times: user=0.20 sys=0.00,
>> real=0.10 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6733846K->43858K(6763520K)]
>> 7418065K->728157K(20810240K), 0.0683046 secs] [Times: user=0.09 sys=0.00,
>> real=0.07 secs]
>> [GC (Allocation Failure) [PSYoungGen: 6525200K->66239K(6756864K)]
>> 7209498K->750617K(20803584K), 0.0764187 secs] [Times: user=0.10 sys=0.00,
>> real=0.08 secs]
>>
>>
>> We were going through the thread dump, but couldn't find a reason for
>> such behaviour
>>
>> "flink-rest-server-netty-worker-thread-8" - Thread t@126233
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>         at
>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>         - locked <2654bbbf> (a
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>         - locked <45bd4356> (a java.util.Collections$UnmodifiableSet)
>>         - locked <24d4aa0> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>         at java.lang.Thread.run(Thread.java:748)
>>
>>    Locked ownable synchronizers:
>>         - None
>>
>> "flink-rest-server-netty-worker-thread-7" - Thread t@123834
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>         at
>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>         - locked <2a9b97d7> (a
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>         - locked <29115878> (a java.util.Collections$UnmodifiableSet)
>>         - locked <61909ff> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>         at java.lang.Thread.run(Thread.java:748)
>>
>>    Locked ownable synchronizers:
>>         - None
>>
>> "flink-rest-server-netty-worker-thread-6" - Thread t@121404
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>         at
>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>         - locked <6ee94bd6> (a
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>         - locked <183a4cf3> (a java.util.Collections$UnmodifiableSet)
>>         - locked <3b4b1514> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>         at java.lang.Thread.run(Thread.java:748)
>>
>>    Locked ownable synchronizers:
>>         - None
>>
>> "flink-rest-server-netty-worker-thread-5" - Thread t@118945
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>         at
>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>         - locked <7553b573> (a
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>         - locked <51c0f7f4> (a java.util.Collections$UnmodifiableSet)
>>         - locked <1bd774eb> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>         at java.lang.Thread.run(Thread.java:748)
>>
>>    Locked ownable synchronizers:
>>         - None
>>
>> "flink-rest-server-netty-worker-thread-4" - Thread t@117073
>>    java.lang.Thread.State: RUNNABLE
>>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
>>         at
>> sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
>>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>>         - locked <78572130> (a
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySet)
>>         - locked <1047fb64> (a java.util.Collections$UnmodifiableSet)
>>         - locked <466ec1cf> (a sun.nio.ch.EPollSelectorImpl)
>>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:765)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:413)
>>         at
>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909)
>>         at java.lang.Thread.run(Thread.java:748)
>>
>>    Locked ownable synchronizers:
>>         - None
>>
>> "gcs-async-channel-pool-19" - Thread t@116997
>>    java.lang.Thread.State: TIMED_WAITING
>>         at sun.misc.Unsafe.park(Native Method)
>>         - parking to wait for <1be52a4a> (a
>> java.util.concurrent.SynchronousQueue$TransferStack)
>>         at
>> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>>         at
>> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>>         at
>> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>>         at
>> java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1073)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>         at java.lang.Thread.run(Thread.java:748)
>>
>>    Locked ownable synchronizers:
>>         - None
>>
>> "RMI TCP Connection(36)-127.0.0.1" - Thread t@114070
>>    java.lang.Thread.State: TIMED_WAITING
>>         at java.lang.Object.wait(Native Method)
>>         - waiting on <22b14b0b> (a
>> com.sun.jmx.remote.internal.ArrayNotificationBuffer)
>>         at
>> com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(ArrayNotificationBuffer.java:449)
>>         at
>> com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(ArrayNotificationBuffer.java:227)
>>         at
>> com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(ServerNotifForwarder.java:274)
>>         at
>> javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1270)
>>         at
>> javax.management.remote.rmi.RMIConnectionImpl$4.run(RMIConnectionImpl.java:1268)
>>         at
>> javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(RMIConnectionImpl.java:1274)
>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>         at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>         at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>         at java.lang.reflect.Method.invoke(Method.java:498)
>>         at
>> sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:357)
>>         at sun.rmi.transport.Transport$1.run(Transport.java:200)
>>         at sun.rmi.transport.Transport$1.run(Transport.java:197)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at sun.rmi.transport.Transport.serviceCall(Transport.java:196)
>>         at
>> sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:573)
>>         at
>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:834)
>>         at
>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:688)
>>         at
>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$592/690739656.run(Unknown
>> Source)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at
>> sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:687)
>>         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>         at java.lang.Thread.run(Thread.java:748)
>>
>> On Thu, Jul 18, 2019 at 12:25 PM Biao Liu <mmyy1...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> It seems to be good based on your GC metrics. You could double check the
>>> GC log if you enable it. The GC log is more direct.
>>> I'm not sure what's happening in your JobManager. But I'm pretty sure
>>> that Flink could support far more larger scale cluster than yours.
>>>
>>> Have you ever checked the log file of JobManager? Is there any
>>> suspicious warning or error log?
>>> Have you ever tried some analytic tools to check the internal state of
>>> JobManager, like jstack.
>>>
>>> It's hard to do a deeper analysis based on current informations. It
>>> might be helpful if you could provide more details.
>>>
>>>
>>> Prakhar Mathur <prakha...@go-jek.com> 于2019年7月18日周四 下午2:12写道:
>>>
>>>> Hi,
>>>>
>>>> We are using v1.6.2, currently, the number of TaskManagers are 70. We
>>>> have the GC metrics on a dashboard also. Sum of
>>>> Status.JVM.GarbageCollector.MarkSweepCompact.Time grouped by 1 min is
>>>> somewhere between 75 to 125
>>>> and Status.JVM.GarbageCollector.MarkSweepCompact.Count is fixed at 10.
>>>>
>>>> On Thu, Jul 18, 2019 at 11:32 AM Biao Liu <mmyy1...@gmail.com> wrote:
>>>>
>>>>> Hi Prakhar,
>>>>>
>>>>> Have you ever checked the garbage collection of master?
>>>>> Which version of Flink are you using? How many TaskManagers in your
>>>>> cluster?
>>>>>
>>>>>
>>>>> Prakhar Mathur <prakha...@go-jek.com> 于2019年7月18日周四 下午1:54写道:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We have deployed multiple Flink clusters on Kubernetess with 1
>>>>>> replica of Jobmanager and multiple of Taskmanager as per the requirement.
>>>>>> Recently we are observing that on increasing the number of Taskmanagers 
>>>>>> for
>>>>>> a cluster, the Jobmanager becomes irresponsive. It stops sending statsd
>>>>>> metric for some irregular interval. Even the Jobmanager pod keeps
>>>>>> restarting because it stops responding to the liveliness probe which
>>>>>> results in Kubernetes killing the pod. We tried increasing the resources
>>>>>> given(CPU, RAM) but it didn't help.
>>>>>>
>>>>>> Regards
>>>>>> Prakhar Mathur
>>>>>> Product Engineer
>>>>>> GO-JEK
>>>>>>
>>>>>

Reply via email to