Hi, We are seeing really weird GC behaviour on our Kafka cluster. After a clean stop/start of the whole cluster, the servers go into a loop of increasingly slow GC cycles for an hour and then seem to recover to a normal state without any external changes to the servers or traffic. The load to the cluster is minimal - 10->100 req/s.
Our setup: - Kafka 0.8.1.1 - 3 brokers running in AWS (t2.medium instances with EBS storage) - 4GB memory with 1.5GB allocated to the Java heap - Under 10 topics, with replication factor of 3 & 1-10 partitions per topic - Ubuntu 14.04, Oracle JDK 1.7 As mentioned, we start up the Kafka cluster and start running very low load to it (less than 100 messages/second produced, initially in the low tens). After a couple of minutes the brokers start running GC with each cycle taking longer (1s -> 60s), which results in the cluster moving partition leadership around (presumably as the leader becomes unresponsive during GC). Them nearly an hour later the GC times drop down and end up in the 10-20 ms range after a few cycles. And the load to the cluster was actually increasing at this time (but still probably less than 100 req/s). It doesn't seem to be directly related to an external resource (e.g EBS or instance problems) as we've seen this multiple times and with newly provisioned servers also. Here's an example of a GC log from one broker: Java HotSpot(TM) 64-Bit Server VM (24.76-b04) for linux-amd64 JRE (1.7.0_76-b13), built on Dec 18 2014 16:58:39 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8) Memory: 4k page, physical 4046804k(2532400k free), swap 0k(0k free) CommandLine flags: -XX:+CMSClassUnloadingEnabled -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC -XX:InitialHeapSize=1572864000 -XX:+ManagementServer -XX:MaxHeapSize=1572864000 -XX:MaxNewSize=174485504 -XX:MaxTenuringThreshold=6 -XX:NewSize=174485504 -XX:OldPLABSize=16 -XX:OldSize=348971008 -XX:+PrintGC -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops -XX:+UseConcMarkSweepGC -XX:+UseParNewGC 2015-03-26T07:16:50.119+0200: 16.396: [GC [1 CMS-initial-mark: 512000K(1365632K)] 593874K(1518976K), 0.0188480 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 2015-03-26T07:16:50.138+0200: 16.415: [CMS-concurrent-mark-start] 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-mark: 0.042/0.044 secs] [Times: user=0.06 sys=0.01, real=0.04 secs] 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-preclean-start] 2015-03-26T07:16:50.202+0200: 16.479: [CMS-concurrent-preclean: 0.019/0.020 secs] [Times: user=0.03 sys=0.01, real=0.02 secs] 2015-03-26T07:16:50.202+0200: 16.479: [CMS-concurrent-abortable-preclean-start] CMS: abort preclean due to time 2015-03-26T07:16:55.286+0200: 21.563: [CMS-concurrent-abortable-preclean: 0.178/5.085 secs] [Times: user=0.75 sys=0.01, real=5.09 secs] 2015-03-26T07:16:55.287+0200: 21.564: [GC[YG occupancy: 106584 K (153344 K)]2015-03-26T07:16:55.287+0200: 21.564: [GC2015-03-26T07:16:55.287+0200: 21.564: [ParNew: 106584K->10552K(153344K), 0.0189980 secs] 618584K->522552K(1518976K), 0.0190710 secs] [Times: user=0.04 sys=0.00, real=0.02 secs] 2015-03-26T07:16:55.306+0200: 21.583: [Rescan (parallel) , 0.0085400 secs]2015-03-26T07:16:55.314+0200: 21.591: [weak refs processing, 0.0000230 secs]2015-03-26T07:16:55.314+0200: 21.591: [class unloading, 0.0006920 secs]2015-03-26T07:16:55.315+0200: 21.592: [scrub symbol table, 0.0011410 secs]2015-03-26T07:16:55.316+0200: 21.593: [scrub string table, 0.0001530 secs] [1 CMS-remark: 512000K(1365632K)] 522552K(1518976K), 0.0298860 secs] [Times: user=0.06 sys=0.00, real=0.03 secs] 2015-03-26T07:16:55.318+0200: 21.595: [CMS-concurrent-sweep-start] 2015-03-26T07:16:55.322+0200: 21.599: [CMS-concurrent-sweep: 0.004/0.004 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 2015-03-26T07:16:55.322+0200: 21.600: [CMS-concurrent-reset-start] 2015-03-26T07:16:55.329+0200: 21.606: [CMS-concurrent-reset: 0.006/0.007 secs] [Times: user=0.00 sys=0.01, real=0.01 secs] 2015-03-26T07:20:08.027+0200: 214.304: [GC2015-03-26T07:20:08.027+0200: 214.304: [ParNew: 146872K->2825K(153344K), 2.7813230 secs] 658872K->518034K(1518976K), 2.7814290 secs] [Times: user=5.58 sys=0.00, real=2.78 secs] 2015-03-26T07:20:56.462+0200: 262.739: [GC2015-03-26T07:20:56.462+0200: 262.739: [ParNew: 139145K->6899K(153344K), 2.9533990 secs] 654354K->522108K(1518976K), 2.9534960 secs] [Times: user=5.93 sys=0.00, real=2.95 secs] 2015-03-26T07:21:51.915+0200: 318.192: [GC2015-03-26T07:21:51.915+0200: 318.192: [ParNew: 143219K->14567K(153344K), 2.9688390 secs] 658428K->529776K(1518976K), 2.9689670 secs] [Times: user=5.97 sys=0.00, real=2.97 secs] 2015-03-26T07:22:55.621+0200: 381.898: [GC2015-03-26T07:22:55.621+0200: 381.898: [ParNew: 150887K->14000K(153344K), 4.3638880 secs] 666096K->530231K(1518976K), 4.3639820 secs] [Times: user=8.76 sys=0.00, real=4.37 secs] 2015-03-26T07:24:02.103+0200: 448.380: [GC2015-03-26T07:24:02.103+0200: 448.380: [ParNew: 150320K->7374K(153344K), 4.4849670 secs] 666551K->523605K(1518976K), 4.4850660 secs] [Times: user=9.00 sys=0.00, real=4.49 secs] 2015-03-26T07:25:06.110+0200: 512.388: [GC2015-03-26T07:25:06.111+0200: 512.388: [ParNew: 143694K->14118K(153344K), 4.4852630 secs] 659925K->530349K(1518976K), 4.4853500 secs] [Times: user=9.00 sys=0.00, real=4.48 secs] 2015-03-26T07:26:14.406+0200: 580.684: [GC2015-03-26T07:26:14.407+0200: 580.684: [ParNew: 150438K->16563K(153344K), 11.0235000 secs] 666669K->537384K(1518976K), 11.0236080 secs] [Times: user=22.12 sys=0.00, real=11.02 secs] 2015-03-26T07:27:59.141+0200: 685.418: [GC2015-03-26T07:27:59.141+0200: 685.418: [ParNew: 152883K->13870K(153344K), 13.1984820 secs] 673704K->534691K(1518976K), 13.1985800 secs] [Times: user=26.44 sys=0.00, real=13.20 secs] 2015-03-26T07:29:45.825+0200: 792.102: [GC2015-03-26T07:29:45.825+0200: 792.102: [ParNew: 150190K->8749K(153344K), 16.4442890 secs] 671011K->531093K(1518976K), 16.4443990 secs] [Times: user=32.92 sys=0.00, real=16.44 secs] 2015-03-26T07:31:43.393+0200: 909.670: [GC2015-03-26T07:31:43.393+0200: 909.670: [ParNew: 145069K->16666K(153344K), 17.1249960 secs] 667413K->539010K(1518976K), 17.1250910 secs] [Times: user=34.28 sys=0.00, real=17.13 secs] 2015-03-26T07:33:48.146+0200: 1034.423: [GC2015-03-26T07:33:48.146+0200: 1034.423: [ParNew: 152986K->10212K(153344K), 23.7178960 secs] 675330K->536361K(1518976K), 23.7180010 secs] [Times: user=47.45 sys=0.00, real=23.71 secs] 2015-03-26T07:35:41.187+0200: 1147.464: [GC2015-03-26T07:35:41.187+0200: 1147.464: [ParNew: 146532K->15576K(153344K), 25.4787270 secs] 672681K->541725K(1518976K), 25.4788360 secs] [Times: user=50.97 sys=0.00, real=25.48 secs] 2015-03-26T07:37:26.072+0200: 1252.349: [GC2015-03-26T07:37:26.072+0200: 1252.349: [ParNew: 151896K->9807K(153344K), 30.2121220 secs] 678045K->539111K(1518976K), 30.2122180 secs] [Times: user=60.42 sys=0.00, real=30.22 secs] 2015-03-26T07:40:14.518+0200: 1420.795: [GC2015-03-26T07:40:14.518+0200: 1420.795: [ParNew: 146127K->17024K(153344K), 32.0691660 secs] 675431K->546342K(1518976K), 32.0692640 secs] [Times: user=64.11 sys=0.00, real=32.07 secs] 2015-03-26T07:41:47.096+0200: 1513.373: [GC2015-03-26T07:41:47.096+0200: 1513.373: [ParNew: 153344K->10479K(153344K), 40.0872490 secs] 682662K->543861K(1518976K), 40.0873620 secs] [Times: user=80.16 sys=0.00, real=40.09 secs] 2015-03-26T07:43:20.840+0200: 1607.117: [GC2015-03-26T07:43:20.840+0200: 1607.118: [ParNew: 146799K->17024K(153344K), 41.4909450 secs] 680181K->550413K(1518976K), 41.4910490 secs] [Times: user=82.97 sys=0.00, real=41.49 secs] 2015-03-26T07:45:38.200+0200: 1744.477: [GC2015-03-26T07:45:38.200+0200: 1744.477: [ParNew: 153344K->11552K(153344K), 44.5490790 secs] 686733K->546792K(1518976K), 44.5491900 secs] [Times: user=89.08 sys=0.00, real=44.55 secs] 2015-03-26T07:49:56.546+0200: 2002.823: [GC2015-03-26T07:49:56.546+0200: 2002.823: [ParNew: 147872K->10963K(153344K), 45.4665010 secs] 683112K->546204K(1518976K), 45.4666110 secs] [Times: user=90.91 sys=0.00, real=45.46 secs] 2015-03-26T07:54:35.158+0200: 2281.435: [GC2015-03-26T07:54:35.158+0200: 2281.435: [ParNew: 147283K->12000K(153344K), 45.3558830 secs] 682524K->547241K(1518976K), 45.3559960 secs] [Times: user=90.68 sys=0.00, real=45.35 secs] 2015-03-26T07:58:58.888+0200: 2545.165: [GC2015-03-26T07:58:58.888+0200: 2545.165: [ParNew: 148320K->8521K(153344K), 50.8855520 secs] 683561K->546682K(1518976K), 50.8856520 secs] [Times: user=101.73 sys=0.00, real=50.88 secs] 2015-03-26T08:03:25.615+0200: 2811.892: [GC2015-03-26T08:03:25.615+0200: 2811.892: [ParNew: 144841K->6887K(153344K), 56.1553500 secs] 683002K->547305K(1518976K), 56.1554700 secs] [Times: user=112.26 sys=0.00, real=56.16 secs] 2015-03-26T08:07:59.738+0200: 3086.015: [GC2015-03-26T08:07:59.738+0200: 3086.015: [ParNew: 143207K->5223K(153344K), 11.1719880 secs] 683625K->547498K(1518976K), 11.1720970 secs] [Times: user=22.33 sys=0.00, real=11.17 secs] 2015-03-26T08:11:41.909+0200: 3308.186: [GC2015-03-26T08:11:41.909+0200: 3308.186: [ParNew: 141543K->5778K(153344K), 0.0121180 secs] 683818K->548443K(1518976K), 0.0121990 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 2015-03-26T08:15:05.817+0200: 3512.094: [GC2015-03-26T08:15:05.817+0200: 3512.094: [ParNew: 142098K->10943K(153344K), 0.0228440 secs] 684763K->554166K(1518976K), 0.0229390 secs] [Times: user=0.04 sys=0.00, real=0.02 secs] 2015-03-26T08:16:50.297+0200: 3616.574: [GC2015-03-26T08:16:50.297+0200: 3616.574: [ParNew: 147263K->16262K(153344K), 0.0269190 secs] 690486K->560618K(1518976K), 0.0270040 secs] [Times: user=0.05 sys=0.01, real=0.02 secs] We know the EBS storage is less than ideal, but we've tested a similar setup with significantly higher simulated load without any issues. Any suggestions on what to check, test etc? Best regards, Timo