Hi,

We are seeing really weird GC behaviour on our Kafka cluster. After a clean
stop/start of the whole cluster, the servers go into a loop of increasingly
slow GC cycles for an hour and then seem to recover to a normal state
without any external changes to the servers or traffic. The load to the
cluster is minimal - 10->100 req/s.

Our setup:

- Kafka 0.8.1.1
- 3 brokers running in AWS (t2.medium instances with EBS storage)
- 4GB memory with 1.5GB allocated to the Java heap
- Under 10 topics, with replication factor of 3  & 1-10 partitions per topic
- Ubuntu 14.04, Oracle JDK 1.7

As mentioned, we start up the Kafka cluster and start running very low load
to it (less than 100 messages/second produced, initially in the low tens).
After a couple of minutes the brokers start running GC with each cycle
taking longer (1s -> 60s), which results in the cluster moving partition
leadership around (presumably as the leader becomes unresponsive during
GC). Them nearly an hour later the GC times drop down and end up in the
10-20 ms range after a few cycles. And the load to the cluster was actually
increasing at this time (but still probably less than 100 req/s). It
doesn't seem to be directly related to an external resource (e.g EBS or
instance problems) as we've seen this multiple times and with newly
provisioned servers also.

Here's an example of a GC log from one broker:

Java HotSpot(TM) 64-Bit Server VM (24.76-b04) for linux-amd64 JRE
(1.7.0_76-b13), built on Dec 18 2014 16:58:39 by "java_re" with gcc 4.3.0
20080428 (Red Hat 4.3.0-8)

Memory: 4k page, physical 4046804k(2532400k free), swap 0k(0k free)

CommandLine flags: -XX:+CMSClassUnloadingEnabled
-XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC
-XX:InitialHeapSize=1572864000 -XX:+ManagementServer
-XX:MaxHeapSize=1572864000 -XX:MaxNewSize=174485504
-XX:MaxTenuringThreshold=6 -XX:NewSize=174485504 -XX:OldPLABSize=16
-XX:OldSize=348971008 -XX:+PrintGC -XX:+PrintGCDateStamps
-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC

2015-03-26T07:16:50.119+0200: 16.396: [GC [1 CMS-initial-mark:
512000K(1365632K)] 593874K(1518976K), 0.0188480 secs] [Times: user=0.02
sys=0.00, real=0.02 secs]

2015-03-26T07:16:50.138+0200: 16.415: [CMS-concurrent-mark-start]

2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-mark: 0.042/0.044
secs] [Times: user=0.06 sys=0.01, real=0.04 secs]

2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-preclean-start]

2015-03-26T07:16:50.202+0200: 16.479: [CMS-concurrent-preclean: 0.019/0.020
secs] [Times: user=0.03 sys=0.01, real=0.02 secs]

2015-03-26T07:16:50.202+0200: 16.479:
[CMS-concurrent-abortable-preclean-start]

 CMS: abort preclean due to time 2015-03-26T07:16:55.286+0200: 21.563:
[CMS-concurrent-abortable-preclean: 0.178/5.085 secs] [Times: user=0.75
sys=0.01, real=5.09 secs]

2015-03-26T07:16:55.287+0200: 21.564: [GC[YG occupancy: 106584 K (153344
K)]2015-03-26T07:16:55.287+0200: 21.564: [GC2015-03-26T07:16:55.287+0200:
21.564: [ParNew: 106584K->10552K(153344K), 0.0189980 secs]
618584K->522552K(1518976K), 0.0190710 secs] [Times: user=0.04 sys=0.00,
real=0.02 secs]

2015-03-26T07:16:55.306+0200: 21.583: [Rescan (parallel) , 0.0085400
secs]2015-03-26T07:16:55.314+0200: 21.591: [weak refs processing, 0.0000230
secs]2015-03-26T07:16:55.314+0200: 21.591: [class unloading, 0.0006920
secs]2015-03-26T07:16:55.315+0200: 21.592: [scrub symbol table, 0.0011410
secs]2015-03-26T07:16:55.316+0200: 21.593: [scrub string table, 0.0001530
secs] [1 CMS-remark: 512000K(1365632K)] 522552K(1518976K), 0.0298860 secs]
[Times: user=0.06 sys=0.00, real=0.03 secs]

2015-03-26T07:16:55.318+0200: 21.595: [CMS-concurrent-sweep-start]

2015-03-26T07:16:55.322+0200: 21.599: [CMS-concurrent-sweep: 0.004/0.004
secs] [Times: user=0.01 sys=0.00, real=0.00 secs]

2015-03-26T07:16:55.322+0200: 21.600: [CMS-concurrent-reset-start]

2015-03-26T07:16:55.329+0200: 21.606: [CMS-concurrent-reset: 0.006/0.007
secs] [Times: user=0.00 sys=0.01, real=0.01 secs]

2015-03-26T07:20:08.027+0200: 214.304: [GC2015-03-26T07:20:08.027+0200:
214.304: [ParNew: 146872K->2825K(153344K), 2.7813230 secs]
658872K->518034K(1518976K), 2.7814290 secs] [Times: user=5.58 sys=0.00,
real=2.78 secs]

2015-03-26T07:20:56.462+0200: 262.739: [GC2015-03-26T07:20:56.462+0200:
262.739: [ParNew: 139145K->6899K(153344K), 2.9533990 secs]
654354K->522108K(1518976K), 2.9534960 secs] [Times: user=5.93 sys=0.00,
real=2.95 secs]

2015-03-26T07:21:51.915+0200: 318.192: [GC2015-03-26T07:21:51.915+0200:
318.192: [ParNew: 143219K->14567K(153344K), 2.9688390 secs]
658428K->529776K(1518976K), 2.9689670 secs] [Times: user=5.97 sys=0.00,
real=2.97 secs]

2015-03-26T07:22:55.621+0200: 381.898: [GC2015-03-26T07:22:55.621+0200:
381.898: [ParNew: 150887K->14000K(153344K), 4.3638880 secs]
666096K->530231K(1518976K), 4.3639820 secs] [Times: user=8.76 sys=0.00,
real=4.37 secs]

2015-03-26T07:24:02.103+0200: 448.380: [GC2015-03-26T07:24:02.103+0200:
448.380: [ParNew: 150320K->7374K(153344K), 4.4849670 secs]
666551K->523605K(1518976K), 4.4850660 secs] [Times: user=9.00 sys=0.00,
real=4.49 secs]

2015-03-26T07:25:06.110+0200: 512.388: [GC2015-03-26T07:25:06.111+0200:
512.388: [ParNew: 143694K->14118K(153344K), 4.4852630 secs]
659925K->530349K(1518976K), 4.4853500 secs] [Times: user=9.00 sys=0.00,
real=4.48 secs]

2015-03-26T07:26:14.406+0200: 580.684: [GC2015-03-26T07:26:14.407+0200:
580.684: [ParNew: 150438K->16563K(153344K), 11.0235000 secs]
666669K->537384K(1518976K), 11.0236080 secs] [Times: user=22.12 sys=0.00,
real=11.02 secs]

2015-03-26T07:27:59.141+0200: 685.418: [GC2015-03-26T07:27:59.141+0200:
685.418: [ParNew: 152883K->13870K(153344K), 13.1984820 secs]
673704K->534691K(1518976K), 13.1985800 secs] [Times: user=26.44 sys=0.00,
real=13.20 secs]

2015-03-26T07:29:45.825+0200: 792.102: [GC2015-03-26T07:29:45.825+0200:
792.102: [ParNew: 150190K->8749K(153344K), 16.4442890 secs]
671011K->531093K(1518976K), 16.4443990 secs] [Times: user=32.92 sys=0.00,
real=16.44 secs]

2015-03-26T07:31:43.393+0200: 909.670: [GC2015-03-26T07:31:43.393+0200:
909.670: [ParNew: 145069K->16666K(153344K), 17.1249960 secs]
667413K->539010K(1518976K), 17.1250910 secs] [Times: user=34.28 sys=0.00,
real=17.13 secs]

2015-03-26T07:33:48.146+0200: 1034.423: [GC2015-03-26T07:33:48.146+0200:
1034.423: [ParNew: 152986K->10212K(153344K), 23.7178960 secs]
675330K->536361K(1518976K), 23.7180010 secs] [Times: user=47.45 sys=0.00,
real=23.71 secs]

2015-03-26T07:35:41.187+0200: 1147.464: [GC2015-03-26T07:35:41.187+0200:
1147.464: [ParNew: 146532K->15576K(153344K), 25.4787270 secs]
672681K->541725K(1518976K), 25.4788360 secs] [Times: user=50.97 sys=0.00,
real=25.48 secs]

2015-03-26T07:37:26.072+0200: 1252.349: [GC2015-03-26T07:37:26.072+0200:
1252.349: [ParNew: 151896K->9807K(153344K), 30.2121220 secs]
678045K->539111K(1518976K), 30.2122180 secs] [Times: user=60.42 sys=0.00,
real=30.22 secs]

2015-03-26T07:40:14.518+0200: 1420.795: [GC2015-03-26T07:40:14.518+0200:
1420.795: [ParNew: 146127K->17024K(153344K), 32.0691660 secs]
675431K->546342K(1518976K), 32.0692640 secs] [Times: user=64.11 sys=0.00,
real=32.07 secs]

2015-03-26T07:41:47.096+0200: 1513.373: [GC2015-03-26T07:41:47.096+0200:
1513.373: [ParNew: 153344K->10479K(153344K), 40.0872490 secs]
682662K->543861K(1518976K), 40.0873620 secs] [Times: user=80.16 sys=0.00,
real=40.09 secs]

2015-03-26T07:43:20.840+0200: 1607.117: [GC2015-03-26T07:43:20.840+0200:
1607.118: [ParNew: 146799K->17024K(153344K), 41.4909450 secs]
680181K->550413K(1518976K), 41.4910490 secs] [Times: user=82.97 sys=0.00,
real=41.49 secs]

2015-03-26T07:45:38.200+0200: 1744.477: [GC2015-03-26T07:45:38.200+0200:
1744.477: [ParNew: 153344K->11552K(153344K), 44.5490790 secs]
686733K->546792K(1518976K), 44.5491900 secs] [Times: user=89.08 sys=0.00,
real=44.55 secs]

2015-03-26T07:49:56.546+0200: 2002.823: [GC2015-03-26T07:49:56.546+0200:
2002.823: [ParNew: 147872K->10963K(153344K), 45.4665010 secs]
683112K->546204K(1518976K), 45.4666110 secs] [Times: user=90.91 sys=0.00,
real=45.46 secs]

2015-03-26T07:54:35.158+0200: 2281.435: [GC2015-03-26T07:54:35.158+0200:
2281.435: [ParNew: 147283K->12000K(153344K), 45.3558830 secs]
682524K->547241K(1518976K), 45.3559960 secs] [Times: user=90.68 sys=0.00,
real=45.35 secs]

2015-03-26T07:58:58.888+0200: 2545.165: [GC2015-03-26T07:58:58.888+0200:
2545.165: [ParNew: 148320K->8521K(153344K), 50.8855520 secs]
683561K->546682K(1518976K), 50.8856520 secs] [Times: user=101.73 sys=0.00,
real=50.88 secs]

2015-03-26T08:03:25.615+0200: 2811.892: [GC2015-03-26T08:03:25.615+0200:
2811.892: [ParNew: 144841K->6887K(153344K), 56.1553500 secs]
683002K->547305K(1518976K), 56.1554700 secs] [Times: user=112.26 sys=0.00,
real=56.16 secs]

2015-03-26T08:07:59.738+0200: 3086.015: [GC2015-03-26T08:07:59.738+0200:
3086.015: [ParNew: 143207K->5223K(153344K), 11.1719880 secs]
683625K->547498K(1518976K), 11.1720970 secs] [Times: user=22.33 sys=0.00,
real=11.17 secs]

2015-03-26T08:11:41.909+0200: 3308.186: [GC2015-03-26T08:11:41.909+0200:
3308.186: [ParNew: 141543K->5778K(153344K), 0.0121180 secs]
683818K->548443K(1518976K), 0.0121990 secs] [Times: user=0.02 sys=0.00,
real=0.01 secs]

2015-03-26T08:15:05.817+0200: 3512.094: [GC2015-03-26T08:15:05.817+0200:
3512.094: [ParNew: 142098K->10943K(153344K), 0.0228440 secs]
684763K->554166K(1518976K), 0.0229390 secs] [Times: user=0.04 sys=0.00,
real=0.02 secs]

2015-03-26T08:16:50.297+0200: 3616.574: [GC2015-03-26T08:16:50.297+0200:
3616.574: [ParNew: 147263K->16262K(153344K), 0.0269190 secs]
690486K->560618K(1518976K), 0.0270040 secs] [Times: user=0.05 sys=0.01,
real=0.02 secs]

We know the EBS storage is less than ideal, but we've tested a similar
setup with significantly higher simulated load without any issues.

Any suggestions on what to check, test etc?


Best regards,

Timo

Reply via email to