Hi, A related issue which results in a similar increasing GC cycles (but doesn't recover)... We're trying to bootstrap a new broker into the cluster (replacing one of the 3 main ones described previously). It connects fine, starts replicating data and becomes ISR, but starts a similar GC cycle on increasing duration, e.g:
2015-03-30T20:05:26.927+0300: 13.356: [CMS-concurrent-reset-start] 2015-03-30T20:05:26.929+0300: 13.359: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 2015-03-30T20:05:27.768+0300: 14.198: [GC2015-03-30T20:05:27.769+0300: 14.198: [ParNew: 141729K->4172K(153344K), 3.1090220 secs] 657022K->519464K(1518976K), 3.1091410 secs] [Times: user=6.22 sys=0.00, real=3.11 secs] 2015-03-30T20:05:31.749+0300: 18.178: [GC2015-03-30T20:05:31.749+0300: 18.178: [ParNew: 139718K->2145K(153344K), 3.1022510 secs] 655011K->517437K(1518976K), 3.1023670 secs] [Times: user=6.20 sys=0.00, real=3.10 secs] 2015-03-30T20:05:36.411+0300: 22.841: [GC2015-03-30T20:05:36.411+0300: 22.841: [ParNew: 138465K->3230K(153344K), 3.0921260 secs] 653757K->518522K(1518976K), 3.0922300 secs] [Times: user=6.18 sys=0.01, real=3.09 secs] 2015-03-30T20:05:40.392+0300: 26.821: [GC2015-03-30T20:05:40.392+0300: 26.821: [ParNew: 138209K->1170K(153344K), 3.3535970 secs] 653501K->516470K(1518976K), 3.9147070 secs] [Times: user=6.39 sys=0.00, real=3.92 secs] 2015-03-30T20:05:45.251+0300: 31.680: [GC2015-03-30T20:05:45.251+0300: 31.680: [ParNew: 136773K->2394K(153344K), 3.1427200 secs] 652073K->517706K(1518976K), 3.1428120 secs] [Times: user=6.27 sys=0.00, real=3.15 secs] 2015-03-30T20:05:50.294+0300: 36.723: [GC2015-03-30T20:05:51.469+0300: 37.899: [ParNew: 138208K->390K(153344K), 3.1697040 secs] 653519K->515701K(1518976K), 4.3452330 secs] [Times: user=6.29 sys=0.01, real=4.34 secs] 2015-03-30T20:05:55.565+0300: 41.995: [GC2015-03-30T20:05:55.566+0300: 41.995: [ParNew: 136710K->3439K(153344K), 3.1681960 secs] 652021K->518769K(1518976K), 3.1682890 secs] [Times: user=6.32 sys=0.00, real=3.17 secs] 2015-03-30T20:05:59.717+0300: 46.146: [GC2015-03-30T20:05:59.717+0300: 46.146: [ParNew: 138893K->2417K(153344K), 3.1984450 secs] 654224K->517770K(1518976K), 3.1985550 secs] [Times: user=6.40 sys=0.00, real=3.20 secs] 2015-03-30T20:06:03.741+0300: 50.170: [GC2015-03-30T20:06:03.741+0300: 50.170: [ParNew: 138737K->1389K(153344K), 3.2550790 secs] 654090K->516778K(1518976K), 3.2551820 secs] [Times: user=6.51 sys=0.00, real=3.26 secs] 2015-03-30T20:06:08.158+0300: 54.588: [GC2015-03-30T20:06:08.158+0300: 54.588: [ParNew: 137224K->2403K(153344K), 3.3705920 secs] 652613K->517813K(1518976K), 3.3707200 secs] [Times: user=6.59 sys=0.01, real=3.37 secs] .... This goes up to when the realtime of the cycle is 1 min or so and then the server becomes unresponsive. Any suggestions on what cloud be causing this? Best regards, Timo On 26 March 2015 at 19:27, Timo Ahokas <timo.aho...@gmail.com> wrote: > Hi, > > We are seeing really weird GC behaviour on our Kafka cluster. After a > clean stop/start of the whole cluster, the servers go into a loop of > increasingly slow GC cycles for an hour and then seem to recover to a > normal state without any external changes to the servers or traffic. The > load to the cluster is minimal - 10->100 req/s. > > Our setup: > > - Kafka 0.8.1.1 > - 3 brokers running in AWS (t2.medium instances with EBS storage) > - 4GB memory with 1.5GB allocated to the Java heap > - Under 10 topics, with replication factor of 3 & 1-10 partitions per > topic > - Ubuntu 14.04, Oracle JDK 1.7 > > As mentioned, we start up the Kafka cluster and start running very low > load to it (less than 100 messages/second produced, initially in the low > tens). After a couple of minutes the brokers start running GC with each > cycle taking longer (1s -> 60s), which results in the cluster moving > partition leadership around (presumably as the leader becomes unresponsive > during GC). Them nearly an hour later the GC times drop down and end up in > the 10-20 ms range after a few cycles. And the load to the cluster was > actually increasing at this time (but still probably less than 100 req/s). > It doesn't seem to be directly related to an external resource (e.g EBS or > instance problems) as we've seen this multiple times and with newly > provisioned servers also. > > Here's an example of a GC log from one broker: > > Java HotSpot(TM) 64-Bit Server VM (24.76-b04) for linux-amd64 JRE > (1.7.0_76-b13), built on Dec 18 2014 16:58:39 by "java_re" with gcc 4.3.0 > 20080428 (Red Hat 4.3.0-8) > > Memory: 4k page, physical 4046804k(2532400k free), swap 0k(0k free) > > CommandLine flags: -XX:+CMSClassUnloadingEnabled > -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC > -XX:InitialHeapSize=1572864000 -XX:+ManagementServer > -XX:MaxHeapSize=1572864000 -XX:MaxNewSize=174485504 > -XX:MaxTenuringThreshold=6 -XX:NewSize=174485504 -XX:OldPLABSize=16 > -XX:OldSize=348971008 -XX:+PrintGC -XX:+PrintGCDateStamps > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops > -XX:+UseConcMarkSweepGC -XX:+UseParNewGC > > 2015-03-26T07:16:50.119+0200: 16.396: [GC [1 CMS-initial-mark: > 512000K(1365632K)] 593874K(1518976K), 0.0188480 secs] [Times: user=0.02 > sys=0.00, real=0.02 secs] > > 2015-03-26T07:16:50.138+0200: 16.415: [CMS-concurrent-mark-start] > > 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-mark: 0.042/0.044 > secs] [Times: user=0.06 sys=0.01, real=0.04 secs] > > 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-preclean-start] > > 2015-03-26T07:16:50.202+0200: 16.479: [CMS-concurrent-preclean: > 0.019/0.020 secs] [Times: user=0.03 sys=0.01, real=0.02 secs] > > 2015-03-26T07:16:50.202+0200: 16.479: > [CMS-concurrent-abortable-preclean-start] > > CMS: abort preclean due to time 2015-03-26T07:16:55.286+0200: 21.563: > [CMS-concurrent-abortable-preclean: 0.178/5.085 secs] [Times: user=0.75 > sys=0.01, real=5.09 secs] > > 2015-03-26T07:16:55.287+0200: 21.564: [GC[YG occupancy: 106584 K (153344 > K)]2015-03-26T07:16:55.287+0200: 21.564: [GC2015-03-26T07:16:55.287+0200: > 21.564: [ParNew: 106584K->10552K(153344K), 0.0189980 secs] > 618584K->522552K(1518976K), 0.0190710 secs] [Times: user=0.04 sys=0.00, > real=0.02 secs] > > 2015-03-26T07:16:55.306+0200: 21.583: [Rescan (parallel) , 0.0085400 > secs]2015-03-26T07:16:55.314+0200: 21.591: [weak refs processing, 0.0000230 > secs]2015-03-26T07:16:55.314+0200: 21.591: [class unloading, 0.0006920 > secs]2015-03-26T07:16:55.315+0200: 21.592: [scrub symbol table, 0.0011410 > secs]2015-03-26T07:16:55.316+0200: 21.593: [scrub string table, 0.0001530 > secs] [1 CMS-remark: 512000K(1365632K)] 522552K(1518976K), 0.0298860 secs] > [Times: user=0.06 sys=0.00, real=0.03 secs] > > 2015-03-26T07:16:55.318+0200: 21.595: [CMS-concurrent-sweep-start] > > 2015-03-26T07:16:55.322+0200: 21.599: [CMS-concurrent-sweep: 0.004/0.004 > secs] [Times: user=0.01 sys=0.00, real=0.00 secs] > > 2015-03-26T07:16:55.322+0200: 21.600: [CMS-concurrent-reset-start] > > 2015-03-26T07:16:55.329+0200: 21.606: [CMS-concurrent-reset: 0.006/0.007 > secs] [Times: user=0.00 sys=0.01, real=0.01 secs] > > 2015-03-26T07:20:08.027+0200: 214.304: [GC2015-03-26T07:20:08.027+0200: > 214.304: [ParNew: 146872K->2825K(153344K), 2.7813230 secs] > 658872K->518034K(1518976K), 2.7814290 secs] [Times: user=5.58 sys=0.00, > real=2.78 secs] > > 2015-03-26T07:20:56.462+0200: 262.739: [GC2015-03-26T07:20:56.462+0200: > 262.739: [ParNew: 139145K->6899K(153344K), 2.9533990 secs] > 654354K->522108K(1518976K), 2.9534960 secs] [Times: user=5.93 sys=0.00, > real=2.95 secs] > > 2015-03-26T07:21:51.915+0200: 318.192: [GC2015-03-26T07:21:51.915+0200: > 318.192: [ParNew: 143219K->14567K(153344K), 2.9688390 secs] > 658428K->529776K(1518976K), 2.9689670 secs] [Times: user=5.97 sys=0.00, > real=2.97 secs] > > 2015-03-26T07:22:55.621+0200: 381.898: [GC2015-03-26T07:22:55.621+0200: > 381.898: [ParNew: 150887K->14000K(153344K), 4.3638880 secs] > 666096K->530231K(1518976K), 4.3639820 secs] [Times: user=8.76 sys=0.00, > real=4.37 secs] > > 2015-03-26T07:24:02.103+0200: 448.380: [GC2015-03-26T07:24:02.103+0200: > 448.380: [ParNew: 150320K->7374K(153344K), 4.4849670 secs] > 666551K->523605K(1518976K), 4.4850660 secs] [Times: user=9.00 sys=0.00, > real=4.49 secs] > > 2015-03-26T07:25:06.110+0200: 512.388: [GC2015-03-26T07:25:06.111+0200: > 512.388: [ParNew: 143694K->14118K(153344K), 4.4852630 secs] > 659925K->530349K(1518976K), 4.4853500 secs] [Times: user=9.00 sys=0.00, > real=4.48 secs] > > 2015-03-26T07:26:14.406+0200: 580.684: [GC2015-03-26T07:26:14.407+0200: > 580.684: [ParNew: 150438K->16563K(153344K), 11.0235000 secs] > 666669K->537384K(1518976K), 11.0236080 secs] [Times: user=22.12 sys=0.00, > real=11.02 secs] > > 2015-03-26T07:27:59.141+0200: 685.418: [GC2015-03-26T07:27:59.141+0200: > 685.418: [ParNew: 152883K->13870K(153344K), 13.1984820 secs] > 673704K->534691K(1518976K), 13.1985800 secs] [Times: user=26.44 sys=0.00, > real=13.20 secs] > > 2015-03-26T07:29:45.825+0200: 792.102: [GC2015-03-26T07:29:45.825+0200: > 792.102: [ParNew: 150190K->8749K(153344K), 16.4442890 secs] > 671011K->531093K(1518976K), 16.4443990 secs] [Times: user=32.92 sys=0.00, > real=16.44 secs] > > 2015-03-26T07:31:43.393+0200: 909.670: [GC2015-03-26T07:31:43.393+0200: > 909.670: [ParNew: 145069K->16666K(153344K), 17.1249960 secs] > 667413K->539010K(1518976K), 17.1250910 secs] [Times: user=34.28 sys=0.00, > real=17.13 secs] > > 2015-03-26T07:33:48.146+0200: 1034.423: [GC2015-03-26T07:33:48.146+0200: > 1034.423: [ParNew: 152986K->10212K(153344K), 23.7178960 secs] > 675330K->536361K(1518976K), 23.7180010 secs] [Times: user=47.45 sys=0.00, > real=23.71 secs] > > 2015-03-26T07:35:41.187+0200: 1147.464: [GC2015-03-26T07:35:41.187+0200: > 1147.464: [ParNew: 146532K->15576K(153344K), 25.4787270 secs] > 672681K->541725K(1518976K), 25.4788360 secs] [Times: user=50.97 sys=0.00, > real=25.48 secs] > > 2015-03-26T07:37:26.072+0200: 1252.349: [GC2015-03-26T07:37:26.072+0200: > 1252.349: [ParNew: 151896K->9807K(153344K), 30.2121220 secs] > 678045K->539111K(1518976K), 30.2122180 secs] [Times: user=60.42 sys=0.00, > real=30.22 secs] > > 2015-03-26T07:40:14.518+0200: 1420.795: [GC2015-03-26T07:40:14.518+0200: > 1420.795: [ParNew: 146127K->17024K(153344K), 32.0691660 secs] > 675431K->546342K(1518976K), 32.0692640 secs] [Times: user=64.11 sys=0.00, > real=32.07 secs] > > 2015-03-26T07:41:47.096+0200: 1513.373: [GC2015-03-26T07:41:47.096+0200: > 1513.373: [ParNew: 153344K->10479K(153344K), 40.0872490 secs] > 682662K->543861K(1518976K), 40.0873620 secs] [Times: user=80.16 sys=0.00, > real=40.09 secs] > > 2015-03-26T07:43:20.840+0200: 1607.117: [GC2015-03-26T07:43:20.840+0200: > 1607.118: [ParNew: 146799K->17024K(153344K), 41.4909450 secs] > 680181K->550413K(1518976K), 41.4910490 secs] [Times: user=82.97 sys=0.00, > real=41.49 secs] > > 2015-03-26T07:45:38.200+0200: 1744.477: [GC2015-03-26T07:45:38.200+0200: > 1744.477: [ParNew: 153344K->11552K(153344K), 44.5490790 secs] > 686733K->546792K(1518976K), 44.5491900 secs] [Times: user=89.08 sys=0.00, > real=44.55 secs] > > 2015-03-26T07:49:56.546+0200: 2002.823: [GC2015-03-26T07:49:56.546+0200: > 2002.823: [ParNew: 147872K->10963K(153344K), 45.4665010 secs] > 683112K->546204K(1518976K), 45.4666110 secs] [Times: user=90.91 sys=0.00, > real=45.46 secs] > > 2015-03-26T07:54:35.158+0200: 2281.435: [GC2015-03-26T07:54:35.158+0200: > 2281.435: [ParNew: 147283K->12000K(153344K), 45.3558830 secs] > 682524K->547241K(1518976K), 45.3559960 secs] [Times: user=90.68 sys=0.00, > real=45.35 secs] > > 2015-03-26T07:58:58.888+0200: 2545.165: [GC2015-03-26T07:58:58.888+0200: > 2545.165: [ParNew: 148320K->8521K(153344K), 50.8855520 secs] > 683561K->546682K(1518976K), 50.8856520 secs] [Times: user=101.73 sys=0.00, > real=50.88 secs] > > 2015-03-26T08:03:25.615+0200: 2811.892: [GC2015-03-26T08:03:25.615+0200: > 2811.892: [ParNew: 144841K->6887K(153344K), 56.1553500 secs] > 683002K->547305K(1518976K), 56.1554700 secs] [Times: user=112.26 sys=0.00, > real=56.16 secs] > > 2015-03-26T08:07:59.738+0200: 3086.015: [GC2015-03-26T08:07:59.738+0200: > 3086.015: [ParNew: 143207K->5223K(153344K), 11.1719880 secs] > 683625K->547498K(1518976K), 11.1720970 secs] [Times: user=22.33 sys=0.00, > real=11.17 secs] > > 2015-03-26T08:11:41.909+0200: 3308.186: [GC2015-03-26T08:11:41.909+0200: > 3308.186: [ParNew: 141543K->5778K(153344K), 0.0121180 secs] > 683818K->548443K(1518976K), 0.0121990 secs] [Times: user=0.02 sys=0.00, > real=0.01 secs] > > 2015-03-26T08:15:05.817+0200: 3512.094: [GC2015-03-26T08:15:05.817+0200: > 3512.094: [ParNew: 142098K->10943K(153344K), 0.0228440 secs] > 684763K->554166K(1518976K), 0.0229390 secs] [Times: user=0.04 sys=0.00, > real=0.02 secs] > > 2015-03-26T08:16:50.297+0200: 3616.574: [GC2015-03-26T08:16:50.297+0200: > 3616.574: [ParNew: 147263K->16262K(153344K), 0.0269190 secs] > 690486K->560618K(1518976K), 0.0270040 secs] [Times: user=0.05 sys=0.01, > real=0.02 secs] > > We know the EBS storage is less than ideal, but we've tested a similar > setup with significantly higher simulated load without any issues. > > Any suggestions on what to check, test etc? > > > Best regards, > > Timo >