Hi,

A related issue which results in a similar increasing GC cycles (but
doesn't recover)... We're trying to bootstrap a new broker into the cluster
(replacing one of the 3 main ones described previously). It connects fine,
starts replicating data and becomes ISR,  but starts a similar GC cycle on
increasing duration, e.g:

2015-03-30T20:05:26.927+0300: 13.356: [CMS-concurrent-reset-start]

2015-03-30T20:05:26.929+0300: 13.359: [CMS-concurrent-reset: 0.002/0.002
secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

2015-03-30T20:05:27.768+0300: 14.198: [GC2015-03-30T20:05:27.769+0300:
14.198: [ParNew: 141729K->4172K(153344K), 3.1090220 secs]
657022K->519464K(1518976K), 3.1091410 secs] [Times: user=6.22 sys=0.00,
real=3.11 secs]

2015-03-30T20:05:31.749+0300: 18.178: [GC2015-03-30T20:05:31.749+0300:
18.178: [ParNew: 139718K->2145K(153344K), 3.1022510 secs]
655011K->517437K(1518976K), 3.1023670 secs] [Times: user=6.20 sys=0.00,
real=3.10 secs]

2015-03-30T20:05:36.411+0300: 22.841: [GC2015-03-30T20:05:36.411+0300:
22.841: [ParNew: 138465K->3230K(153344K), 3.0921260 secs]
653757K->518522K(1518976K), 3.0922300 secs] [Times: user=6.18 sys=0.01,
real=3.09 secs]

2015-03-30T20:05:40.392+0300: 26.821: [GC2015-03-30T20:05:40.392+0300:
26.821: [ParNew: 138209K->1170K(153344K), 3.3535970 secs]
653501K->516470K(1518976K), 3.9147070 secs] [Times: user=6.39 sys=0.00,
real=3.92 secs]

2015-03-30T20:05:45.251+0300: 31.680: [GC2015-03-30T20:05:45.251+0300:
31.680: [ParNew: 136773K->2394K(153344K), 3.1427200 secs]
652073K->517706K(1518976K), 3.1428120 secs] [Times: user=6.27 sys=0.00,
real=3.15 secs]

2015-03-30T20:05:50.294+0300: 36.723: [GC2015-03-30T20:05:51.469+0300:
37.899: [ParNew: 138208K->390K(153344K), 3.1697040 secs]
653519K->515701K(1518976K), 4.3452330 secs] [Times: user=6.29 sys=0.01,
real=4.34 secs]

2015-03-30T20:05:55.565+0300: 41.995: [GC2015-03-30T20:05:55.566+0300:
41.995: [ParNew: 136710K->3439K(153344K), 3.1681960 secs]
652021K->518769K(1518976K), 3.1682890 secs] [Times: user=6.32 sys=0.00,
real=3.17 secs]

2015-03-30T20:05:59.717+0300: 46.146: [GC2015-03-30T20:05:59.717+0300:
46.146: [ParNew: 138893K->2417K(153344K), 3.1984450 secs]
654224K->517770K(1518976K), 3.1985550 secs] [Times: user=6.40 sys=0.00,
real=3.20 secs]

2015-03-30T20:06:03.741+0300: 50.170: [GC2015-03-30T20:06:03.741+0300:
50.170: [ParNew: 138737K->1389K(153344K), 3.2550790 secs]
654090K->516778K(1518976K), 3.2551820 secs] [Times: user=6.51 sys=0.00,
real=3.26 secs]

2015-03-30T20:06:08.158+0300: 54.588: [GC2015-03-30T20:06:08.158+0300:
54.588: [ParNew: 137224K->2403K(153344K), 3.3705920 secs]
652613K->517813K(1518976K), 3.3707200 secs] [Times: user=6.59 sys=0.01,
real=3.37 secs]

....


This goes up to when the realtime of the cycle is 1 min or so and then the
server becomes unresponsive.

Any suggestions on what cloud be causing this?


Best regards,

Timo

On 26 March 2015 at 19:27, Timo Ahokas <timo.aho...@gmail.com> wrote:

> Hi,
>
> We are seeing really weird GC behaviour on our Kafka cluster. After a
> clean stop/start of the whole cluster, the servers go into a loop of
> increasingly slow GC cycles for an hour and then seem to recover to a
> normal state without any external changes to the servers or traffic. The
> load to the cluster is minimal - 10->100 req/s.
>
> Our setup:
>
> - Kafka 0.8.1.1
> - 3 brokers running in AWS (t2.medium instances with EBS storage)
> - 4GB memory with 1.5GB allocated to the Java heap
> - Under 10 topics, with replication factor of 3  & 1-10 partitions per
> topic
> - Ubuntu 14.04, Oracle JDK 1.7
>
> As mentioned, we start up the Kafka cluster and start running very low
> load to it (less than 100 messages/second produced, initially in the low
> tens). After a couple of minutes the brokers start running GC with each
> cycle taking longer (1s -> 60s), which results in the cluster moving
> partition leadership around (presumably as the leader becomes unresponsive
> during GC). Them nearly an hour later the GC times drop down and end up in
> the 10-20 ms range after a few cycles. And the load to the cluster was
> actually increasing at this time (but still probably less than 100 req/s).
> It doesn't seem to be directly related to an external resource (e.g EBS or
> instance problems) as we've seen this multiple times and with newly
> provisioned servers also.
>
> Here's an example of a GC log from one broker:
>
> Java HotSpot(TM) 64-Bit Server VM (24.76-b04) for linux-amd64 JRE
> (1.7.0_76-b13), built on Dec 18 2014 16:58:39 by "java_re" with gcc 4.3.0
> 20080428 (Red Hat 4.3.0-8)
>
> Memory: 4k page, physical 4046804k(2532400k free), swap 0k(0k free)
>
> CommandLine flags: -XX:+CMSClassUnloadingEnabled
> -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC
> -XX:InitialHeapSize=1572864000 -XX:+ManagementServer
> -XX:MaxHeapSize=1572864000 -XX:MaxNewSize=174485504
> -XX:MaxTenuringThreshold=6 -XX:NewSize=174485504 -XX:OldPLABSize=16
> -XX:OldSize=348971008 -XX:+PrintGC -XX:+PrintGCDateStamps
> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops
> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
>
> 2015-03-26T07:16:50.119+0200: 16.396: [GC [1 CMS-initial-mark:
> 512000K(1365632K)] 593874K(1518976K), 0.0188480 secs] [Times: user=0.02
> sys=0.00, real=0.02 secs]
>
> 2015-03-26T07:16:50.138+0200: 16.415: [CMS-concurrent-mark-start]
>
> 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-mark: 0.042/0.044
> secs] [Times: user=0.06 sys=0.01, real=0.04 secs]
>
> 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-preclean-start]
>
> 2015-03-26T07:16:50.202+0200: 16.479: [CMS-concurrent-preclean:
> 0.019/0.020 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
>
> 2015-03-26T07:16:50.202+0200: 16.479:
> [CMS-concurrent-abortable-preclean-start]
>
>  CMS: abort preclean due to time 2015-03-26T07:16:55.286+0200: 21.563:
> [CMS-concurrent-abortable-preclean: 0.178/5.085 secs] [Times: user=0.75
> sys=0.01, real=5.09 secs]
>
> 2015-03-26T07:16:55.287+0200: 21.564: [GC[YG occupancy: 106584 K (153344
> K)]2015-03-26T07:16:55.287+0200: 21.564: [GC2015-03-26T07:16:55.287+0200:
> 21.564: [ParNew: 106584K->10552K(153344K), 0.0189980 secs]
> 618584K->522552K(1518976K), 0.0190710 secs] [Times: user=0.04 sys=0.00,
> real=0.02 secs]
>
> 2015-03-26T07:16:55.306+0200: 21.583: [Rescan (parallel) , 0.0085400
> secs]2015-03-26T07:16:55.314+0200: 21.591: [weak refs processing, 0.0000230
> secs]2015-03-26T07:16:55.314+0200: 21.591: [class unloading, 0.0006920
> secs]2015-03-26T07:16:55.315+0200: 21.592: [scrub symbol table, 0.0011410
> secs]2015-03-26T07:16:55.316+0200: 21.593: [scrub string table, 0.0001530
> secs] [1 CMS-remark: 512000K(1365632K)] 522552K(1518976K), 0.0298860 secs]
> [Times: user=0.06 sys=0.00, real=0.03 secs]
>
> 2015-03-26T07:16:55.318+0200: 21.595: [CMS-concurrent-sweep-start]
>
> 2015-03-26T07:16:55.322+0200: 21.599: [CMS-concurrent-sweep: 0.004/0.004
> secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
>
> 2015-03-26T07:16:55.322+0200: 21.600: [CMS-concurrent-reset-start]
>
> 2015-03-26T07:16:55.329+0200: 21.606: [CMS-concurrent-reset: 0.006/0.007
> secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
>
> 2015-03-26T07:20:08.027+0200: 214.304: [GC2015-03-26T07:20:08.027+0200:
> 214.304: [ParNew: 146872K->2825K(153344K), 2.7813230 secs]
> 658872K->518034K(1518976K), 2.7814290 secs] [Times: user=5.58 sys=0.00,
> real=2.78 secs]
>
> 2015-03-26T07:20:56.462+0200: 262.739: [GC2015-03-26T07:20:56.462+0200:
> 262.739: [ParNew: 139145K->6899K(153344K), 2.9533990 secs]
> 654354K->522108K(1518976K), 2.9534960 secs] [Times: user=5.93 sys=0.00,
> real=2.95 secs]
>
> 2015-03-26T07:21:51.915+0200: 318.192: [GC2015-03-26T07:21:51.915+0200:
> 318.192: [ParNew: 143219K->14567K(153344K), 2.9688390 secs]
> 658428K->529776K(1518976K), 2.9689670 secs] [Times: user=5.97 sys=0.00,
> real=2.97 secs]
>
> 2015-03-26T07:22:55.621+0200: 381.898: [GC2015-03-26T07:22:55.621+0200:
> 381.898: [ParNew: 150887K->14000K(153344K), 4.3638880 secs]
> 666096K->530231K(1518976K), 4.3639820 secs] [Times: user=8.76 sys=0.00,
> real=4.37 secs]
>
> 2015-03-26T07:24:02.103+0200: 448.380: [GC2015-03-26T07:24:02.103+0200:
> 448.380: [ParNew: 150320K->7374K(153344K), 4.4849670 secs]
> 666551K->523605K(1518976K), 4.4850660 secs] [Times: user=9.00 sys=0.00,
> real=4.49 secs]
>
> 2015-03-26T07:25:06.110+0200: 512.388: [GC2015-03-26T07:25:06.111+0200:
> 512.388: [ParNew: 143694K->14118K(153344K), 4.4852630 secs]
> 659925K->530349K(1518976K), 4.4853500 secs] [Times: user=9.00 sys=0.00,
> real=4.48 secs]
>
> 2015-03-26T07:26:14.406+0200: 580.684: [GC2015-03-26T07:26:14.407+0200:
> 580.684: [ParNew: 150438K->16563K(153344K), 11.0235000 secs]
> 666669K->537384K(1518976K), 11.0236080 secs] [Times: user=22.12 sys=0.00,
> real=11.02 secs]
>
> 2015-03-26T07:27:59.141+0200: 685.418: [GC2015-03-26T07:27:59.141+0200:
> 685.418: [ParNew: 152883K->13870K(153344K), 13.1984820 secs]
> 673704K->534691K(1518976K), 13.1985800 secs] [Times: user=26.44 sys=0.00,
> real=13.20 secs]
>
> 2015-03-26T07:29:45.825+0200: 792.102: [GC2015-03-26T07:29:45.825+0200:
> 792.102: [ParNew: 150190K->8749K(153344K), 16.4442890 secs]
> 671011K->531093K(1518976K), 16.4443990 secs] [Times: user=32.92 sys=0.00,
> real=16.44 secs]
>
> 2015-03-26T07:31:43.393+0200: 909.670: [GC2015-03-26T07:31:43.393+0200:
> 909.670: [ParNew: 145069K->16666K(153344K), 17.1249960 secs]
> 667413K->539010K(1518976K), 17.1250910 secs] [Times: user=34.28 sys=0.00,
> real=17.13 secs]
>
> 2015-03-26T07:33:48.146+0200: 1034.423: [GC2015-03-26T07:33:48.146+0200:
> 1034.423: [ParNew: 152986K->10212K(153344K), 23.7178960 secs]
> 675330K->536361K(1518976K), 23.7180010 secs] [Times: user=47.45 sys=0.00,
> real=23.71 secs]
>
> 2015-03-26T07:35:41.187+0200: 1147.464: [GC2015-03-26T07:35:41.187+0200:
> 1147.464: [ParNew: 146532K->15576K(153344K), 25.4787270 secs]
> 672681K->541725K(1518976K), 25.4788360 secs] [Times: user=50.97 sys=0.00,
> real=25.48 secs]
>
> 2015-03-26T07:37:26.072+0200: 1252.349: [GC2015-03-26T07:37:26.072+0200:
> 1252.349: [ParNew: 151896K->9807K(153344K), 30.2121220 secs]
> 678045K->539111K(1518976K), 30.2122180 secs] [Times: user=60.42 sys=0.00,
> real=30.22 secs]
>
> 2015-03-26T07:40:14.518+0200: 1420.795: [GC2015-03-26T07:40:14.518+0200:
> 1420.795: [ParNew: 146127K->17024K(153344K), 32.0691660 secs]
> 675431K->546342K(1518976K), 32.0692640 secs] [Times: user=64.11 sys=0.00,
> real=32.07 secs]
>
> 2015-03-26T07:41:47.096+0200: 1513.373: [GC2015-03-26T07:41:47.096+0200:
> 1513.373: [ParNew: 153344K->10479K(153344K), 40.0872490 secs]
> 682662K->543861K(1518976K), 40.0873620 secs] [Times: user=80.16 sys=0.00,
> real=40.09 secs]
>
> 2015-03-26T07:43:20.840+0200: 1607.117: [GC2015-03-26T07:43:20.840+0200:
> 1607.118: [ParNew: 146799K->17024K(153344K), 41.4909450 secs]
> 680181K->550413K(1518976K), 41.4910490 secs] [Times: user=82.97 sys=0.00,
> real=41.49 secs]
>
> 2015-03-26T07:45:38.200+0200: 1744.477: [GC2015-03-26T07:45:38.200+0200:
> 1744.477: [ParNew: 153344K->11552K(153344K), 44.5490790 secs]
> 686733K->546792K(1518976K), 44.5491900 secs] [Times: user=89.08 sys=0.00,
> real=44.55 secs]
>
> 2015-03-26T07:49:56.546+0200: 2002.823: [GC2015-03-26T07:49:56.546+0200:
> 2002.823: [ParNew: 147872K->10963K(153344K), 45.4665010 secs]
> 683112K->546204K(1518976K), 45.4666110 secs] [Times: user=90.91 sys=0.00,
> real=45.46 secs]
>
> 2015-03-26T07:54:35.158+0200: 2281.435: [GC2015-03-26T07:54:35.158+0200:
> 2281.435: [ParNew: 147283K->12000K(153344K), 45.3558830 secs]
> 682524K->547241K(1518976K), 45.3559960 secs] [Times: user=90.68 sys=0.00,
> real=45.35 secs]
>
> 2015-03-26T07:58:58.888+0200: 2545.165: [GC2015-03-26T07:58:58.888+0200:
> 2545.165: [ParNew: 148320K->8521K(153344K), 50.8855520 secs]
> 683561K->546682K(1518976K), 50.8856520 secs] [Times: user=101.73 sys=0.00,
> real=50.88 secs]
>
> 2015-03-26T08:03:25.615+0200: 2811.892: [GC2015-03-26T08:03:25.615+0200:
> 2811.892: [ParNew: 144841K->6887K(153344K), 56.1553500 secs]
> 683002K->547305K(1518976K), 56.1554700 secs] [Times: user=112.26 sys=0.00,
> real=56.16 secs]
>
> 2015-03-26T08:07:59.738+0200: 3086.015: [GC2015-03-26T08:07:59.738+0200:
> 3086.015: [ParNew: 143207K->5223K(153344K), 11.1719880 secs]
> 683625K->547498K(1518976K), 11.1720970 secs] [Times: user=22.33 sys=0.00,
> real=11.17 secs]
>
> 2015-03-26T08:11:41.909+0200: 3308.186: [GC2015-03-26T08:11:41.909+0200:
> 3308.186: [ParNew: 141543K->5778K(153344K), 0.0121180 secs]
> 683818K->548443K(1518976K), 0.0121990 secs] [Times: user=0.02 sys=0.00,
> real=0.01 secs]
>
> 2015-03-26T08:15:05.817+0200: 3512.094: [GC2015-03-26T08:15:05.817+0200:
> 3512.094: [ParNew: 142098K->10943K(153344K), 0.0228440 secs]
> 684763K->554166K(1518976K), 0.0229390 secs] [Times: user=0.04 sys=0.00,
> real=0.02 secs]
>
> 2015-03-26T08:16:50.297+0200: 3616.574: [GC2015-03-26T08:16:50.297+0200:
> 3616.574: [ParNew: 147263K->16262K(153344K), 0.0269190 secs]
> 690486K->560618K(1518976K), 0.0270040 secs] [Times: user=0.05 sys=0.01,
> real=0.02 secs]
>
> We know the EBS storage is less than ideal, but we've tested a similar
> setup with significantly higher simulated load without any issues.
>
> Any suggestions on what to check, test etc?
>
>
> Best regards,
>
> Timo
>

Reply via email to