A few more details before the looming crash. The broker replicating data
and in sync (before GC cycle starts) - GC cycles taking nearly 1min real:

2015-03-30T21:08:39.502+0300: 3805.931: [GC2015-03-30T21:08:39.502+0300:
3805.932: [ParNew: 142199K->6219K(153344K), 51.6581120 secs]
678471K->543078K(1518976K), 51.6582350 secs] [Times: user=103.29 sys=0.00,
real=51.66 secs]

2015-03-30T21:10:31.844+0300: 3918.273: [GC2015-03-30T21:10:31.844+0300:
3918.273: [ParNew: 142539K->4773K(153344K), 53.1631370 secs]
679398K->542216K(1518976K), 53.1632500 secs] [Times: user=106.30 sys=0.00,
real=53.17 secs]

2015-03-30T21:12:20.659+0300: 4027.088: [GC2015-03-30T21:12:20.659+0300:
4027.088: [ParNew: 141093K->2869K(153344K), 54.4376270 secs]
678536K->540931K(1518976K), 54.4377530 secs] [Times: user=108.85 sys=0.00,
real=54.44 secs]

2015-03-30T21:14:12.462+0300: 4138.891: [GC2015-03-30T21:14:12.462+0300:
4138.891: [ParNew: 139189K->3812K(153344K), 56.1337310 secs]
677251K->542473K(1518976K), 56.1338620 secs] [Times: user=112.22 sys=0.00,
real=56.14 secs]

2015-03-30T21:16:02.699+0300: 4249.128: [GC2015-03-30T21:16:02.699+0300:
4249.128: [ParNew: 140132K->2016K(153344K), 57.0575330 secs]
678793K->541266K(1518976K), 57.0576620 secs] [Times: user=114.07 sys=0.00,
real=57.06 secs]

Top shows Kafka/Java taking up both cores and all memory is used up (1.5GB
allocated to Kafka out of 4GB on the instance):

top - 21:18:35 up  1:20,  3 users,  load average: 1.28, 1.04, 0.94

Tasks: 130 total,   2 running, 128 sleeping,   0 stopped,   0 zombie

%Cpu(s): 99.7 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,
 0.3 st

KiB Mem:   4046804 total,  3898580 used,   148224 free,    18748 buffers

KiB Swap:        0 total,        0 used,        0 free.  2868556 cached Mem

 PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND


13943 kafka     20   0 4915068 813388  12648 S 199.7 20.1  66:37.21 java


Kafka startup params as follows:

java -Xmx1500M -Xms1500M -server -XX:+UseCompressedOops -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled
-XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC
-Djava.awt.headless=true -Xloggc:/opt/kafka/bin/../logs/kafkaServer-gc.log
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
-XX:+PrintGCTimeStamps -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.port=9095
-Dkafka.logs.dir=/opt/kafka/bin/../logs
-Dlog4j.configuration=file:/opt/kafka/bin/../config/log4j.properties -cp
:/opt/kafka/bin/../core/build/dependant-libs-2.8.0/*.jar:/opt/kafka/bin/../perf/build/libs//kafka-perf_2.8.0*.jar:/opt/kafka/bin/../clients/build/libs//kafka-clients*.jar:/opt/kafka/bin/../examples/build/libs//kafka-examples*.jar:/opt/kafka/bin/../contrib/hadoop-consumer/build/libs//kafka-hadoop-consumer*.jar:/opt/kafka/bin/../contrib/hadoop-producer/build/libs//kafka-hadoop-producer*.jar:/opt/kafka/bin/../libs/jopt-simple-3.2.jar:/opt/kafka/bin/../libs/kafka_2.9.2-0.8.1.1-javadoc.jar:/opt/kafka/bin/../libs/kafka_2.9.2-0.8.1.1-scaladoc.jar:/opt/kafka/bin/../libs/kafka_2.9.2-0.8.1.1-sources.jar:/opt/kafka/bin/../libs/kafka_2.9.2-0.8.1.1.jar:/opt/kafka/bin/../libs/log4j-1.2.15.jar:/opt/kafka/bin/../libs/metrics-core-2.2.0.jar:/opt/kafka/bin/../libs/scala-library-2.9.2.jar:/opt/kafka/bin/../libs/slf4j-api-1.7.2.jar:/opt/kafka/bin/../libs/snappy-java-1.0.5.jar:/opt/kafka/bin/../libs/zkclient-0.3.jar:/opt/kafka/bin/../libs/zookeeper-3.3.4.jar:/opt/kafka/bin/../core/build/libs/kafka_2.8.0*.jar
kafka.Kafka /opt/kafka/config/server.properties


Any suggestions more than welcome...


Best regards,

Timo

On 30 March 2015 at 20:34, Timo Ahokas <timo.aho...@gmail.com> wrote:

> Hi,
>
> A related issue which results in a similar increasing GC cycles (but
> doesn't recover)... We're trying to bootstrap a new broker into the cluster
> (replacing one of the 3 main ones described previously). It connects fine,
> starts replicating data and becomes ISR,  but starts a similar GC cycle on
> increasing duration, e.g:
>
> 2015-03-30T20:05:26.927+0300: 13.356: [CMS-concurrent-reset-start]
>
> 2015-03-30T20:05:26.929+0300: 13.359: [CMS-concurrent-reset: 0.002/0.002
> secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
>
> 2015-03-30T20:05:27.768+0300: 14.198: [GC2015-03-30T20:05:27.769+0300:
> 14.198: [ParNew: 141729K->4172K(153344K), 3.1090220 secs]
> 657022K->519464K(1518976K), 3.1091410 secs] [Times: user=6.22 sys=0.00,
> real=3.11 secs]
>
> 2015-03-30T20:05:31.749+0300: 18.178: [GC2015-03-30T20:05:31.749+0300:
> 18.178: [ParNew: 139718K->2145K(153344K), 3.1022510 secs]
> 655011K->517437K(1518976K), 3.1023670 secs] [Times: user=6.20 sys=0.00,
> real=3.10 secs]
>
> 2015-03-30T20:05:36.411+0300: 22.841: [GC2015-03-30T20:05:36.411+0300:
> 22.841: [ParNew: 138465K->3230K(153344K), 3.0921260 secs]
> 653757K->518522K(1518976K), 3.0922300 secs] [Times: user=6.18 sys=0.01,
> real=3.09 secs]
>
> 2015-03-30T20:05:40.392+0300: 26.821: [GC2015-03-30T20:05:40.392+0300:
> 26.821: [ParNew: 138209K->1170K(153344K), 3.3535970 secs]
> 653501K->516470K(1518976K), 3.9147070 secs] [Times: user=6.39 sys=0.00,
> real=3.92 secs]
>
> 2015-03-30T20:05:45.251+0300: 31.680: [GC2015-03-30T20:05:45.251+0300:
> 31.680: [ParNew: 136773K->2394K(153344K), 3.1427200 secs]
> 652073K->517706K(1518976K), 3.1428120 secs] [Times: user=6.27 sys=0.00,
> real=3.15 secs]
>
> 2015-03-30T20:05:50.294+0300: 36.723: [GC2015-03-30T20:05:51.469+0300:
> 37.899: [ParNew: 138208K->390K(153344K), 3.1697040 secs]
> 653519K->515701K(1518976K), 4.3452330 secs] [Times: user=6.29 sys=0.01,
> real=4.34 secs]
>
> 2015-03-30T20:05:55.565+0300: 41.995: [GC2015-03-30T20:05:55.566+0300:
> 41.995: [ParNew: 136710K->3439K(153344K), 3.1681960 secs]
> 652021K->518769K(1518976K), 3.1682890 secs] [Times: user=6.32 sys=0.00,
> real=3.17 secs]
>
> 2015-03-30T20:05:59.717+0300: 46.146: [GC2015-03-30T20:05:59.717+0300:
> 46.146: [ParNew: 138893K->2417K(153344K), 3.1984450 secs]
> 654224K->517770K(1518976K), 3.1985550 secs] [Times: user=6.40 sys=0.00,
> real=3.20 secs]
>
> 2015-03-30T20:06:03.741+0300: 50.170: [GC2015-03-30T20:06:03.741+0300:
> 50.170: [ParNew: 138737K->1389K(153344K), 3.2550790 secs]
> 654090K->516778K(1518976K), 3.2551820 secs] [Times: user=6.51 sys=0.00,
> real=3.26 secs]
>
> 2015-03-30T20:06:08.158+0300: 54.588: [GC2015-03-30T20:06:08.158+0300:
> 54.588: [ParNew: 137224K->2403K(153344K), 3.3705920 secs]
> 652613K->517813K(1518976K), 3.3707200 secs] [Times: user=6.59 sys=0.01,
> real=3.37 secs]
>
> ....
>
>
> This goes up to when the realtime of the cycle is 1 min or so and then the
> server becomes unresponsive.
>
> Any suggestions on what cloud be causing this?
>
>
> Best regards,
>
> Timo
>
> On 26 March 2015 at 19:27, Timo Ahokas <timo.aho...@gmail.com> wrote:
>
>> Hi,
>>
>> We are seeing really weird GC behaviour on our Kafka cluster. After a
>> clean stop/start of the whole cluster, the servers go into a loop of
>> increasingly slow GC cycles for an hour and then seem to recover to a
>> normal state without any external changes to the servers or traffic. The
>> load to the cluster is minimal - 10->100 req/s.
>>
>> Our setup:
>>
>> - Kafka 0.8.1.1
>> - 3 brokers running in AWS (t2.medium instances with EBS storage)
>> - 4GB memory with 1.5GB allocated to the Java heap
>> - Under 10 topics, with replication factor of 3  & 1-10 partitions per
>> topic
>> - Ubuntu 14.04, Oracle JDK 1.7
>>
>> As mentioned, we start up the Kafka cluster and start running very low
>> load to it (less than 100 messages/second produced, initially in the low
>> tens). After a couple of minutes the brokers start running GC with each
>> cycle taking longer (1s -> 60s), which results in the cluster moving
>> partition leadership around (presumably as the leader becomes unresponsive
>> during GC). Them nearly an hour later the GC times drop down and end up in
>> the 10-20 ms range after a few cycles. And the load to the cluster was
>> actually increasing at this time (but still probably less than 100 req/s).
>> It doesn't seem to be directly related to an external resource (e.g EBS or
>> instance problems) as we've seen this multiple times and with newly
>> provisioned servers also.
>>
>> Here's an example of a GC log from one broker:
>>
>> Java HotSpot(TM) 64-Bit Server VM (24.76-b04) for linux-amd64 JRE
>> (1.7.0_76-b13), built on Dec 18 2014 16:58:39 by "java_re" with gcc 4.3.0
>> 20080428 (Red Hat 4.3.0-8)
>>
>> Memory: 4k page, physical 4046804k(2532400k free), swap 0k(0k free)
>>
>> CommandLine flags: -XX:+CMSClassUnloadingEnabled
>> -XX:+CMSScavengeBeforeRemark -XX:+DisableExplicitGC
>> -XX:InitialHeapSize=1572864000 -XX:+ManagementServer
>> -XX:MaxHeapSize=1572864000 -XX:MaxNewSize=174485504
>> -XX:MaxTenuringThreshold=6 -XX:NewSize=174485504 -XX:OldPLABSize=16
>> -XX:OldSize=348971008 -XX:+PrintGC -XX:+PrintGCDateStamps
>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseCompressedOops
>> -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
>>
>> 2015-03-26T07:16:50.119+0200: 16.396: [GC [1 CMS-initial-mark:
>> 512000K(1365632K)] 593874K(1518976K), 0.0188480 secs] [Times: user=0.02
>> sys=0.00, real=0.02 secs]
>>
>> 2015-03-26T07:16:50.138+0200: 16.415: [CMS-concurrent-mark-start]
>>
>> 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-mark: 0.042/0.044
>> secs] [Times: user=0.06 sys=0.01, real=0.04 secs]
>>
>> 2015-03-26T07:16:50.182+0200: 16.459: [CMS-concurrent-preclean-start]
>>
>> 2015-03-26T07:16:50.202+0200: 16.479: [CMS-concurrent-preclean:
>> 0.019/0.020 secs] [Times: user=0.03 sys=0.01, real=0.02 secs]
>>
>> 2015-03-26T07:16:50.202+0200: 16.479:
>> [CMS-concurrent-abortable-preclean-start]
>>
>>  CMS: abort preclean due to time 2015-03-26T07:16:55.286+0200: 21.563:
>> [CMS-concurrent-abortable-preclean: 0.178/5.085 secs] [Times: user=0.75
>> sys=0.01, real=5.09 secs]
>>
>> 2015-03-26T07:16:55.287+0200: 21.564: [GC[YG occupancy: 106584 K (153344
>> K)]2015-03-26T07:16:55.287+0200: 21.564: [GC2015-03-26T07:16:55.287+0200:
>> 21.564: [ParNew: 106584K->10552K(153344K), 0.0189980 secs]
>> 618584K->522552K(1518976K), 0.0190710 secs] [Times: user=0.04 sys=0.00,
>> real=0.02 secs]
>>
>> 2015-03-26T07:16:55.306+0200: 21.583: [Rescan (parallel) , 0.0085400
>> secs]2015-03-26T07:16:55.314+0200: 21.591: [weak refs processing, 0.0000230
>> secs]2015-03-26T07:16:55.314+0200: 21.591: [class unloading, 0.0006920
>> secs]2015-03-26T07:16:55.315+0200: 21.592: [scrub symbol table, 0.0011410
>> secs]2015-03-26T07:16:55.316+0200: 21.593: [scrub string table, 0.0001530
>> secs] [1 CMS-remark: 512000K(1365632K)] 522552K(1518976K), 0.0298860 secs]
>> [Times: user=0.06 sys=0.00, real=0.03 secs]
>>
>> 2015-03-26T07:16:55.318+0200: 21.595: [CMS-concurrent-sweep-start]
>>
>> 2015-03-26T07:16:55.322+0200: 21.599: [CMS-concurrent-sweep: 0.004/0.004
>> secs] [Times: user=0.01 sys=0.00, real=0.00 secs]
>>
>> 2015-03-26T07:16:55.322+0200: 21.600: [CMS-concurrent-reset-start]
>>
>> 2015-03-26T07:16:55.329+0200: 21.606: [CMS-concurrent-reset: 0.006/0.007
>> secs] [Times: user=0.00 sys=0.01, real=0.01 secs]
>>
>> 2015-03-26T07:20:08.027+0200: 214.304: [GC2015-03-26T07:20:08.027+0200:
>> 214.304: [ParNew: 146872K->2825K(153344K), 2.7813230 secs]
>> 658872K->518034K(1518976K), 2.7814290 secs] [Times: user=5.58 sys=0.00,
>> real=2.78 secs]
>>
>> 2015-03-26T07:20:56.462+0200: 262.739: [GC2015-03-26T07:20:56.462+0200:
>> 262.739: [ParNew: 139145K->6899K(153344K), 2.9533990 secs]
>> 654354K->522108K(1518976K), 2.9534960 secs] [Times: user=5.93 sys=0.00,
>> real=2.95 secs]
>>
>> 2015-03-26T07:21:51.915+0200: 318.192: [GC2015-03-26T07:21:51.915+0200:
>> 318.192: [ParNew: 143219K->14567K(153344K), 2.9688390 secs]
>> 658428K->529776K(1518976K), 2.9689670 secs] [Times: user=5.97 sys=0.00,
>> real=2.97 secs]
>>
>> 2015-03-26T07:22:55.621+0200: 381.898: [GC2015-03-26T07:22:55.621+0200:
>> 381.898: [ParNew: 150887K->14000K(153344K), 4.3638880 secs]
>> 666096K->530231K(1518976K), 4.3639820 secs] [Times: user=8.76 sys=0.00,
>> real=4.37 secs]
>>
>> 2015-03-26T07:24:02.103+0200: 448.380: [GC2015-03-26T07:24:02.103+0200:
>> 448.380: [ParNew: 150320K->7374K(153344K), 4.4849670 secs]
>> 666551K->523605K(1518976K), 4.4850660 secs] [Times: user=9.00 sys=0.00,
>> real=4.49 secs]
>>
>> 2015-03-26T07:25:06.110+0200: 512.388: [GC2015-03-26T07:25:06.111+0200:
>> 512.388: [ParNew: 143694K->14118K(153344K), 4.4852630 secs]
>> 659925K->530349K(1518976K), 4.4853500 secs] [Times: user=9.00 sys=0.00,
>> real=4.48 secs]
>>
>> 2015-03-26T07:26:14.406+0200: 580.684: [GC2015-03-26T07:26:14.407+0200:
>> 580.684: [ParNew: 150438K->16563K(153344K), 11.0235000 secs]
>> 666669K->537384K(1518976K), 11.0236080 secs] [Times: user=22.12 sys=0.00,
>> real=11.02 secs]
>>
>> 2015-03-26T07:27:59.141+0200: 685.418: [GC2015-03-26T07:27:59.141+0200:
>> 685.418: [ParNew: 152883K->13870K(153344K), 13.1984820 secs]
>> 673704K->534691K(1518976K), 13.1985800 secs] [Times: user=26.44 sys=0.00,
>> real=13.20 secs]
>>
>> 2015-03-26T07:29:45.825+0200: 792.102: [GC2015-03-26T07:29:45.825+0200:
>> 792.102: [ParNew: 150190K->8749K(153344K), 16.4442890 secs]
>> 671011K->531093K(1518976K), 16.4443990 secs] [Times: user=32.92 sys=0.00,
>> real=16.44 secs]
>>
>> 2015-03-26T07:31:43.393+0200: 909.670: [GC2015-03-26T07:31:43.393+0200:
>> 909.670: [ParNew: 145069K->16666K(153344K), 17.1249960 secs]
>> 667413K->539010K(1518976K), 17.1250910 secs] [Times: user=34.28 sys=0.00,
>> real=17.13 secs]
>>
>> 2015-03-26T07:33:48.146+0200: 1034.423: [GC2015-03-26T07:33:48.146+0200:
>> 1034.423: [ParNew: 152986K->10212K(153344K), 23.7178960 secs]
>> 675330K->536361K(1518976K), 23.7180010 secs] [Times: user=47.45 sys=0.00,
>> real=23.71 secs]
>>
>> 2015-03-26T07:35:41.187+0200: 1147.464: [GC2015-03-26T07:35:41.187+0200:
>> 1147.464: [ParNew: 146532K->15576K(153344K), 25.4787270 secs]
>> 672681K->541725K(1518976K), 25.4788360 secs] [Times: user=50.97 sys=0.00,
>> real=25.48 secs]
>>
>> 2015-03-26T07:37:26.072+0200: 1252.349: [GC2015-03-26T07:37:26.072+0200:
>> 1252.349: [ParNew: 151896K->9807K(153344K), 30.2121220 secs]
>> 678045K->539111K(1518976K), 30.2122180 secs] [Times: user=60.42 sys=0.00,
>> real=30.22 secs]
>>
>> 2015-03-26T07:40:14.518+0200: 1420.795: [GC2015-03-26T07:40:14.518+0200:
>> 1420.795: [ParNew: 146127K->17024K(153344K), 32.0691660 secs]
>> 675431K->546342K(1518976K), 32.0692640 secs] [Times: user=64.11 sys=0.00,
>> real=32.07 secs]
>>
>> 2015-03-26T07:41:47.096+0200: 1513.373: [GC2015-03-26T07:41:47.096+0200:
>> 1513.373: [ParNew: 153344K->10479K(153344K), 40.0872490 secs]
>> 682662K->543861K(1518976K), 40.0873620 secs] [Times: user=80.16 sys=0.00,
>> real=40.09 secs]
>>
>> 2015-03-26T07:43:20.840+0200: 1607.117: [GC2015-03-26T07:43:20.840+0200:
>> 1607.118: [ParNew: 146799K->17024K(153344K), 41.4909450 secs]
>> 680181K->550413K(1518976K), 41.4910490 secs] [Times: user=82.97 sys=0.00,
>> real=41.49 secs]
>>
>> 2015-03-26T07:45:38.200+0200: 1744.477: [GC2015-03-26T07:45:38.200+0200:
>> 1744.477: [ParNew: 153344K->11552K(153344K), 44.5490790 secs]
>> 686733K->546792K(1518976K), 44.5491900 secs] [Times: user=89.08 sys=0.00,
>> real=44.55 secs]
>>
>> 2015-03-26T07:49:56.546+0200: 2002.823: [GC2015-03-26T07:49:56.546+0200:
>> 2002.823: [ParNew: 147872K->10963K(153344K), 45.4665010 secs]
>> 683112K->546204K(1518976K), 45.4666110 secs] [Times: user=90.91 sys=0.00,
>> real=45.46 secs]
>>
>> 2015-03-26T07:54:35.158+0200: 2281.435: [GC2015-03-26T07:54:35.158+0200:
>> 2281.435: [ParNew: 147283K->12000K(153344K), 45.3558830 secs]
>> 682524K->547241K(1518976K), 45.3559960 secs] [Times: user=90.68 sys=0.00,
>> real=45.35 secs]
>>
>> 2015-03-26T07:58:58.888+0200: 2545.165: [GC2015-03-26T07:58:58.888+0200:
>> 2545.165: [ParNew: 148320K->8521K(153344K), 50.8855520 secs]
>> 683561K->546682K(1518976K), 50.8856520 secs] [Times: user=101.73 sys=0.00,
>> real=50.88 secs]
>>
>> 2015-03-26T08:03:25.615+0200: 2811.892: [GC2015-03-26T08:03:25.615+0200:
>> 2811.892: [ParNew: 144841K->6887K(153344K), 56.1553500 secs]
>> 683002K->547305K(1518976K), 56.1554700 secs] [Times: user=112.26 sys=0.00,
>> real=56.16 secs]
>>
>> 2015-03-26T08:07:59.738+0200: 3086.015: [GC2015-03-26T08:07:59.738+0200:
>> 3086.015: [ParNew: 143207K->5223K(153344K), 11.1719880 secs]
>> 683625K->547498K(1518976K), 11.1720970 secs] [Times: user=22.33 sys=0.00,
>> real=11.17 secs]
>>
>> 2015-03-26T08:11:41.909+0200: 3308.186: [GC2015-03-26T08:11:41.909+0200:
>> 3308.186: [ParNew: 141543K->5778K(153344K), 0.0121180 secs]
>> 683818K->548443K(1518976K), 0.0121990 secs] [Times: user=0.02 sys=0.00,
>> real=0.01 secs]
>>
>> 2015-03-26T08:15:05.817+0200: 3512.094: [GC2015-03-26T08:15:05.817+0200:
>> 3512.094: [ParNew: 142098K->10943K(153344K), 0.0228440 secs]
>> 684763K->554166K(1518976K), 0.0229390 secs] [Times: user=0.04 sys=0.00,
>> real=0.02 secs]
>>
>> 2015-03-26T08:16:50.297+0200: 3616.574: [GC2015-03-26T08:16:50.297+0200:
>> 3616.574: [ParNew: 147263K->16262K(153344K), 0.0269190 secs]
>> 690486K->560618K(1518976K), 0.0270040 secs] [Times: user=0.05 sys=0.01,
>> real=0.02 secs]
>>
>> We know the EBS storage is less than ideal, but we've tested a similar
>> setup with significantly higher simulated load without any issues.
>>
>> Any suggestions on what to check, test etc?
>>
>>
>> Best regards,
>>
>> Timo
>>
>
>

Reply via email to