[ https://issues.apache.org/jira/browse/KAFKA-4464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706283#comment-15706283 ]
Xavier Lange commented on KAFKA-4464:
-------------------------------------

Here is my kafka broker config:
{code}
kafka@86a156fd9dda:~$ cat /kafka/config/server.properties
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# see kafka.server.KafkaConfig for additional details and defaults

############################# Server Basics #############################

# The id of the broker. This must be set to a unique integer for each broker.
broker.id=1
auto.leader.rebalance.enable=true

# Replication
auto.create.topics.enable=true
default.replication.factor=2

# Hostname the broker will advertise to consumers. If not set, kafka will use the value returned
# from InetAddress.getLocalHost(). If there are multiple interfaces getLocalHost
# may not be what you want.
advertised.host.name=10.60.68.122

# Enable topic deletion
delete.topic.enable=true

############################# Socket Server Settings #############################

# The port the socket server listens on
port=9092
advertised.port=9092

num.io.threads=8
num.network.threads=8

socket.request.max.bytes=104857600
socket.receive.buffer.bytes=1048576
socket.send.buffer.bytes=1048576

queued.max.requests=16
fetch.purgatory.purge.interval.requests=100
producer.purgatory.purge.interval.requests=100

############################# Replication Settings #############################

num.replica.fetchers=4

############################# Log Basics #############################

# The directory under which to store log files
log.dir=/data
log.dirs=/data

# The number of logical partitions per topic per server. More partitions allow greater parallelism
# for consumption, but also mean more files.
num.partitions=20
num.network.threads=20

############################# Log Retention Policy #############################

# The following configurations control the disposal of log segments. The policy can
# be set to delete segments after a period of time, or after a given size has accumulated.
# A segment will be deleted whenever *either* of these criteria are met. Deletion always happens
# from the end of the log.

# The minimum age of a log file to be eligible for deletion
# log.retention.hours=168
# 10 years
log.retention.hours=87600

############################# Zookeeper #############################

# Zk connection string (see zk docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
# You can also append an optional chroot string to the urls to specify the
# root directory for all kafka znodes.
zookeeper.connect=itsecmon-zk1.usw1.viasat.cloud:2181,itsecmon-zk2.usw1.viasat.cloud:2181,itsecmon-zk3.usw1.viasat.cloud:2181,itsecmon-zk4.usw1.viasat.cloud:2181,itsecmon-zk5.usw1.viasat.cloud:2181

zookeeper.connection.timeout.ms=10000
controlled.shutdown.enable=true
zookeeper.session.timeout.ms=10000

# vim:set filetype=jproperties
{code}

> Clean shutdown of broker fails due to controller error
> ------------------------------------------------------
>
>                 Key: KAFKA-4464
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4464
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.10.1.0
>         Environment: kafka@86a156fd9dda:~$ java -version
> java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> kafka@86a156fd9dda:~$ uname -a
> Linux 86a156fd9dda 4.7.3-coreos-r2 #1 SMP Tue Nov 1 01:38:43 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> kafka@86a156fd9dda:~$ ps alx | grep java
> 4  1000  1  0  20  0 75887304 3820220 futex_ Ssl ?  9379:49
> /usr/lib/jvm/java-8-oracle/bin/java -Xmx3G -Xms3G -server -XX:+UseG1GC
> -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35
> -XX:+DisableExplicitGC -Djava.awt.headless=true
> -Xloggc:/kafka/bin/../logs/kafkaServer-gc.log -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps
> -Dcom.sun.management.jmxremote=true
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false
> -Djava.rmi.server.hostname=10.60.68.122
> -Dcom.sun.management.jmxremote.rmi.port=7203 -Djava.net.preferIPv4Stack=true
> -Dcom.sun.management.jmxremote.port=7203 -Dkafka.logs.dir=/kafka/bin/../logs
> -Dlog4j.configuration=file:/kafka/bin/../config/log4j.properties -cp
> :/kafka/bin/../libs/aopalliance-repackaged-2.4.0-b34.jar:/kafka/bin/../libs/argparse4j-0.5.0.jar:/kafka/bin/../libs/connect-api-0.10.1.0.jar:/kafka/bin/../libs/connect-file-0.10.1.0.jar:/kafka/bin/../libs/connect-json-0.10.1.0.jar:/kafka/bin/../libs/connect-runtime-0.10.1.0.jar:/kafka/bin/../libs/guava-18.0.jar:/kafka/bin/../libs/hk2-api-2.4.0-b34.jar:/kafka/bin/../libs/hk2-locator-2.4.0-b34.jar:/kafka/bin/../libs/hk2-utils-2.4.0-b34.jar:/kafka/bin/../libs/jackson-annotations-2.6.0.jar:/kafka/bin/../libs/jackson-core-2.6.3.jar:/kafka/bin/../libs/jackson-databind-2.6.3.jar:/kafka/bin/../libs/jackson-jaxrs-base-2.6.3.jar:/kafka/bin/../libs/jackson-jaxrs-json-provider-2.6.3.jar:/kafka/bin/../libs/jackson-module-jaxb-annotations-2.6.3.jar:/kafka/bin/../libs/javassist-3.18.2-GA.jar:/kafka/bin/../libs/javax.annotation-api-1.2.jar:/kafka/bin/../libs/javax.inject-1.jar:/kafka/bin/../libs/javax.inject-2.4.0-b34.jar:/kafka/bin/../libs/javax.servlet-api-3.1.0.jar:/kafka/bin/../libs/javax.ws.rs-api-2.0.1.jar:/kafka/bin/../libs/jersey-client-2.22.2.jar:/kafka/bin/../libs/jersey-common-2.22.2.jar:/kafka/bin/../libs/jersey-container-servlet-2.22.2.jar:/kafka/bin/../libs/jersey-container-servlet-core-2.22.2.jar:/kafka/bin/../libs/jersey-guava-2.22.2.jar:/kafka/bin/../libs/jersey-media-jaxb-2.22.2.jar:/kafka/bin/../libs/jersey-server-2.22.2.jar:/kafka/bin/../libs/jetty-continuation-9.2.15.v20160210.jar:/kafka/bin/../libs/jetty-http-9.2.15.v20160210.jar:/kafka/bin/../libs/jetty-io-9.2.15.v20160210.jar:/kafka/bin/../libs/jetty-security-9.2.15.v20160210.jar:/kafka/bin/../libs/jetty-server-9.2.15.v20160210.jar:/kafka/bin/../libs/jetty-servlet-9.2.15.v20160210.jar:/kafka/bin/../libs/jetty-servlets-9.2.15.v20160210.jar:/kafka/bin/../libs/jetty-util-9.2.15.v20160210.jar:/kafka/bin/../libs/jopt-simple-4.9.jar:/kafka/bin/../libs/kafka-clients-0.10.1.0.jar:/kafka/bin/../libs/kafka-log4j-appender-0.10.1.0.jar:/kafka/bin/../libs/kafka-streams-0.10.1.0.jar:/kafka/bin/../libs/kafka-streams-examples-0.10.1.0.jar:/kafka/bin/../libs/kafka-tools-0.10.1.0.jar:/kafka/bin/../libs/kafka_2.11-0.10.1.0-sources.jar:/kafka/bin/../libs/kafka_2.11-0.10.1.0-test-sources.jar:/kafka/bin/../libs/kafka_2.11-0.10.1.0.jar:/kafka/bin/../libs/log4j-1.2.17.jar:/kafka/bin/../libs/lz4-1.3.0.jar:/kafka/bin/../libs/metrics-core-2.2.0.jar:/kafka/bin/../libs/osgi-resource-locator-1.0.1.jar:/kafka/bin/../libs/reflections-0.9.10.jar:/kafka/bin/../libs/rocksdbjni-4.9.0.jar:/kafka/bin/../libs/scala-library-2.11.8.jar:/kafka/bin/../libs/scala-parser-combinators_2.11-1.0.4.jar:/kafka/bin/../libs/slf4j-api-1.7.21.jar:/kafka/bin/../libs/slf4j-log4j12-1.7.21.jar:/kafka/bin/../libs/snappy-java-1.1.2.6.jar:/kafka/bin/../libs/validation-api-1.1.0.Final.jar:/kafka/bin/../libs/zkclient-0.9.jar:/kafka/bin/../libs/zookeeper-3.4.8.jar
> kafka.Kafka /kafka/config/server.properties
> This is running inside a docker container.
>            Reporter: Xavier Lange
>
> My cluster is unable to communicate with one of my brokers (Broker 1 in this
> case) and is spinning on logs:
> {code}
> [2016-11-29 19:05:08,659] WARN [ReplicaFetcherThread-0-1], Error in fetch kafka.server.ReplicaFetcherThread$FetchRequest@27aeb5f4 (kafka.server.ReplicaFetcherThread)
> java.io.IOException: Connection to 10.60.68.122:9092 (id: 1 rack: null) failed
>     at kafka.utils.NetworkClientBlockingOps$.awaitReady$1(NetworkClientBlockingOps.scala:83)
>     at kafka.utils.NetworkClientBlockingOps$.blockingReady$extension(NetworkClientBlockingOps.scala:93)
>     at kafka.server.ReplicaFetcherThread.sendRequest(ReplicaFetcherThread.scala:248)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:238)
>     at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:42)
>     at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:118)
>     at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:103)
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
> {code}
> On Broker 1 I tried issuing a shutdown and I get this:
> {code}
> [2016-11-29 18:33:16,152] INFO [Kafka Server 1], shutting down (kafka.server.KafkaServer)
> [2016-11-29 18:33:16,163] INFO [Kafka Server 1], Starting controlled shutdown (kafka.server.KafkaServer)
> [2016-11-29 18:33:38,347] INFO [Kafka Server 1], Remaining partitions to move: LONG_LIST_OF_TOPICS
> [2016-11-29 18:33:38,350] INFO [Kafka Server 1], Error code from controller: 0 (kafka.server.KafkaServer)
> [2016-11-29 18:33:43,356] WARN [Kafka Server 1], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
> [2016-11-29 18:34:04,053] INFO [Kafka Server 1], Remaining partitions to move: SAME_LONG_LIST_OF_TOPICS_AGAIN
> [2016-11-29 18:34:04,053] INFO [Kafka Server 1], Error code from controller: 0 (kafka.server.KafkaServer)
> [2016-11-29 18:34:09,054] WARN [Kafka Server 1], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
> [2016-11-29 18:34:32,577] INFO [Kafka Server 1], Remaining partitions to move: SAME_LONG_LIST_OF_TOPICS_AGAIN_AGAIN
> [2016-11-29 18:34:32,578] INFO [Kafka Server 1], Error code from controller: 0 (kafka.server.KafkaServer)
> [2016-11-29 18:34:37,579] WARN [Kafka Server 1], Retrying controlled shutdown after the previous attempt failed... (kafka.server.KafkaServer)
> [2016-11-29 18:34:37,586] WARN [Kafka Server 1], Proceeding to do an unclean shutdown as all the controlled shutdown attempts failed (kafka.server.KafkaServer)
> [2016-11-29 18:34:37,612] INFO [Socket Server on Broker 1], Shutting down (kafka.network.SocketServer)
> [2016-11-29 18:42:36,940] INFO Rolled new log segment for '__consumer_offsets-30' in 6 ms. (kafka.log.Log)
> [2016-11-29 18:43:52,440] INFO Deleting segment 71712593 from log __consumer_offsets-30. (kafka.log.Log)
> [2016-11-29 18:43:52,440] INFO Deleting segment 0 from log __consumer_offsets-30. (kafka.log.Log)
> [2016-11-29 18:43:52,492] INFO Deleting index /data/__consumer_offsets-30/00000000000071712593.index.deleted (kafka.log.OffsetIndex)
> [2016-11-29 18:43:52,532] INFO Deleting index /data/__consumer_offsets-30/00000000000000000000.index.deleted (kafka.log.OffsetIndex)
> [2016-11-29 18:43:52,532] INFO Deleting index /data/__consumer_offsets-30/00000000000000000000.timeindex.deleted (kafka.log.TimeIndex)
> [2016-11-29 18:43:52,549] INFO Deleting index /data/__consumer_offsets-30/00000000000071712593.timeindex.deleted (kafka.log.TimeIndex)
> [2016-11-29 18:43:53,370] INFO Deleting segment 72483593 from log __consumer_offsets-30. (kafka.log.Log)
> [2016-11-29 18:43:53,478] INFO Deleting index /data/__consumer_offsets-30/00000000000072483593.index.deleted (kafka.log.OffsetIndex)
> [2016-11-29 18:43:53,479] INFO Deleting index /data/__consumer_offsets-30/00000000000072483593.timeindex.deleted (kafka.log.TimeIndex)
> {code}
> So it says it's doing an unclean shutdown, but then it refuses to stop the
> process. Now I have this sort of zombie process, and the other brokers are
> spinning even faster trying to connect to it.
> What other logs can I provide to help debug this broker's failure?

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
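[Editor's note] The three "Retrying controlled shutdown" warnings in the log, spaced roughly five seconds apart before the broker falls back to an unclean shutdown, are consistent with the broker's controlled-shutdown retry settings, which the posted server.properties does not override. A sketch of the relevant keys, shown with what I believe are the 0.10.1.0 default values:

{code}
# Not set in the server.properties above; values shown are the broker defaults.
# Number of controlled-shutdown attempts before giving up and shutting down uncleanly:
controlled.shutdown.max.retries=3
# Pause between attempts, in milliseconds:
controlled.shutdown.retry.backoff.ms=5000
{code}

Raising these would only help if the controller could actually finish moving leadership; here every attempt reports the same remaining-partition list, which suggests the controller is not making progress rather than running out of time.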
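[Editor's note] For the closing question, the most useful artifact for a JVM that logs an unclean shutdown but never exits is usually a thread dump, which shows which non-daemon threads are keeping the process alive. A minimal sketch, assuming the broker runs with main class `kafka.Kafka` (as in the `ps` output above); the output path and process pattern are illustrative, not part of Kafka:

```shell
#!/bin/sh
# Find the broker JVM by its main class (pattern is an assumption; adjust as needed).
PID=$(pgrep -f 'kafka\.Kafka' | head -n 1)

if [ -n "$PID" ]; then
  # SIGQUIT makes HotSpot print a full thread dump to the process's stdout
  # (i.e. the broker console log) without terminating the process.
  kill -3 "$PID"
  # If JDK tools are on the PATH, jstack can capture the dump to a file instead.
  if command -v jstack >/dev/null 2>&1; then
    jstack "$PID" > /tmp/kafka-broker-threads.txt
  fi
else
  echo "no kafka.Kafka process found"
fi
```

Two or three dumps taken a few seconds apart would show whether the hang is in log cleanup, the socket-server shutdown path, or somewhere else entirely.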