[ https://issues.apache.org/jira/browse/KAFKA-4277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950689#comment-15950689 ]
Johan Ström edited comment on KAFKA-4277 at 3/31/17 11:24 AM: -------------------------------------------------------------- Hi, we just had the same problem. Our setup: 4x kafka_2.11-0.10.1.1 3x zookeeper-3.4.8 For now, only 1 topic + consumer_offsets, pretty low traffic and fortunately no critical data. For some external reason (not yet known), the three zookeepers lost connection with each other, and the brokers all lost connection to Zookeeper. This happened twice. All nodes reported they where in state Running via kafka.metrics.reporters (kafka.metrics.KafkaGraphiteMetricsReporter in our case). Some did however report underreplicated partitions, that was how we noticed it in the first place (unfortunately too late by watching metrics, no active monitoring of the actual numbers). TL;DR for one of the brokers: at 23:59:04,000 zookeper leader decides the session is expired, and at 23:59:07,850 kafka decides the session is expired. At 23:59:10,709 the broker has opened a new session, but fails to create it's own node since it already exists. The two Zk followers logged Read timeout at exactly the same time: zk1: 2017-03-28 23:05:55,923 zk2: 2017-03-28 23:05:55,923 zk3: time unknown, log was rolled over Within a second they are back online, now with 2 as leader. Almost an hour later, a new drop: {noformat} 2017-03-28 23:59:08,901 [myid:1] - WARN [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] - Exception when following the leader java.net.SocketTimeoutException: Read timed out {noformat} zk2 (now leader) reported a similar: {noformat} 2017-03-28 23:59:08,901 [myid:2] - ERROR [LearnerHandler-/10.88.176.158:46722:LearnerHandler@631] - Unexpected exception causing shutdown while sock still open java.net.SocketTimeoutException: Read timed out {noformat} Unfortunately no log from zk3 due to rollover. After this, zk3 took over as leader. Let's first look at where zk2 took over as leader. Immediately after, it started logging these: {noformat} 2017-03-28 23:05:56,671 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00000 type:create cxid:0x3 zxid:0x900000002 txntype:-1 reqpath:n/a Error Path:/kafka/brokers Error:KeeperErrorCode = NodeExists for /kafka/brokers 2017-03-28 23:05:56,671 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00000 type:create cxid:0x4 zxid:0x900000003 txntype:-1 reqpath:n/a Error Path:/kafka/brokers/ids Error:KeeperErrorCode = NodeExists for /kafka/brokers/ids 2017-03-28 23:05:56,672 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00000 type:create cxid:0x5 zxid:0x900000004 txntype:-1 reqpath:n/a Error Path:/kafka/brokers/ids/2 Error:KeeperErrorCode = NodeExists for /kafka/brokers/ids/2 {noformat} Unfortunately, this never recovers and /brokers/ids/ are eventually empty, since Kafka fails to re-register. In addition to the above, it also starts spamming these: {noformat} 2017-03-28 23:06:19,611 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00001 type:setData cxid:0xb zxid:0x900000077 txntype:-1 re qpath:n/a Error Path:/kafka/brokers/topics/iptv-engagement-event/partitions/3/state Error:KeeperErrorCode = BadVersion for /kafka/brokers/topics/iptv-engagement-event/partitions/3/state {noformat} These will continue for all partitions (or at least alot of them) on the leader (first 2 then 3) until we restart the brokers. --- When it comes to the brokers. broker 1, at first error: {noformat} [2017-03-28 23:05:55,560] WARN Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:55,560] INFO Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:55,661] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:05:55,711] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:56,711] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:56,714] INFO Session establishment complete on server zookeeper02/10.88.176.168:2181, sessionid = 0x359b1ad98b00004, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:56,714] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) {noformat} After this, it dropped some replication fetchers but no more errors. Until the next outage: Almost one hour later, a new timeout (still broker 1): {noformat} [2017-03-28 23:59:01,596] WARN Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,597] INFO Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,697] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:01,736] INFO Opening socket connection to server zookeeper01/10.88.176.167:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] WARN Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] INFO Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:04,375] INFO Opening socket connection to server zookeeper03/10.88.176.158:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] WARN Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] INFO Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] WARN Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:07,850] INFO Initiating client connection, connectString=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/kafka sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@4fb61f4a (org.apache.zookeeper.ZooKeeper) [2017-03-28 23:59:07,850] INFO Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO EventThread shut down for session: 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:09,854] WARN Client session timed out, have not heard from server in 2002ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:09,854] INFO Client session timed out, have not heard from server in 2002ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:09,955] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:10,699] INFO Opening socket connection to server zookeeper01/10.88.176.167:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:10,700] INFO Socket connection established to zookeeper01/10.88.176.167:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:10,702] INFO Session establishment complete on server zookeeper01/10.88.176.167:2181, sessionid = 0x15b16ee049c0000, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:10,702] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:10,705] INFO re-registering broker info in ZK for broker 1 (kafka.server.KafkaHealthcheck$SessionExpireListener) [2017-03-28 23:59:10,706] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.utils.ZKCheckedEphemeral) [2017-03-28 23:59:10,709] INFO Result of znode creation is: NODEEXISTS (kafka.utils.ZKCheckedEphemeral) [2017-03-28 23:59:10,709] ERROR Error handling event ZkEvent[New session event sent to kafka.server.KafkaHealthcheck$SessionExpireListener@70bba1f7] (org.I0Itec.zkclient.ZkEventThread) java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/1. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering. at kafka.utils.ZkUtils.registerBrokerInZk(ZkUtils.scala:393) at kafka.utils.ZkUtils.registerBrokerInZk(ZkUtils.scala:379) at kafka.server.KafkaHealthcheck.register(KafkaHealthcheck.scala:70) at kafka.server.KafkaHealthcheck$SessionExpireListener.handleNewSession(KafkaHealthcheck.scala:104) at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:735) at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) {noformat} Comparing to the other brokers, broker 2 and 3 failed its /brokers/ids/N registration at 23:05:55, and broker 4 failed at 23:59:11 same as broker 1. Let's take a look at broker1's (now expired) ZK session with ID 0x359b1ad98b00004: {noformat} [2017-03-28 23:05:56,714] INFO Session establishment complete on server zookeeper02/10.88.176.168:2181, sessionid = 0x359b1ad98b00004, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) ... [2017-03-28 23:59:01,596] WARN Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,597] INFO Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,697] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:01,736] INFO Opening socket connection to server zookeeper01/10.88.176.167:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] WARN Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] INFO Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:04,375] INFO Opening socket connection to server zookeeper03/10.88.176.158:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] WARN Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] INFO Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] WARN Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:07,850] INFO Initiating client connection, connectString=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/kafka sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@4fb61f4a (org.apache.zookeeper.ZooKeeper) [2017-03-28 23:59:07,850] INFO Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO EventThread shut down for session: 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) {noformat} The same session, on the zookeeper side: {noformat} 2017-03-28 23:05:56,714 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew session 0x359b1ad98b00004 at /10.88.176.164:60633 2017-03-28 23:05:56,714 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@645] - Established session 0x359b1ad98b00004 with negotiated timeout 6000 for client /10.88.176.164:60633 ... 2017-03-28 23:59:04,000 [myid:2] - INFO [SessionTracker:ZooKeeperServer@355] - Expiring session 0x359b1ad98b00004, timeout of 6000ms exceeded 2017-03-28 23:59:04,001 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@489] - Processed session termination for sessionid: 0x359b1ad98b00004 2017-03-28 23:59:05,867 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x359b1ad98b00004, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203) at java.lang.Thread.run(Thread.java:745) 2017-03-28 23:59:05,867 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /10.88.176.164:60633 which had sessionid 0x359b1ad98b00004 ... 2017-03-28 23:59:07,850 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew session 0x359b1ad98b00004 at /10.88.176.164:32922 2017-03-28 23:59:07,850 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@638] - Invalid session 0x359b1ad98b00004 for client /10.88.176.164:32922, probably expired 2017-03-28 23:59:07,850 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /10.88.176.164:32922 which had sessionid 0x359b1ad98b00004 {noformat} So, basically, the brokers are never able to re-register as described by others in this ticket. broker 1 doesn't log any more errors at all, and neither does broker 4. Weird though, since the zookeeper session has expired (according to Zk logs) several seconds before the broker tries to recreate it. Unfortunately the ZK cluster switchet to node 3 for leader just before the node creation was attempted, so I have no logs for that end :/ Additionaly, on broker 2 and broker 3, the following is spammed for all (most) partitions until they are restarted: {noformat} [2017-03-28 23:05:56,674] ERROR Error handling event ZkEvent[New session event sent to kafka.server.KafkaHealthcheck$SessionExpireListener@3767aad0] (org.I0Itec.zkclient.ZkEventThread) java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/2. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering. [2017-03-28 23:06:20,454] INFO Partition [__consumer_offsets,37] on broker 2: Shrinking ISR for partition [__consumer_offsets,37] from 2,1,3 to 2,3 (kafka.cluster.Partition) [2017-03-28 23:06:20,466] INFO Partition [__consumer_offsets,37] on broker 2: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) ..... and one it goes forever {noformat} Hopefully this can help pinpoint the cause.. and get a bit closer to a fix.. Let me know if more logs are useful, I'll keep them (except for zookeeper 3 which was overwritten before this was noticed). was (Author: jstrom): Hi, we just had the same problem. Our setup: 4x kafka_2.11-0.10.1.1 3x zookeeper-3.4.8 For now, only 1 topic + consumer_offsets, pretty low traffic and fortunately no critical data. For some external reason (not yet known), the three zookeepers lost connection with each other, and the brokers all lost connection to Zookeeper. This happened twice. All nodes reported they where in state Running via kafka.metrics.reporters (kafka.metrics.KafkaGraphiteMetricsReporter in our case). Some did however report underreplicated partitions, that was how we noticed it in the first place (unfortunately too late by watching metrics, no active monitoring of the actual numbers). TL;DR for one of the brokers: at 23:59:04,000 zookeper leader decides the session is expired, and at 23:59:07,850 kafka decides the session is expired. At 23:59:10,709 the broker has opened a new session, but fails to create it's own node since it already exists. The two Zk followers logged Read timeout at exactly the same time: zk1: 2017-03-28 23:05:55,923 zk2: 2017-03-28 23:05:55,923 zk3: time unknown, log was rolled over Within a second they are back online, now with 2 as leader. Almost an hour later, a new drop: {noformat} 2017-03-28 23:59:08,901 [myid:1] - WARN [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@89] - Exception when following the leader java.net.SocketTimeoutException: Read timed out {noformat} zk2 (now leader) reported a similar: {noformat} 2017-03-28 23:59:08,901 [myid:2] - ERROR [LearnerHandler-/10.88.176.158:46722:LearnerHandler@631] - Unexpected exception causing shutdown while sock still open java.net.SocketTimeoutException: Read timed out {noformat} Unfortunately no log from zk3 due to rollover. After this, zk3 took over as leader. Let's first look at where zk2 took over as leader. Immediately after, it started logging these: {noformat} 2017-03-28 23:05:56,671 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00000 type:create cxid:0x3 zxid:0x900000002 txntype:-1 reqpath:n/a Error Path:/kafka/brokers Error:KeeperErrorCode = NodeExists for /kafka/brokers 2017-03-28 23:05:56,671 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00000 type:create cxid:0x4 zxid:0x900000003 txntype:-1 reqpath:n/a Error Path:/kafka/brokers/ids Error:KeeperErrorCode = NodeExists for /kafka/brokers/ids 2017-03-28 23:05:56,672 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00000 type:create cxid:0x5 zxid:0x900000004 txntype:-1 reqpath:n/a Error Path:/kafka/brokers/ids/2 Error:KeeperErrorCode = NodeExists for /kafka/brokers/ids/2 {noformat} Unfortunately, this never recovers and /brokers/ids/ are eventually empty, since Kafka fails to re-register. In addition to the above, it also starts spamming these: {noformat} 2017-03-28 23:06:19,611 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@651] - Got user-level KeeperException when processing sessionid:0x15b16bd4cd00001 type:setData cxid:0xb zxid:0x900000077 txntype:-1 re qpath:n/a Error Path:/kafka/brokers/topics/iptv-engagement-event/partitions/3/state Error:KeeperErrorCode = BadVersion for /kafka/brokers/topics/iptv-engagement-event/partitions/3/state {noformat} These will continue for all partitions (or at least alot of them) on the leader (first 2 then 3) until we restart the brokers. --- When it comes to the brokers. broker 1, at first error: {noformat} [2017-03-28 23:05:55,560] WARN Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:55,560] INFO Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:55,661] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:05:55,711] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:56,711] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:56,714] INFO Session establishment complete on server zookeeper02/10.88.176.168:2181, sessionid = 0x359b1ad98b00004, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:05:56,714] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) {noformat} After this, it dropped some replication fetchers but no more errors. Until the next outage: Almost one hour later, a new timeout (still broker 1): {noformat} [2017-03-28 23:59:01,596] WARN Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,597] INFO Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,697] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:01,736] INFO Opening socket connection to server zookeeper01/10.88.176.167:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] WARN Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] INFO Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:04,375] INFO Opening socket connection to server zookeeper03/10.88.176.158:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] WARN Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] INFO Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] WARN Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:07,850] INFO Initiating client connection, connectString=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/kafka sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@4fb61f4a (org.apache.zookeeper.ZooKeeper) [2017-03-28 23:59:07,850] INFO Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO EventThread shut down for session: 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:09,854] WARN Client session timed out, have not heard from server in 2002ms for sessionid 0x0 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:09,854] INFO Client session timed out, have not heard from server in 2002ms for sessionid 0x0, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:09,955] INFO Waiting for keeper state SyncConnected (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:10,699] INFO Opening socket connection to server zookeeper01/10.88.176.167:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:10,700] INFO Socket connection established to zookeeper01/10.88.176.167:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:10,702] INFO Session establishment complete on server zookeeper01/10.88.176.167:2181, sessionid = 0x15b16ee049c0000, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:10,702] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:10,705] INFO re-registering broker info in ZK for broker 1 (kafka.server.KafkaHealthcheck$SessionExpireListener) [2017-03-28 23:59:10,706] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.utils.ZKCheckedEphemeral) [2017-03-28 23:59:10,709] INFO Result of znode creation is: NODEEXISTS (kafka.utils.ZKCheckedEphemeral) [2017-03-28 23:59:10,709] ERROR Error handling event ZkEvent[New session event sent to kafka.server.KafkaHealthcheck$SessionExpireListener@70bba1f7] (org.I0Itec.zkclient.ZkEventThread) java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/1. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering. at kafka.utils.ZkUtils.registerBrokerInZk(ZkUtils.scala:393) at kafka.utils.ZkUtils.registerBrokerInZk(ZkUtils.scala:379) at kafka.server.KafkaHealthcheck.register(KafkaHealthcheck.scala:70) at kafka.server.KafkaHealthcheck$SessionExpireListener.handleNewSession(KafkaHealthcheck.scala:104) at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:735) at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) {noformat} Comparing to the other brokers, broker 2 and 3 failed its /brokers/ids/N registration at 23:05:55, and broker 4 failed at 23:59:11 same as broker 1. Let's take a look at broker1's (now expired) ZK session with ID 0x359b1ad98b00004: {noformat} [2017-03-28 23:05:56,714] INFO Session establishment complete on server zookeeper02/10.88.176.168:2181, sessionid = 0x359b1ad98b00004, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) ... [2017-03-28 23:59:01,596] WARN Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,597] INFO Client session timed out, have not heard from server in 4002ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:01,697] INFO zookeeper state changed (Disconnected) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:01,736] INFO Opening socket connection to server zookeeper01/10.88.176.167:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] WARN Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:03,739] INFO Client session timed out, have not heard from server in 2042ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:04,375] INFO Opening socket connection to server zookeeper03/10.88.176.158:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] WARN Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:06,376] INFO Client session timed out, have not heard from server in 2537ms for sessionid 0x359b1ad98b00004, closing socket connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,849] INFO Socket connection established to zookeeper02/10.88.176.168:2181, initiating session (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] WARN Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,850] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient) [2017-03-28 23:59:07,850] INFO Initiating client connection, connectString=zookeeper01:2181,zookeeper02:2181,zookeeper03:2181/kafka sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@4fb61f4a (org.apache.zookeeper.ZooKeeper) [2017-03-28 23:59:07,850] INFO Unable to reconnect to ZooKeeper service, session 0x359b1ad98b00004 has expired, closing socket connection (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO Opening socket connection to server zookeeper02/10.88.176.168:2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn) [2017-03-28 23:59:07,852] INFO EventThread shut down for session: 0x359b1ad98b00004 (org.apache.zookeeper.ClientCnxn) {noformat} The same session, on the zookeeper side: {noformat} 2017-03-28 23:05:56,714 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew session 0x359b1ad98b00004 at /10.88.176.164:60633 2017-03-28 23:05:56,714 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@645] - Established session 0x359b1ad98b00004 with negotiated timeout 6000 for client /10.88.176.164:60633 ... 2017-03-28 23:59:04,000 [myid:2] - INFO [SessionTracker:ZooKeeperServer@355] - Expiring session 0x359b1ad98b00004, timeout of 6000ms exceeded 2017-03-28 23:59:04,001 [myid:2] - INFO [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@489] - Processed session termination for sessionid: 0x359b1ad98b00004 2017-03-28 23:59:05,867 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception EndOfStreamException: Unable to read additional data from client sessionid 0x359b1ad98b00004, likely client has closed socket at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:230) at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203) at java.lang.Thread.run(Thread.java:745) 2017-03-28 23:59:05,867 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /10.88.176.164:60633 which had sessionid 0x359b1ad98b00004 ... 2017-03-28 23:59:07,850 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@893] - Client attempting to renew session 0x359b1ad98b00004 at /10.88.176.164:32922 2017-03-28 23:59:07,850 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@638] - Invalid session 0x359b1ad98b00004 for client /10.88.176.164:32922, probably expired 2017-03-28 23:59:07,850 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1008] - Closed socket connection for client /10.88.176.164:32922 which had sessionid 0x359b1ad98b00004 {noformat} So, basically, the brokers are never able to re-register as described by others in this ticket. broker 1 doesn't log any more errors at all, and neither does broker 4. Weird though, since the zookeeper session has expired (according to Zk logs) several seconds before the broker tries to recreate it. Additionaly, on broker 2 and broker 3, the following is spammed for all (most) partitions until they are restarted: {noformat} [2017-03-28 23:05:56,674] ERROR Error handling event ZkEvent[New session event sent to kafka.server.KafkaHealthcheck$SessionExpireListener@3767aad0] (org.I0Itec.zkclient.ZkEventThread) java.lang.RuntimeException: A broker is already registered on the path /brokers/ids/2. This probably indicates that you either have configured a brokerid that is already in use, or else you have shutdown this broker and restarted it faster than the zookeeper timeout so it appears to be re-registering. [2017-03-28 23:06:20,454] INFO Partition [__consumer_offsets,37] on broker 2: Shrinking ISR for partition [__consumer_offsets,37] from 2,1,3 to 2,3 (kafka.cluster.Partition) [2017-03-28 23:06:20,466] INFO Partition [__consumer_offsets,37] on broker 2: Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition) ..... and one it goes forever {noformat} Hopefully this can help pinpoint the cause.. and get a bit closer to a fix.. Let me know if more logs are useful, I'll keep them (except for zookeeper 3 which was overwritten before this was noticed). > creating ephemeral node already exist > ------------------------------------- > > Key: KAFKA-4277 > URL: https://issues.apache.org/jira/browse/KAFKA-4277 > Project: Kafka > Issue Type: Bug > Affects Versions: 0.10.0.0 > Reporter: Feixiang Yan > > I use zookeeper 3.4.6. > Zookeeper session time out, zkClient try reconnect failed. Then re-establish > the session and re-registering broker info in ZK, throws NODEEXISTS Exception. > I think it is because the ephemeral node which created by old session has > not removed. > I read the > [ZkUtils.scala|https://github.com/apache/kafka/blob/0.8.1/core/src/main/scala/kafka/utils/ZkUtils.scala] > of 0.8.1, createEphemeralPathExpectConflictHandleZKBug try create node in a > while loop until create success. This can solve the issue. But in > [ZkUtils.scala|https://github.com/apache/kafka/blob/0.10.0.1/core/src/main/scala/kafka/utils/ZkUtils.scala] > 0.10.1 the function removed. > {noformat} > [2016-10-07 19:00:32,562] INFO Socket connection established to > 10.191.155.238/10.191.155.238:21819, initiating session > (org.apache.zookeeper.ClientCnxn) > [2016-10-07 19:00:32,563] INFO zookeeper state changed (Expired) > (org.I0Itec.zkclient.ZkClient) > [2016-10-07 19:00:32,564] INFO Unable to reconnect to ZooKeeper service, > session 0x1576b11f9b201bd has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2016-10-07 19:00:32,564] INFO Initiating client connection, > connectString=10.191.155.237:21819,10.191.155.238:21819,10.191.155.239:21819/cluster2 > sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@ae71be2 > (org.apache.zookeeper.ZooKeeper) > [2016-10-07 19:00:32,566] INFO Opening socket connection to server > 10.191.155.237/10.191.155.237:21819. Will not attempt to authenticate using > SASL (unknown error) (org.apache.zookeeper.ClientCnxn) > [2016-10-07 19:00:32,566] INFO Socket connection established to > 10.191.155.237/10.191.155.237:21819, initiating session > (org.apache.zookeeper.ClientCnxn) > [2016-10-07 19:00:32,566] INFO EventThread shut down > (org.apache.zookeeper.ClientCnxn) > [2016-10-07 19:00:32,567] INFO Session establishment complete on server > 10.191.155.237/10.191.155.237:21819, sessionid = 0x1579ecd39c20006, > negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn) > [2016-10-07 19:00:32,567] INFO zookeeper state changed (SyncConnected) > (org.I0Itec.zkclient.ZkClient) > [2016-10-07 19:00:32,608] INFO re-registering broker info in ZK for broker 3 > (kafka.server.KafkaHealthcheck$SessionExpireListener) > [2016-10-07 19:00:32,610] INFO Creating /brokers/ids/3 (is it secure? false) > (kafka.utils.ZKCheckedEphemeral) > [2016-10-07 19:00:32,611] INFO Result of znode creation is: NODEEXISTS > (kafka.utils.ZKCheckedEphemeral) > [2016-10-07 19:00:32,614] ERROR Error handling event ZkEvent[New session > event sent to kafka.server.KafkaHealthcheck$SessionExpireListener@324f1bc] > (org.I0Itec.zkclient.ZkEventThread) > java.lang.RuntimeException: A broker is already registered on the path > /brokers/ids/3. This probably indicates that you either have configured a > brokerid that is already in use, or else you have shutdown this broker and > restarted it faster than the zookeeper timeout so it appears to be > re-registering. > at kafka.utils.ZkUtils.registerBrokerInZk(ZkUtils.scala:305) > at kafka.utils.ZkUtils.registerBrokerInZk(ZkUtils.scala:291) > at kafka.server.KafkaHealthcheck.register(KafkaHealthcheck.scala:70) > at > kafka.server.KafkaHealthcheck$SessionExpireListener.handleNewSession(KafkaHealthcheck.scala:104) > at org.I0Itec.zkclient.ZkClient$6.run(ZkClient.java:735) > at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346)