Hello All, After performing an upgrade of our Kafka 0.8.0 to Kafka 0.8.1, we are receiving a failure in the preferred replica election process. I am not sure if this is a known issue or not. This is a two node Kafka cluster in our QA environment (replica 2) with a total of 2100+ partitions over 20 topics.
Upon the first run, we get the following output: Performing Rebalancing Process... 2014-04-14 17:04:22 DEBUG zkclient.ZkConnection - Creating new ZookKeeper instance to connect to tvip-m1-mw-zookeeper.dish.com:2181. 2014-04-14 17:04:22 INFO zkclient.ZkEventThread - Starting ZkClient event thread. 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:host.name=tm1mwdpl05 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:java.version=1.7.0_51 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:java.home=/apps/des/jdk1.7.0_51/jre 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:java.class.path=:/apps/des/kafka/bin/../core/build/dependant-libs-2.8.0/*.jar:/apps/des/kafka/bin/../perf/build/libs//kafka-perf_2.8.0*.jar:/apps/des/kafka/bin/../clients/build/libs//kafka-clients*.jar:/apps/des/kafka/bin/../examples/build/libs//kafka-examples*.jar:/apps/des/kafka/bin/../contrib/hadoop-consumer/build/libs//kafka-hadoop-consumer*.jar:/apps/des/kafka/bin/../contrib/hadoop-producer/build/libs//kafka-hadoop-producer*.jar:/apps/des/kafka/bin/../libs/jopt-simple-3.2.jar:/apps/des/kafka/bin/../libs/kafka_2.10-0.8.1.jar:/apps/des/kafka/bin/../libs/log4j-1.2.15.jar:/apps/des/kafka/bin/../libs/metrics-annotation-2.2.0.jar:/apps/des/kafka/bin/../libs/metrics-core-2.2.0.jar:/apps/des/kafka/bin/../libs/scala-library-2.10.1.jar:/apps/des/kafka/bin/../libs/slf4j-api-1.7.2.jar:/apps/des/kafka/bin/../libs/snappy-java-1.0.5.jar:/apps/des/kafka/bin/../libs/zkclient-0.3.jar:/apps/des/kafka/bin/../libs/zookeeper-3.3.4.jar:/apps/des/kafka/bin/../core/build/libs/kafka_2.8.0*.jar 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:java.compiler=<NA> 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:os.name=Linux 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:os.arch=amd64 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:os.version=2.6.32-220.el6.x86_64 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:user.name=desadmin 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:user.home=/apps/des 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Client environment:user.dir=/apps/des/bin 2014-04-14 17:04:22 INFO zookeeper.ZooKeeper - Initiating client connection, connectString=tvip-m1-mw-zookeeper.dish.com:2181 sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@77a63a20 2014-04-14 17:04:22 DEBUG zookeeper.ClientCnxn - zookeeper.disableAutoWatchReset is false 2014-04-14 17:04:22 DEBUG zkclient.ZkClient - Awaiting connection to Zookeeper server 2014-04-14 17:04:22 DEBUG zkclient.ZkClient - Waiting for keeper state SyncConnected 2014-04-14 17:04:22 INFO zookeeper.ClientCnxn - Opening socket connection to server tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181 2014-04-14 17:04:22 INFO zookeeper.ClientCnxn - Socket connection established to tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181, initiating session 2014-04-14 17:04:22 DEBUG zookeeper.ClientCnxn - Session establishment request sent on tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181 2014-04-14 17:04:22 INFO zookeeper.ClientCnxn - Session establishment complete on server tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181, sessionid = 0x2453d08efcac215, negotiated timeout = 30000 2014-04-14 17:04:22 DEBUG zkclient.ZkClient - Received event: WatchedEvent state:SyncConnected type:None path:null 2014-04-14 17:04:22 INFO zkclient.ZkClient - zookeeper state changed (SyncConnected) 2014-04-14 17:04:22 DEBUG zkclient.ZkClient - Leaving process event 2014-04-14 17:04:22 DEBUG zkclient.ZkClient - State is SyncConnected 2014-04-14 17:04:22 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x2453d08efcac215, packet:: clientPath:null serverPath:null finished:false header:: 1,8 replyHeader:: 1,73101951651,0 request:: '/brokers/topics,F response:: v{'bill-generated,'bob_test_topic,'rain-burn-in,'account-adjustment-issued,'account-access,'customer-cues,'submit-agreement,'logtest_v3,'account-created,'rain-load-test,'outbound-call-attempted,'account-info-updated,'rain-enhanced-consumer,'threshold-exceeded,'account-activated,'rain-enhanced,'outbound-communications} 2014-04-14 17:04:22 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x2453d08efcac215, packet:: clientPath:null serverPath:null finished:false header:: 2,8 replyHeader:: 2,73101951651,0 request:: '/brokers/topics/bill-generated/partitions,F response:: v{'35,'36,'33,'34,'39,'37,'38,'43,'42,'41,'40,'22,'23,'24,'25,'26,'27,'28,'29,'3,'2,'1,'0,'30,'7,'6,'32,'5,'4,'31,'9,'8,'19,'17,'18,'15,'16,'13,'14,'11,'12,'21,'20,'109,'108,'107,'106,'105,'104,'103,'99,'102,'101,'100,'98,'97,'96,'95,'94,'93,'92,'91,'90,'10,'88,'89,'116,'79,'117,'114,'78,'77,'115,'112,'113,'110,'111,'118,'119,'82,'83,'80,'81,'86,'87,'84,'85,'67,'125,'126,'66,'69,'127,'68,'128,'121,'122,'123,'124,'129,'70,'71,'120,'72,'73,'74,'75,'76,'59,'58,'57,'56,'55,'64,'65,'62,'63,'60,'61,'130,'49,'48,'45,'44,'47,'46,'51,'52,'53,'54,'50} 2014-04-14 17:04:22 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x2453d08efcac215, packet:: clientPath:null serverPath:null finished:false header:: 3,8 replyHeader:: 3,73101951651,0 request:: '/brokers/topics/bob_test_topic/partitions,F response:: v{'3,'2,'1,'0,'6,'5,'4} 2014-04-14 17:05:11 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x2453d08efcac215, packet:: clientPath:null serverPath:null finished:false header:: 2122,1 replyHeader:: 2122,73101958104,0 request:: '/admin/preferred_replica_election,#7b2276657273696f6e *big list of hex values* 6974696f6e223a35317d2c7b22746f706963223a2262696c6c2d67656e657261 746564222c22706172746974696f6e223a34377d2c7b22746f706963223a226c6f67746573745f7633222c22706172746974696f6e223a377d2c7b22746f706963223a227261696e2d656e68616e6 365642d636f6e73756d6572222c22706172746974696f6e223a357d5d7d,v{s{31,s{'world,'anyone}}},0 response:: '/admin/preferred_replica_election 2014-04-14 17:05:11 INFO admin.PreferredReplicaLeaderElectionCommand$ - Created preferred replica election path with {"version":1,"partitions":[{"topic":"acc ount-info-updated","partition":36},{"topic":"outbound-communications","partition":39},{"topic":"account-access","partition":107},{"topic":"account-created"," partition":8},{"topic":"rain-burn-in","partition":46} *huge list of topics and partitions* [submit-agreement,31]) 2014-04-14 17:05:11 DEBUG zkclient.ZkClient - Closing ZkClient... 2014-04-14 17:05:11 INFO zkclient.ZkEventThread - Terminate ZkClient event thread. 2014-04-14 17:05:11 DEBUG zkclient.ZkConnection - Closing ZooKeeper connected to tvip-m1-mw-zookeeper.dish.com:2181 2014-04-14 17:05:11 DEBUG zookeeper.ZooKeeper - Closing session: 0x2453d08efcac215 2014-04-14 17:05:11 DEBUG zookeeper.ClientCnxn - Closing client for session: 0x2453d08efcac215 2014-04-14 17:05:11 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x2453d08efcac215, packet:: clientPath:null serverPath:null finished:false header:: 2123,-11 replyHeader:: 2123,73101958105,0 request:: null response:: null 2014-04-14 17:05:11 DEBUG zookeeper.ClientCnxn - Disconnecting client for session: 0x2453d08efcac215 2014-04-14 17:05:11 INFO zookeeper.ZooKeeper - Session: 0x2453d08efcac215 closed 2014-04-14 17:05:11 DEBUG zkclient.ZkClient - Closing ZkClient...done 2014-04-14 17:05:11 DEBUG zookeeper.ClientCnxn - An exception was thrown while closing send thread for session 0x2453d08efcac215 : Unable to read additional data from server sessionid 0x2453d08efcac215, likely server has closed socket 2014-04-14 17:05:11 INFO zookeeper.ClientCnxn - EventThread shut down If I perform a topic list, I will see the partitions balanced between the brokers. However if I check in zookeeper, I see the following admin node still existing: [zk: tm1-zookeeper101:2181(CONNECTED) 0] get /admin/preferred_replica_election {"version":1,"partitions":[{"topic":"account-info-updated","partition":36}, *large list of topic and partitions* ,{"topic":"rain-enhanced-consumer","partition":5}]} cZxid = 0x1105375bd8 ctime = Mon Apr 14 17:05:11 MDT 2014 mZxid = 0x1105375bd8 mtime = Mon Apr 14 17:05:11 MDT 2014 pZxid = 0x1105375bd8 cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x0 dataLength = 94578 numChildren = 0 It appears that node was not removed after the election occurred. It looks like this is taking over 30 seconds. Perhaps this process is taking over 30 seconds as defined in the zookeeper session timeout? If I re-run the election (with the admin node still in place), I get the following error: Performing Rebalancing Process... 2014-04-14 17:18:13 DEBUG zkclient.ZkConnection - Creating new ZookKeeper instance to connect to tvip-m1-mw-zookeeper.dish.com:2181. 2014-04-14 17:18:13 INFO zkclient.ZkEventThread - Starting ZkClient event thread. 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:host.name=tm1mwdpl05 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:java.version=1.7.0_51 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:java.vendor=Oracle Corporation 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:java.home=/apps/des/jdk1.7.0_51/jre 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:java.class.path=:/apps/des/kafka/bin/../core/build/dependant-libs-2.8.0/*.jar:/apps/des/kafka/bin/../perf/build/libs//kafka-perf_2.8.0*.jar:/apps/des/kafka/bin/../clients/build/libs//kafka-clients*.jar:/apps/des/kafka/bin/../examples/build/libs//kafka-examples*.jar:/apps/des/kafka/bin/../contrib/hadoop-consumer/build/libs//kafka-hadoop-consumer*.jar:/apps/des/kafka/bin/../contrib/hadoop-producer/build/libs//kafka-hadoop-producer*.jar:/apps/des/kafka/bin/../libs/jopt-simple-3.2.jar:/apps/des/kafka/bin/../libs/kafka_2.10-0.8.1.jar:/apps/des/kafka/bin/../libs/log4j-1.2.15.jar:/apps/des/kafka/bin/../libs/metrics-annotation-2.2.0.jar:/apps/des/kafka/bin/../libs/metrics-core-2.2.0.jar:/apps/des/kafka/bin/../libs/scala-library-2.10.1.jar:/apps/des/kafka/bin/../libs/slf4j-api-1.7.2.jar:/apps/des/kafka/bin/../libs/snappy-java-1.0.5.jar:/apps/des/kafka/bin/../libs/zkclient-0.3.jar:/apps/des/kafka/bin/../libs/zookeeper-3.3.4.jar:/apps/des/kafka/bin/../core/build/libs/kafka_2.8.0*.jar 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:java.library.path=/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:java.io.tmpdir=/tmp 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:java.compiler=<NA> 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:os.name=Linux 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:os.arch=amd64 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:os.version=2.6.32-220.el6.x86_64 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:user.name=desadmin 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:user.home=/apps/des 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Client environment:user.dir=/apps/des/bin 2014-04-14 17:18:13 INFO zookeeper.ZooKeeper - Initiating client connection, connectString=tvip-m1-mw-zookeeper.dish.com:2181 sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@73250065 2014-04-14 17:18:13 DEBUG zookeeper.ClientCnxn - zookeeper.disableAutoWatchReset is false 2014-04-14 17:18:13 DEBUG zkclient.ZkClient - Awaiting connection to Zookeeper server 2014-04-14 17:18:13 DEBUG zkclient.ZkClient - Waiting for keeper state SyncConnected 2014-04-14 17:18:13 INFO zookeeper.ClientCnxn - Opening socket connection to server tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181 2014-04-14 17:18:13 INFO zookeeper.ClientCnxn - Socket connection established to tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181, initiating session 2014-04-14 17:18:13 DEBUG zookeeper.ClientCnxn - Session establishment request sent on tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181 2014-04-14 17:18:13 INFO zookeeper.ClientCnxn - Session establishment complete on server tvip-m1-mw-zookeeper.dish.com/10.18.60.25:2181, sessionid = 0x1453d08efa2c2b7, negotiated timeout = 30000 2014-04-14 17:18:13 DEBUG zkclient.ZkClient - Received event: WatchedEvent state:SyncConnected type:None path:null 2014-04-14 17:18:13 INFO zkclient.ZkClient - zookeeper state changed (SyncConnected) 2014-04-14 17:18:13 DEBUG zkclient.ZkClient - Leaving process event 2014-04-14 17:18:13 DEBUG zkclient.ZkClient - State is SyncConnected 2014-04-14 17:18:13 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x1453d08efa2c2b7, packet:: clientPath:null serverPath:null finished:false header:: 1,8 replyHeader:: 1,73102061236,0 request:: '/brokers/topics,F response:: v{'bill-generated,'bob_test_topic,'rain-burn-in,'account-adjustment-issued,'account-access,'customer-cues,'submit-agreement,'logtest_v3,'account-created,'rain-load-test,'outbound-call-attempted,'account-info-updated,'threshold-exceeded,'rain-enhanced-consumer,'account-activated,'outbound-communications,'rain-enhanced} 2014-04-14 17:18:13 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x1453d08efa2c2b7, packet:: clientPath:null serverPath:null finished:false header:: 2,8 replyHeader:: 2,73102061261,0 request:: '/brokers/topics/bill-generated/partitions,F response:: v{'35,'36,'33,'34,'39,'37,'38,'43,'42,'41,'40,'22,'23,'24,'25,'26,'27,'28,'29,'3,'2,'1,'0,'7,'30,'6,'5,'32,'31,'4,'9,'8,'19,'17,'18,'15,'16,'13,'14,'11,'12,'21,'20,'109,'108,'107,'106,'105,'104,'103,'99,'102,'101,'100,'98,'97,'96,'95,'94,'93,'92,'91,'90,'10,'88,'89,'116,'79,'117,'114,'78,'77,'115,'112,'113,'110,'111,'118,'119,'82,'83,'80,'81,'86,'87,'84,'85,'67,'125,'126,'66,'69,'127,'68,'128,'121,'122,'123,'124,'129,'70,'120,'71,'72,'73,'74,'75,'76,'59,'58,'57,'56,'55,'64,'65,'62,'63,'60,'130,'61,'49,'48,'45,'44,'47,'46,'51,'52,'53,'54,'50} *large list of logging statements* Failed to start preferred replica election kafka.common.AdminCommandFailedException: Admin command failed at kafka.admin.PreferredReplicaLeaderElectionCommand.moveLeaderToPreferredReplica(PreferredReplicaLeaderElectionCommand.scala:115) at kafka.admin.PreferredReplicaLeaderElectionCommand$.main(PreferredReplicaLeaderElectionCommand.scala:60) at kafka.admin.PreferredReplicaLeaderElectionCommand.main(PreferredReplicaLeaderElectionCommand.scala) Caused by: kafka.admin.AdminOperationException: Preferred replica leader election currently in progress for Set([account-adjustment-issued,77], *large list of topics and partitions* [submit-agreement,31]). Aborting operation at kafka.admin.PreferredReplicaLeaderElectionCommand$.writePreferredReplicaElectionData(PreferredReplicaLeaderElectionCommand.scala:101) at kafka.admin.PreferredReplicaLeaderElectionCommand.moveLeaderToPreferredReplica(PreferredReplicaLeaderElectionCommand.scala:113) ... 2 more 2014-04-14 17:19:00 DEBUG zkclient.ZkClient - Closing ZkClient... 2014-04-14 17:19:00 INFO zkclient.ZkEventThread - Terminate ZkClient event thread. 2014-04-14 17:19:00 DEBUG zkclient.ZkConnection - Closing ZooKeeper connected to tvip-m1-mw-zookeeper.dish.com:2181 2014-04-14 17:19:00 DEBUG zookeeper.ZooKeeper - Closing session: 0x1453d08efa2c2b7 2014-04-14 17:19:00 DEBUG zookeeper.ClientCnxn - Closing client for session: 0x1453d08efa2c2b7 2014-04-14 17:19:00 DEBUG zookeeper.ClientCnxn - Reading reply sessionid:0x1453d08efa2c2b7, packet:: clientPath:null serverPath:null finished:false header:: 2124,-11 replyHeader:: 2124,73102067456,0 request:: null response:: null 2014-04-14 17:19:00 DEBUG zookeeper.ClientCnxn - Disconnecting client for session: 0x1453d08efa2c2b7 2014-04-14 17:19:00 INFO zookeeper.ZooKeeper - Session: 0x1453d08efa2c2b7 closed 2014-04-14 17:19:00 DEBUG zookeeper.ClientCnxn - An exception was thrown while closing send thread for session 0x1453d08efa2c2b7 : Unable to read additional data from server sessionid 0x1453d08efa2c2b7, likely server has closed socket 2014-04-14 17:19:00 INFO zookeeper.ClientCnxn - EventThread shut down 2014-04-14 17:19:00 DEBUG zkclient.ZkClient - Closing ZkClient...done If I manually remove the /admin/preferred_replica_election node, the java exception goes away. A bit of other information: At worst ParNew GC time: 25-36 ms (total time per 15 second interval, 1-2 GCs per 15 second interval) per Broker during this operation. No CMS GCing occurring. Any ideas? Thanks /bob