Off the top of my head, it looks like it lost network connectivity to some extent.
Question - what settings were used for topics like efGamePlay? What is min insync replicas, replication factor, and what acks settings is the producer using? Cheers, Liam On Fri, 3 Jun 2022 at 22:55, dhiraj prajapati <dhirajp...@gmail.com> wrote: > Hi all, > Recently we faced an issue with one of our production kafka clusters: > - It is a 3 node cluster > - kafka server version is 1.0 > > *Issue*: > One of the brokers had some problem resulting in the following: > 1. The broker lost leadership of all of the topic-parttions > 2. However the kafka server process did *NOT* get stopped/killed. > > I have attached the exception traces from controller and server logs at > that time at the end of this email. > > *Questions:* > 1. What could be the reason behind this happening? > 2. How can I reproduce this exact scenario in our test environment? > > P.S. We did not see any GC logs, or general network blips at around that > time. > > *Exception trace in controller log:* > > [2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] Starting > (kafka.controller.RequestSendThread) > [2022-05-21 09:07:45,914] INFO [RequestSendThread controllerId=2] Starting > (kafka.controller.RequestSendThread) > [2022-05-21 09:07:48,354] ERROR [Controller id=2] Error while electing or > becoming controller on broker 2 (kafka.controller.KafkaController) > kafka.zookeeper.ZooKeeperClientExpiredException: Session expired either > before or while waiting for connection > at > > kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:233) > at > > kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221) > at > > kafka.zookeeper.ZooKeeperClient$$anonfun$kafka$zookeeper$ZooKeeperClient$$waitUntilConnected$1.apply(ZooKeeperClient.scala:221) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250) > at > > kafka.zookeeper.ZooKeeperClient.kafka$zookeeper$ZooKeeperClient$$waitUntilConnected(ZooKeeperClient.scala:221) > at > > kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply$mcV$sp(ZooKeeperClient.scala:215) > at > > kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215) > at > > kafka.zookeeper.ZooKeeperClient$$anonfun$waitUntilConnected$1.apply(ZooKeeperClient.scala:215) > at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:250) > at > > kafka.zookeeper.ZooKeeperClient.waitUntilConnected(ZooKeeperClient.scala:214) > at > > kafka.zk.KafkaZkClient.retryRequestsUntilConnected(KafkaZkClient.scala:1461) > at > > kafka.zk.KafkaZkClient.kafka$zk$KafkaZkClient$$retryRequestUntilConnected(KafkaZkClient.scala:1437) > at > kafka.zk.KafkaZkClient.getPartitionReassignment(KafkaZkClient.scala:671) > at > > kafka.controller.KafkaController.initializePartitionReassignment(KafkaController.scala:698) > at > > kafka.controller.KafkaController.initializeControllerContext(KafkaController.scala:655) > at > > kafka.controller.KafkaController.onControllerFailover(KafkaController.scala:232) > at > > kafka.controller.KafkaController.kafka$controller$KafkaController$$elect(KafkaController.scala:1203) > at > > kafka.controller.KafkaController$Reelect$.process(KafkaController.scala:1479) > at > > kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply$mcV$sp(ControllerEventManager.scala:69) > at > > kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) > at > > kafka.controller.ControllerEventManager$ControllerEventThread$$anonfun$doWork$1.apply(ControllerEventManager.scala:69) > at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31) > at > > kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:68) > at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82) > [2022-05-21 09:07:48,358] DEBUG [Controller id=2] Resigning > (kafka.controller.KafkaController) > [2022-05-21 09:07:48,358] DEBUG [Controller id=2] Unregister > BrokerModifications handler for Set(2, 1) > (kafka.controller.KafkaController) > [2022-05-21 09:07:48,360] INFO [PartitionStateMachine controllerId=2] > Stopped partition state machine (kafka.controller.PartitionStateMachine) > [2022-05-21 09:07:48,361] INFO [ReplicaStateMachine controllerId=2] Stopped > replica state machine (kafka.controller.ReplicaStateMachine) > [2022-05-21 09:07:48,361] INFO [RequestSendThread controllerId=2] Shutting > down (kafka.controller.RequestSendThread) > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped > (kafka.controller.RequestSendThread) > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutdown > completed (kafka.controller.RequestSendThread) > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutting > down (kafka.controller.RequestSendThread) > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Stopped > (kafka.controller.RequestSendThread) > [2022-05-21 09:07:48,364] INFO [RequestSendThread controllerId=2] Shutdown > completed (kafka.controller.RequestSendThread) > [2022-05-21 09:07:48,365] INFO [Controller id=2] Resigned > (kafka.controller.KafkaController) > > > *Snippets from server log:* > > [2022-05-21 09:06:36,678] INFO [Log partition=efAcquisition-18, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 252314842 in 2 ms. (kafka.log.Log) > [2022-05-21 09:06:53,106] INFO [GroupCoordinator 2]: Preparing to rebalance > group > spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 > with old generation 1 (__consumer_offsets-43) > (kafka.coordinator.group.GroupCoordinator) > [2022-05-21 09:06:53,107] INFO [GroupCoordinator 2]: Group > spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 > with generation 2 is now empty (__consumer_offsets-43) > (kafka.coordinator.group.GroupCoordinator) > [2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Preparing to rebalance > group > spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 > with old generation 1 (__consumer_offsets-37) > (kafka.coordinator.group.GroupCoordinator) > [2022-05-21 09:06:59,793] INFO [GroupCoordinator 2]: Group > spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 > with generation 2 is now empty (__consumer_offsets-37) > (kafka.coordinator.group.GroupCoordinator) > [2022-05-21 09:07:21,154] INFO [ProducerStateManager partition=hh_topic-9] > Writing producer snapshot at offset 7837005056 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:21,155] INFO [Log partition=hh_topic-9, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 7837005056 in 2 ms. (kafka.log.Log) > [2022-05-21 09:07:26,088] INFO [ProducerStateManager > partition=efGamePlay-11] Writing producer snapshot at offset 2116578569 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:26,089] INFO [Log partition=efGamePlay-11, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 2116578569 in 2 ms. (kafka.log.Log) > [2022-05-21 09:07:27,170] INFO [ProducerStateManager > partition=efGamePlay-18] Writing producer snapshot at offset 2116607621 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:27,171] INFO [Log partition=efGamePlay-18, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 2116607621 in 2 ms. (kafka.log.Log) > [2022-05-21 09:07:30,444] INFO [ProducerStateManager > partition=auditTopic-36] Writing producer snapshot at offset 145646657 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:30,445] INFO [Log partition=auditTopic-36, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 145646657 in 2 ms. (kafka.log.Log) > [2022-05-21 09:07:33,189] INFO [ProducerStateManager > partition=heartbeats-0] Writing producer snapshot at offset 43292127 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:33,190] INFO [Log partition=heartbeats-0, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 43292127 in 3 ms. (kafka.log.Log) > [2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group > spark-kafka-source-c2b4063b-baf3-40aa-af8b-1c0ea4246ca8-196930323-driver-0 > transitioned to Dead in generation 2 > (kafka.coordinator.group.GroupMetadataManager) > [2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Group > spark-kafka-source-800e8ff9-02f4-4c34-bfc9-2d73d723b4e3-1912945116-driver-0 > transitioned to Dead in generation 2 > (kafka.coordinator.group.GroupMetadataManager) > [2022-05-21 09:07:35,760] INFO [GroupMetadataManager brokerId=2] Removed 0 > expired offsets in 0 milliseconds. > (kafka.coordinator.group.GroupMetadataManager) > [2022-05-21 09:07:37,543] INFO [ProducerStateManager > partition=auditTopic-0] Writing producer snapshot at offset 145649350 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:37,544] INFO [Log partition=auditTopic-0, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 145649350 in 2 ms. (kafka.log.Log) > [2022-05-21 09:07:39,207] INFO [ProducerStateManager > partition=taxonomyOthers-48] Writing producer snapshot at offset 162628223 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:39,207] INFO [Log partition=taxonomyOthers-48, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 162628223 in 2 ms. (kafka.log.Log) > [2022-05-21 09:07:40,103] INFO Creating /controller (is it secure? false) > (kafka.zk.KafkaZkClient) > [2022-05-21 09:07:40,115] INFO Result of znode creation at /controller is: > OK (kafka.zk.KafkaZkClient) > [2022-05-21 09:07:45,915] WARN Client session timed out, have not heard > from server in 5138ms for sessionid 0x27b85a3ead1005a > (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:45,916] INFO Client session timed out, have not heard > from server in 5138ms for sessionid 0x27b85a3ead1005a, closing socket > connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:45,933] WARN Attempting to send response via channel for > which there is no open connection, connection id 10.200.3.174:9092 > -10.200.2.18:33360-5632853 (kafka.network.Processor) > [2022-05-21 09:07:45,936] WARN Attempting to send response via channel for > which there is no open connection, connection id 10.200.3.174:9092 > -10.200.3.29:18284-5535253 (kafka.network.Processor) > [2022-05-21 09:07:45,995] WARN Attempting to send response via channel for > which there is no open connection, connection id 10.200.3.174:9092 > -10.200.4.128:6078-5535526 (kafka.network.Processor) > [2022-05-21 09:07:45,996] WARN Attempting to send response via channel for > which there is no open connection, connection id 10.200.3.174:9092 > -10.200.4.129:6664-5535480 (kafka.network.Processor) > [2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Waiting until connected. > (kafka.zookeeper.ZooKeeperClient) > [2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Connected. > (kafka.zookeeper.ZooKeeperClient) > [2022-05-21 09:07:46,902] INFO Opening socket connection to server 10.200/ > 10.0.0.200:2181. Will not attempt to authenticate using SASL (unknown > error) (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,027] WARN Client session timed out, have not heard > from server in 2001ms for sessionid 0x27b85a3ead1005a > (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,027] INFO Client session timed out, have not heard > from server in 2001ms for sessionid 0x27b85a3ead1005a, closing socket > connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,129] INFO [ZooKeeperClient] Waiting until connected. > (kafka.zookeeper.ZooKeeperClient) > [2022-05-21 09:07:46,028] INFO [ZooKeeperClient] Connected. > (kafka.zookeeper.ZooKeeperClient) > [2022-05-21 09:07:46,902] INFO Opening socket connection to server 10.200/ > 10.0.0.200:2181. Will not attempt to authenticate using SASL (unknown > error) (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,027] WARN Client session timed out, have not heard > from server in 2001ms for sessionid 0x27b85a3ead1005a > (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,027] INFO Client session timed out, have not heard > from server in 2001ms for sessionid 0x27b85a3ead1005a, closing socket > connection and attempting reconnect (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,129] INFO [ZooKeeperClient] Waiting until connected. > (kafka.zookeeper.ZooKeeperClient) > [2022-05-21 09:07:48,294] INFO Opening socket connection to server > 10.200.3.174/10.200.3.174:2181. Will not attempt to authenticate using > SASL > (unknown error) (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,294] INFO Socket connection established to > 10.200.3.174/10.200.3.174:2181, initiating session > (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,313] WARN Unable to reconnect to ZooKeeper service, > session 0x27b85a3ead1005a has expired (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,314] INFO Unable to reconnect to ZooKeeper service, > session 0x27b85a3ead1005a has expired, closing socket connection > (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,314] INFO EventThread shut down for session: > 0x27b85a3ead1005a (org.apache.zookeeper.ClientCnxn) > [2022-05-21 09:07:48,318] INFO [ZooKeeperClient] Session expired. > (kafka.zookeeper.ZooKeeperClient) > [2022-05-21 09:07:48,396] INFO [ProducerStateManager > partition=efGamePlay-41] Writing producer snapshot at offset 2116645612 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:48,397] INFO [Log partition=efGamePlay-41, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 2116645612 in 2 ms. (kafka.log.Log) > [2022-05-21 09:07:49,452] INFO [ProducerStateManager > partition=efAcquisition-47] Writing producer snapshot at offset 252224167 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:49,453] INFO [Log partition=efAcquisition-47, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 252224167 in 3 ms. (kafka.log.Log) > [2022-05-21 09:07:57,475] INFO [ProducerStateManager > partition=efGamePlay-5] Writing producer snapshot at offset 2116617455 > (kafka.log.ProducerStateManager) > [2022-05-21 09:07:57,476] INFO [Log partition=efGamePlay-5, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Rolled new log segment at > offset 2116617455 in 2 ms. (kafka.log.Log) > [2022-05-21 09:08:00,642] INFO [Partition prism_sqldb_topic-55 broker=2] > Shrinking ISR from 2,1 to 2 (kafka.cluster.Partition) > [2022-05-21 09:08:00,782] WARN Attempting to send response via channel for > which there is no open connection, connection id 10.200.3.174:9092 > -10.200.2.192:52088-5632843 (kafka.network.Processor) > [2022-05-21 09:08:05,321] INFO [Log partition=efGamePlay-41, > dir=/home/tomcat/kafka_2.11-0.9.0.1/commitlogs] Found deletable segments > with base offsets [2116567157] due to retention time 3600000ms breach > (kafka.log.Log) >