Hi Guys, We have a 6 node cluster. Due to network failure yesterday(network was up and running in few minutes), few of the brokers were not able to talk to each other, including the zookeeper. During an issue in our camus job, we identified that broker-2 was misbehaving. The controller logs reflect that the broker-2 is not alive, though broker-2 was serving all metadata/consumer requests and its log don't show nasty errors.
If you see the camus logs below, any request to broker-2 would return broker-2 as the leader. Even though list-topic would show (1,3) as ISR and 3 as the leader. This state is pretty disturbing as its difficult to detect and inherently not supposed to state. #### *Log analysis *#### For brevity i have attached my analysis in Kafka_split_brain.log file, reflected the key points in the mail Network failure occurred somewhere around 2014-06-16 20:30 hrs *###Controller Logs (Time ordered)* 1. Controller fails to send request to broker 2 [2014-06-16 20:23:34,197] WARN [Controller-3-to-broker-2-send-thread], Controller 3 fails to send a request to broker 2 (kafka.controller.RequestSendThread) 2. NoReplicaOnlineException shows broker 2 to be not live [2014-06-16 21:40:14,351] ERROR Controller 3 epoch 63 initiated state change for partition [same-topic-name,40] from OfflinePartition to OnlinePartition failed (state.change.logger) kafka.common.NoReplicaOnlineException: No replica for partition [same-topic-name,40] is alive. Live brokers are: [Set(5, 1, 6, 3, 4)], Assigned replicas are: [List(2)] 3. After restarting the broker today, the controller logs selects 2 as ISR. [2014-06-17 21:26:14,109] WARN [OfflinePartitionLeaderSelector]: No broker in ISR is alive for [bfd-logstream_9092-PROD,40]. Elect leader 2 from live brokers 2. There's potential data loss. (kafka.controller.OfflinePartitionLeaderSelector) *###Broker-2 Logs* 1. When controller sends replica-request, warning invalid LeaderAndIsr - [2014-06-16 20:23:31,482] WARN Broker 2 received invalid LeaderAndIsr request with correlation id 0 from controller 3 epoch 63 with an older leader epoch 41 for partition [cse-sauron,1], current leader epoch is 41 (state.change.logger) 2. On restart today - [2014-06-17 21:26:14,135] ERROR Broker 2 aborted the become-follower state change with correlation id 32 from controller 3 epoch 63 for partition [bfd-logstream_9092-PROD,82] new leader -1 (state.change.logger) *###Where it all Started (Camus logs)* We have a camus job which runs every half-hour, during some of the yesterdays run we had duplicate data creeping in. On investigation, we found out whenever the metadata request was sent to broker-id 2, broker-id 2 used to declare itself as leader. If the request was sent to other brokers, the leader was always broker 3. On list topic, broker (1,3) were ISR and 3 was the leader. Though if we see the camus log for yesterday’s run, it finds broker-2 as the leader for one of its run. - Job Run at 2014-06-17 00:06:34 - 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob - Fetching metadata from broker broker-1 with client id cse-sauron for 0 topic(s) [] 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob - cse-sauron uri:tcp://broker-3-url leader:3 partition:0 offset:10693174 latest_offset:10693174 - Job Run at 2014-06-17 00:24:58 - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob - Fetching metadata from broker 2 with client id cse-sauron for 0 topic(s) [] 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob - The current offset was found to be more than the latest offset - 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob - Moving to the earliest offset available - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob - cse-sauron uri:tcp://broker-2-url leader:2 partition:0 offset:9747106 latest_offset:10674802 - Job Run at 2014-06-17 01:01:54: - 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob - Fetching metadata from broker dare-broker02:9092 with client id cse-sauron for 0 topic(s) [] 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob - cse-sauron uri:tcp://dare-broker02.sv.walmartlabs.com:9092 leader:3 partition:0 offset:10253311 latest_offset:10697656 Let me know if you need any more details -- Abhinav Anand