Hello Abhinav, Which Kafka version are you using?
Guozhang On Wed, Jun 18, 2014 at 1:40 AM, Abhinav Anand <ab.rv...@gmail.com> wrote: > Hi Guys, > We have a 6 node cluster. Due to network failure yesterday(network was up > and running in few minutes), few of the brokers were not able to talk to > each other, including the zookeeper. During an issue in our camus job, we > identified that broker-2 was misbehaving. The controller logs reflect that > the broker-2 is not alive, though broker-2 was serving all > metadata/consumer requests and its log don't show nasty errors. > > If you see the camus logs below, any request to broker-2 would return > broker-2 as the leader. > Even though list-topic would show (1,3) as ISR and 3 as the leader. This > state is pretty disturbing as its difficult to detect and inherently not > supposed to state. > > #### *Log analysis *#### > For brevity i have attached my analysis in Kafka_split_brain.log file, > reflected the key points in the mail > > Network failure occurred somewhere around 2014-06-16 20:30 hrs > *###Controller Logs (Time ordered)* > > 1. Controller fails to send request to broker 2 > > [2014-06-16 20:23:34,197] WARN [Controller-3-to-broker-2-send-thread], > Controller 3 fails to send a request to broker 2 > (kafka.controller.RequestSendThread) > 2. NoReplicaOnlineException shows broker 2 to be not live > > [2014-06-16 21:40:14,351] ERROR Controller 3 epoch 63 initiated state > change for partition [same-topic-name,40] from OfflinePartition to > OnlinePartition failed (state.change.logger) > kafka.common.NoReplicaOnlineException: No replica for partition > [same-topic-name,40] is alive. Live brokers are: [Set(5, 1, 6, 3, 4)], > Assigned replicas are: [List(2)] > 3. After restarting the broker today, the controller logs selects 2 as > ISR. > > [2014-06-17 21:26:14,109] WARN [OfflinePartitionLeaderSelector]: No > broker in ISR is alive for [bfd-logstream_9092-PROD,40]. Elect leader 2 > from live brokers 2. There's potential data loss. > (kafka.controller.OfflinePartitionLeaderSelector) > > *###Broker-2 Logs* > > 1. When controller sends replica-request, warning invalid LeaderAndIsr > > - [2014-06-16 20:23:31,482] WARN Broker 2 received invalid > LeaderAndIsr request with correlation id 0 from controller 3 epoch 63 > with > an older leader epoch 41 for partition [cse-sauron,1], current > leader epoch is 41 (state.change.logger) > > 2. On restart today > > - [2014-06-17 21:26:14,135] ERROR Broker 2 aborted the > become-follower state change with correlation id 32 from controller 3 > epoch > 63 for partition [bfd-logstream_9092-PROD,82] new leader -1 > (state.change.logger) > > > > *###Where it all Started (Camus logs)* > > We have a camus job which runs every half-hour, during some of the > yesterdays run we had duplicate data creeping in. On investigation, we > found out whenever the metadata request was sent to broker-id 2, broker-id > 2 used to declare itself as leader. If the request was sent to other > brokers, the leader was always broker 3. On list topic, broker (1,3) were > ISR and 3 was the leader. Though if we see the camus log for yesterday’s > run, it finds broker-2 as the leader for one of its run. > > > - Job Run at 2014-06-17 00:06:34 > - 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob - > Fetching metadata from broker broker-1 with client id cse-sauron for 0 > topic(s) [] > 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob - > cse-sauron uri:tcp://broker-3-url leader:3 partition:0 > offset:10693174 latest_offset:10693174 > - Job Run at 2014-06-17 00:24:58 > - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob - > Fetching metadata from broker 2 with client id cse-sauron for 0 > topic(s) [] > 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob - > The current offset was found to be more than the latest offset > - 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob > - Moving to the earliest offset available > - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob - > cse-sauron uri:tcp://broker-2-url leader:2 partition:0 > offset:9747106 latest_offset:10674802 > - Job Run at 2014-06-17 01:01:54: > - 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob - > Fetching metadata from broker dare-broker02:9092 with client id > cse-sauron > for 0 topic(s) [] > 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob - > cse-sauron uri:tcp://dare-broker02.sv.walmartlabs.com:9092 leader:3 > partition:0 offset:10253311 latest_offset:10697656 > > Let me know if you need any more details > > -- > > Abhinav Anand > -- -- Guozhang