Could you try the latest release, 0.8.1.1? It has a number of critical bug fixes that may solve your problem.
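In the meantime, one way to see what each broker individually thinks the leader is would be to send a TopicMetadataRequest directly to every broker and compare the answers. Below is a rough, untested sketch against the 0.8 SimpleConsumer Java API; the broker hostnames, port and topic name are placeholders taken from your logs, so adjust them to your setup:

import java.util.Collections;

import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class LeaderCheck {
    public static void main(String[] args) {
        // Placeholder broker hostnames and topic; substitute your own.
        String[] brokers = {"broker-1-url", "broker-2-url", "broker-3-url"};
        String topic = "cse-sauron";

        for (String host : brokers) {
            // Ask this specific broker for its view of the topic metadata.
            SimpleConsumer consumer =
                new SimpleConsumer(host, 9092, 100000, 64 * 1024, "leader-check");
            try {
                TopicMetadataRequest request =
                    new TopicMetadataRequest(Collections.singletonList(topic));
                TopicMetadataResponse response = consumer.send(request);
                for (TopicMetadata tm : response.topicsMetadata()) {
                    for (PartitionMetadata pm : tm.partitionsMetadata()) {
                        // Compare the leader/ISR each broker reports.
                        System.out.println("asked " + host
                            + " -> partition " + pm.partitionId()
                            + " leader " + (pm.leader() == null ? "none" : pm.leader().id())
                            + " isr " + pm.isr());
                    }
                }
            } finally {
                consumer.close();
            }
        }
    }
}

If broker-2 keeps returning itself as the leader while the other brokers return broker 3, that would confirm the stale view you describe below.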
On Thu, Jun 19, 2014 at 11:01 PM, Abhinav Anand <ab.rv...@gmail.com> wrote:

> Hi Guozhang,
> I am using the 0.8.0 release.
>
> Regards,
> Abhinav
>
>
> On Thu, Jun 19, 2014 at 2:06 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hello Abhinav,
> >
> > Which Kafka version are you using?
> >
> > Guozhang
> >
> >
> > On Wed, Jun 18, 2014 at 1:40 AM, Abhinav Anand <ab.rv...@gmail.com> wrote:
> >
> > > Hi Guys,
> > > We have a 6 node cluster. Due to a network failure yesterday (the
> > > network was back up and running within a few minutes), a few of the
> > > brokers were not able to talk to each other or to the zookeeper. While
> > > investigating an issue in our Camus job, we identified that broker-2
> > > was misbehaving. The controller logs reflect that broker-2 is not
> > > alive, even though broker-2 was serving all metadata/consumer requests
> > > and its logs don't show any nasty errors.
> > >
> > > As you can see in the Camus logs below, any metadata request sent to
> > > broker-2 would return broker-2 as the leader, even though list-topic
> > > would show (1,3) as the ISR and 3 as the leader. This state is pretty
> > > disturbing, as it is difficult to detect and the cluster is not
> > > supposed to end up in it.
> > >
> > > #### *Log analysis* ####
> > > For brevity I have attached my analysis in the Kafka_split_brain.log
> > > file and summarized the key points in this mail.
> > >
> > > The network failure occurred somewhere around 2014-06-16 20:30 hrs.
> > >
> > > *### Controller logs (time ordered)*
> > >
> > > 1. The controller fails to send a request to broker 2:
> > >
> > > [2014-06-16 20:23:34,197] WARN [Controller-3-to-broker-2-send-thread],
> > > Controller 3 fails to send a request to broker 2
> > > (kafka.controller.RequestSendThread)
> > >
> > > 2. A NoReplicaOnlineException shows broker 2 as not live:
> > >
> > > [2014-06-16 21:40:14,351] ERROR Controller 3 epoch 63 initiated state
> > > change for partition [same-topic-name,40] from OfflinePartition to
> > > OnlinePartition failed (state.change.logger)
> > > kafka.common.NoReplicaOnlineException: No replica for partition
> > > [same-topic-name,40] is alive. Live brokers are: [Set(5, 1, 6, 3, 4)],
> > > Assigned replicas are: [List(2)]
> > >
> > > 3. After restarting the broker today, the controller elects 2 as
> > > leader:
> > >
> > > [2014-06-17 21:26:14,109] WARN [OfflinePartitionLeaderSelector]: No
> > > broker in ISR is alive for [bfd-logstream_9092-PROD,40]. Elect leader 2
> > > from live brokers 2. There's potential data loss.
> > > (kafka.controller.OfflinePartitionLeaderSelector)
> > >
> > > *### Broker-2 logs*
> > >
> > > 1. When the controller sends the LeaderAndIsr request, broker 2 warns
> > > that the request is invalid:
> > >
> > >    - [2014-06-16 20:23:31,482] WARN Broker 2 received invalid
> > >    LeaderAndIsr request with correlation id 0 from controller 3 epoch
> > >    63 with an older leader epoch 41 for partition [cse-sauron,1],
> > >    current leader epoch is 41 (state.change.logger)
> > >
> > > 2. On restart today:
> > >
> > >    - [2014-06-17 21:26:14,135] ERROR Broker 2 aborted the
> > >    become-follower state change with correlation id 32 from controller
> > >    3 epoch 63 for partition [bfd-logstream_9092-PROD,82] new leader -1
> > >    (state.change.logger)
> > >
> > >
> > > *### Where it all started (Camus logs)*
> > >
> > > We have a Camus job which runs every half hour; during some of
> > > yesterday's runs we had duplicate data creeping in. On investigation,
> > > we found that whenever the metadata request was sent to broker-id 2,
> > > broker-id 2 would declare itself as the leader. If the request was sent
> > > to the other brokers, the leader was always broker 3. In the list-topic
> > > output, brokers (1,3) were the ISR and 3 was the leader. Yet the Camus
> > > log for yesterday shows broker-2 as the leader for one of its runs.
> > >
> > >
> > >    - Job run at 2014-06-17 00:06:34
> > >       - 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Fetching metadata from broker broker-1 with client id
> > >       cse-sauron for 0 topic(s) []
> > >       2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> > >       cse-sauron uri:tcp://broker-3-url leader:3 partition:0
> > >       offset:10693174 latest_offset:10693174
> > >    - Job run at 2014-06-17 00:24:58
> > >       - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Fetching metadata from broker 2 with client id cse-sauron for 0
> > >       topic(s) []
> > >       2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob
> > >       - The current offset was found to be more than the latest offset
> > >       - 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Moving to the earliest offset available
> > >       - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - cse-sauron uri:tcp://broker-2-url leader:2 partition:0
> > >       offset:9747106 latest_offset:10674802
> > >    - Job run at 2014-06-17 01:01:54
> > >       - 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Fetching metadata from broker dare-broker02:9092 with client id
> > >       cse-sauron for 0 topic(s) []
> > >       2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> > >       cse-sauron uri:tcp://dare-broker02.sv.walmartlabs.com:9092
> > >       leader:3 partition:0 offset:10253311 latest_offset:10697656
> > >
> > > Let me know if you need any more details.
> > >
> > > --
> > > Abhinav Anand
> > >
> >
> >
> > --
> > -- Guozhang
> >
>
>
> --
> Abhinav Anand
>


--
-- Guozhang
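One more cross-check that may help when a broker's metadata responses disagree with list-topic: the leader/ISR the controller has actually committed lives in the partition state znode in ZooKeeper and can be read with the plain ZooKeeper Java client. A minimal, untested sketch; the connect string, topic name and partition number are placeholders:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class PartitionStateCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper connect string; use the one your brokers point at.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // Kafka keeps the committed leader/ISR for each partition in this znode.
        String path = "/brokers/topics/cse-sauron/partitions/0/state";
        byte[] data = zk.getData(path, false, null);
        // Prints JSON along the lines of:
        // {"controller_epoch":63,"leader":3,"version":1,"leader_epoch":41,"isr":[1,3]}
        System.out.println(new String(data, "UTF-8"));

        zk.close();
    }
}

If this znode shows leader 3 and ISR [1,3] while broker-2's own metadata response claims broker-2 as the leader, that points to broker-2 holding a stale view rather than to a problem on the controller side.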