Hi Guozhang,
I am using the 0.8.0 release.
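
In case it is useful to anyone hitting the same thing, here is a rough sketch of the cross-broker consistency check we did by hand: ask every broker for metadata on the same partition and compare the leaders they report. The class and method names below are hypothetical (not Kafka API), and the per-broker reports are hard-coded to mirror what we saw; in practice each entry would come from a kafka.javaapi.TopicMetadataRequest sent to that broker via the 0.8.x SimpleConsumer.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the cross-broker check: in the split-brain state described
// below, broker 2 kept reporting itself as leader while brokers 1 and 3
// agreed on 3. Names are hypothetical, not Kafka API.
public class SplitBrainCheck {

    // reports: brokerId -> leader that broker claims for the partition.
    // Returns the id of a broker whose claimed leader differs from the
    // majority view, or -1 if all brokers agree.
    static int findSplitBrain(Map<Integer, Integer> reports) {
        // Count how many brokers report each candidate leader.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int leader : reports.values()) {
            votes.merge(leader, 1, Integer::sum);
        }
        // Pick the majority leader.
        int majorityLeader = -1, best = 0;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > best) {
                best = e.getValue();
                majorityLeader = e.getKey();
            }
        }
        // Any broker disagreeing with the majority is suspect.
        for (Map.Entry<Integer, Integer> e : reports.entrySet()) {
            if (e.getValue() != majorityLeader) {
                return e.getKey();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // What the metadata requests showed for partition 0:
        // brokers 1 and 3 report leader 3, broker 2 insists on itself.
        Map<Integer, Integer> reports = new HashMap<>();
        reports.put(1, 3);
        reports.put(2, 2);
        reports.put(3, 3);
        System.out.println("disagreeing broker: " + findSplitBrain(reports));
    }
}
```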

Regards,
Abhinav


On Thu, Jun 19, 2014 at 2:06 AM, Guozhang Wang <wangg...@gmail.com> wrote:

> Hello Abhinav,
>
> Which Kafka version are you using?
>
> Guozhang
>
>
> On Wed, Jun 18, 2014 at 1:40 AM, Abhinav Anand <ab.rv...@gmail.com> wrote:
>
> > Hi Guys,
> >  We have a 6-node cluster. During a brief network failure yesterday
> > (the network was back up within a few minutes), a few of the brokers
> > could not talk to each other or to ZooKeeper. While investigating an
> > issue in our Camus job, we identified that broker-2 was misbehaving.
> > The controller logs report that broker-2 is not alive, yet broker-2
> > was serving all metadata/consumer requests, and its own logs show no
> > serious errors.
> >
> > As the Camus logs below show, any metadata request sent to broker-2
> > would return broker-2 as the leader, even though list-topic showed
> > (1,3) as the ISR and 3 as the leader. This state is quite worrying
> > because it is hard to detect and should not be possible in the first
> > place.
> >
> > #### *Log analysis *####
> > For brevity, I have attached my full analysis in the
> > Kafka_split_brain.log file and summarized the key points in this mail.
> >
> > Network failure occurred somewhere around 2014-06-16 20:30 hrs
> > *###Controller Logs (Time ordered)*
> >
> > 1. Controller fails to send request to broker 2
> >
> > [2014-06-16 20:23:34,197] WARN [Controller-3-to-broker-2-send-thread],
> > Controller 3 fails to send a request to broker 2
> > (kafka.controller.RequestSendThread)
> > 2. A NoReplicaOnlineException shows that broker 2 is not considered live
> >
> > [2014-06-16 21:40:14,351] ERROR Controller 3 epoch 63 initiated state
> > change for partition [same-topic-name,40] from OfflinePartition to
> > OnlinePartition failed (state.change.logger)
> > kafka.common.NoReplicaOnlineException: No replica for partition
> > [same-topic-name,40] is alive. Live brokers are: [Set(5, 1, 6, 3, 4)],
> > Assigned replicas are: [List(2)]
> > 3. After restarting the broker today, the controller elects broker 2
> > as leader even though no broker in the ISR is alive.
> >
> > [2014-06-17 21:26:14,109] WARN [OfflinePartitionLeaderSelector]: No
> > broker in ISR is alive for [bfd-logstream_9092-PROD,40]. Elect leader 2
> > from live brokers 2. There's potential data loss.
> > (kafka.controller.OfflinePartitionLeaderSelector)
> >
> > *###Broker-2 Logs*
> >
> > 1. When the controller sends a LeaderAndIsr request, broker 2 warns
> > that the request is invalid
> >
> >     - [2014-06-16 20:23:31,482] WARN Broker 2 received invalid
> >       LeaderAndIsr request with correlation id 0 from controller 3 epoch
> 63 with
> >       an older leader epoch 41 for partition [cse-sauron,1], current
> >       leader epoch is 41 (state.change.logger)
> >
> > 2. On restart today
> >
> >     - [2014-06-17 21:26:14,135] ERROR Broker 2 aborted the
> >       become-follower state change with correlation id 32 from
> controller 3 epoch
> >       63 for partition [bfd-logstream_9092-PROD,82] new leader -1
> >       (state.change.logger)
> >
> >
> >
> > *###Where it all Started (Camus logs)*
> >
> > We have a Camus job that runs every half hour; during some of
> > yesterday's runs, duplicate data crept in. On investigation, we found
> > that whenever the metadata request was sent to broker 2, broker 2
> > would declare itself the leader, while a request sent to any other
> > broker always returned broker 3 as the leader. According to
> > list-topic, brokers (1,3) were the ISR and 3 was the leader. Yet the
> > Camus log for yesterday shows broker-2 as the leader for one of its
> > runs.
> >
> >
> >    - Job Run at 2014-06-17 00:06:34
> >       - 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> >       Fetching metadata from broker broker-1 with client id cse-sauron
> for 0
> >       topic(s) []
> >       2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> >       cse-sauron uri:tcp://broker-3-url leader:3 partition:0
> >       offset:10693174 latest_offset:10693174
> >    - Job Run at 2014-06-17 00:24:58
> >       - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> >       Fetching metadata from broker 2 with client id cse-sauron for 0
> topic(s) []
> >       2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob -
> >       The current offset was found to be more than the latest offset
> >       - 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob
> >       - Moving to the earliest offset available
> >       - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> >       cse-sauron uri:tcp://broker-2-url leader:2 partition:0
> >       offset:9747106 latest_offset:10674802
> >    - Job Run at 2014-06-17 01:01:54:
> >       - 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> >       Fetching metadata from broker dare-broker02:9092 with client id
> cse-sauron
> >       for 0 topic(s) []
> >       2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> >       cse-sauron uri:tcp://dare-broker02.sv.walmartlabs.com:9092
> leader:3
> >       partition:0 offset:10253311 latest_offset:10697656
> >
> >  Let me know if you need any more details.
> >
> > --
> >
> > Abhinav Anand
> >
>
>
>
> --
> -- Guozhang
>



-- 
Abhinav Anand
