Could you try the latest release, 0.8.1.1? It has a number of critical bug fixes that may solve your problem.
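In the meantime, one way to see what each broker individually thinks the leader is would be to send a TopicMetadataRequest directly to every broker and compare the answers. Below is a rough, untested sketch against the 0.8 SimpleConsumer Java API; the broker hostnames, port and topic name are placeholders taken from your logs, so adjust them to your setup:

import java.util.Collections;

import kafka.javaapi.PartitionMetadata;
import kafka.javaapi.TopicMetadata;
import kafka.javaapi.TopicMetadataRequest;
import kafka.javaapi.TopicMetadataResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class LeaderCheck {
    public static void main(String[] args) {
        // Placeholder broker hostnames and topic; substitute your own.
        String[] brokers = {"broker-1-url", "broker-2-url", "broker-3-url"};
        String topic = "cse-sauron";

        for (String host : brokers) {
            // Ask this specific broker for its view of the topic metadata.
            SimpleConsumer consumer =
                new SimpleConsumer(host, 9092, 100000, 64 * 1024, "leader-check");
            try {
                TopicMetadataRequest request =
                    new TopicMetadataRequest(Collections.singletonList(topic));
                TopicMetadataResponse response = consumer.send(request);
                for (TopicMetadata tm : response.topicsMetadata()) {
                    for (PartitionMetadata pm : tm.partitionsMetadata()) {
                        // Compare the leader/ISR each broker reports.
                        System.out.println("asked " + host
                            + " -> partition " + pm.partitionId()
                            + " leader " + (pm.leader() == null ? "none" : pm.leader().id())
                            + " isr " + pm.isr());
                    }
                }
            } finally {
                consumer.close();
            }
        }
    }
}

If broker-2 keeps returning itself as the leader while the other brokers return broker 3, that would confirm the stale view you describe below.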
On Thu, Jun 19, 2014 at 11:01 PM, Abhinav Anand <ab.rv...@gmail.com> wrote:

> Hi Guozhang,
> I am using the 0.8.0 release.
>
> Regards,
> Abhinav
>
>
> On Thu, Jun 19, 2014 at 2:06 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hello Abhinav,
> >
> > Which Kafka version are you using?
> >
> > Guozhang
> >
> >
> > On Wed, Jun 18, 2014 at 1:40 AM, Abhinav Anand <ab.rv...@gmail.com> wrote:
> >
> > > Hi Guys,
> > > We have a 6 node cluster. Due to a network failure yesterday (the
> > > network was back up and running within a few minutes), a few of the
> > > brokers were not able to talk to each other or to the zookeeper. While
> > > investigating an issue in our Camus job, we identified that broker-2
> > > was misbehaving. The controller logs reflect that broker-2 is not
> > > alive, even though broker-2 was serving all metadata/consumer requests
> > > and its logs don't show any nasty errors.
> > >
> > > As you can see in the Camus logs below, any metadata request sent to
> > > broker-2 would return broker-2 as the leader, even though list-topic
> > > would show (1,3) as the ISR and 3 as the leader. This state is pretty
> > > disturbing, as it is difficult to detect and the cluster is not
> > > supposed to end up in it.
> > >
> > > #### *Log analysis* ####
> > > For brevity I have attached my analysis in the Kafka_split_brain.log
> > > file and summarized the key points in this mail.
> > >
> > > The network failure occurred somewhere around 2014-06-16 20:30 hrs.
> > >
> > > *### Controller logs (time ordered)*
> > >
> > > 1. The controller fails to send a request to broker 2:
> > >
> > > [2014-06-16 20:23:34,197] WARN [Controller-3-to-broker-2-send-thread],
> > > Controller 3 fails to send a request to broker 2
> > > (kafka.controller.RequestSendThread)
> > >
> > > 2. A NoReplicaOnlineException shows broker 2 as not live:
> > >
> > > [2014-06-16 21:40:14,351] ERROR Controller 3 epoch 63 initiated state
> > > change for partition [same-topic-name,40] from OfflinePartition to
> > > OnlinePartition failed (state.change.logger)
> > > kafka.common.NoReplicaOnlineException: No replica for partition
> > > [same-topic-name,40] is alive. Live brokers are: [Set(5, 1, 6, 3, 4)],
> > > Assigned replicas are: [List(2)]
> > >
> > > 3. After restarting the broker today, the controller elects 2 as
> > > leader:
> > >
> > > [2014-06-17 21:26:14,109] WARN [OfflinePartitionLeaderSelector]: No
> > > broker in ISR is alive for [bfd-logstream_9092-PROD,40]. Elect leader 2
> > > from live brokers 2. There's potential data loss.
> > > (kafka.controller.OfflinePartitionLeaderSelector)
> > >
> > > *### Broker-2 logs*
> > >
> > > 1. When the controller sends the LeaderAndIsr request, broker 2 warns
> > > that the request is invalid:
> > >
> > >    - [2014-06-16 20:23:31,482] WARN Broker 2 received invalid
> > >    LeaderAndIsr request with correlation id 0 from controller 3 epoch
> > >    63 with an older leader epoch 41 for partition [cse-sauron,1],
> > >    current leader epoch is 41 (state.change.logger)
> > >
> > > 2. On restart today:
> > >
> > >    - [2014-06-17 21:26:14,135] ERROR Broker 2 aborted the
> > >    become-follower state change with correlation id 32 from controller
> > >    3 epoch 63 for partition [bfd-logstream_9092-PROD,82] new leader -1
> > >    (state.change.logger)
> > >
> > >
> > > *### Where it all started (Camus logs)*
> > >
> > > We have a Camus job which runs every half hour; during some of
> > > yesterday's runs we had duplicate data creeping in. On investigation,
> > > we found that whenever the metadata request was sent to broker-id 2,
> > > broker-id 2 would declare itself as the leader. If the request was sent
> > > to the other brokers, the leader was always broker 3. In the list-topic
> > > output, brokers (1,3) were the ISR and 3 was the leader. Yet the Camus
> > > log for yesterday shows broker-2 as the leader for one of its runs.
> > >
> > >
> > >    - Job run at 2014-06-17 00:06:34
> > >       - 2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Fetching metadata from broker broker-1 with client id
> > >       cse-sauron for 0 topic(s) []
> > >       2014-06-17 00:06:34 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> > >       cse-sauron uri:tcp://broker-3-url leader:3 partition:0
> > >       offset:10693174 latest_offset:10693174
> > >    - Job run at 2014-06-17 00:24:58
> > >       - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Fetching metadata from broker 2 with client id cse-sauron for 0
> > >       topic(s) []
> > >       2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob
> > >       - The current offset was found to be more than the latest offset
> > >       - 2014-06-17 00:24:58 ERROR: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Moving to the earliest offset available
> > >       - 2014-06-17 00:24:58 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - cse-sauron uri:tcp://broker-2-url leader:2 partition:0
> > >       offset:9747106 latest_offset:10674802
> > >    - Job run at 2014-06-17 01:01:54
> > >       - 2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob
> > >       - Fetching metadata from broker dare-broker02:9092 with client id
> > >       cse-sauron for 0 topic(s) []
> > >       2014-06-17 01:01:54 INFO: com.linkedin.camus.etl.kafka.CamusJob -
> > >       cse-sauron uri:tcp://dare-broker02.sv.walmartlabs.com:9092
> > >       leader:3 partition:0 offset:10253311 latest_offset:10697656
> > >
> > > Let me know if you need any more details.
> > >
> > > --
> > > Abhinav Anand
> > >
> >
> >
> > --
> > -- Guozhang
> >
>
>
> --
> Abhinav Anand
>


--
-- Guozhang
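One more cross-check that may help when a broker's metadata responses disagree with list-topic: the leader/ISR the controller has actually committed lives in the partition state znode in ZooKeeper and can be read with the plain ZooKeeper Java client. A minimal, untested sketch; the connect string, topic name and partition number are placeholders:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class PartitionStateCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper connect string; use the one your brokers point at.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();

        // Kafka keeps the committed leader/ISR for each partition in this znode.
        String path = "/brokers/topics/cse-sauron/partitions/0/state";
        byte[] data = zk.getData(path, false, null);
        // Prints JSON along the lines of:
        // {"controller_epoch":63,"leader":3,"version":1,"leader_epoch":41,"isr":[1,3]}
        System.out.println(new String(data, "UTF-8"));

        zk.close();
    }
}

If this znode shows leader 3 and ISR [1,3] while broker-2's own metadata response claims broker-2 as the leader, that points to broker-2 holding a stale view rather than to a problem on the controller side.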