Hi Bob,

Thanks for reporting this. Yes, this is the current behavior when all brokers fail: whichever broker comes back first becomes the new leader and is treated as the source of truth. This increases availability, but previously committed data can be lost. This is what we call an unclean leader election. The other option is to wait until a broker from the in-sync replica set comes back before electing a new leader. This preserves all committed data at the expense of availability. The application can configure the system with the appropriate option based on its needs.
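For readers looking for the corresponding knob: in later Kafka releases this trade-off is exposed as a broker (and topic-level) configuration. A minimal sketch, assuming a version where the unclean.leader.election.enable property is available (it is not part of the original 0.8.0 release discussed here):

    # server.properties -- cluster-wide default
    # true  = allow any surviving replica to become leader
    #         (favors availability; previously committed data may be lost)
    # false = wait for a broker from the in-sync replica set before electing a leader
    #         (preserves committed data at the expense of availability)
    unclean.leader.election.enable=false

In versions that support topic-level config overrides, the same property can also be set per topic, so individual topics can opt into one behavior or the other.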
Thanks,

Jun

On Fri, Jun 21, 2013 at 4:08 PM, Bob Jervis <bjer...@visibletechnologies.com> wrote:

> I wanted to send this out because we saw this in some testing we were
> doing and wanted to advise the community of something to watch for in 0.8
> HA support.
>
> We have a two-machine cluster with replication factor 2. We took one
> machine offline and re-formatted the disk. We re-installed the Kafka
> software, but did not recreate any of the local disk files. The intention
> was to simply restart the broker process, but due to an error in the
> network config that took some time to diagnose, we ended up with both
> machines' brokers down.
>
> When we fixed the network config and restarted the brokers, we happened to
> start the broker on the rebuilt machine first. The net result was that when
> the healthy broker came back online, the rebuilt machine was already the
> leader, and because of the Zookeeper state, it forced the healthy broker to
> delete all of its topic data, thus wiping out the entire contents of the
> cluster.
>
> We are instituting operations procedures to safeguard against this
> scenario in the future (and fortunately we only blew away a test cluster),
> but this was a bit of a nasty surprise for a Friday.
>
> Bob Jervis
> Visibletechnologies