I wanted to send this out because we saw this in some testing we were doing and 
wanted to advise the community of something to watch for in 0.8 HA support.

We have a two machine cluster with replication factor 2.  We took one machine 
offline and re-formatted the disk.  We re-installed the Kafka software, but did 
not recreate any of the local disk files.  The intention was to simply re-start 
the broker process, but due to an error in the network config that took some 
time to diagnose, we ended up with the both machines' brokers down.

When we fixed the network config and restarted the brokers, we happened to 
start the broker on the rebuilt machine first.  The net result was when the 
healthy broker came back online, the rebuilt machine was already the leader and 
because of the Zookeeper state, it force the healthy broker to delete all of 
its topic data, thus wiping out the entire contents of the cluster.

We are instituting operations procedures to safeguard against this scenario in 
the future (and fortunately we only blew away a test cluster), but this was a 
bit of a nasty surprise for a Friday.

Bob Jervis
Visibletechnologies

Reply via email to