Operationalizing Zookeeper and common gotchas

Eno Thereska Mon, 18 Mar 2019 10:17:01 -0700

Hi folks,

The team here has come up with a couple of clarifying tips for
operationalizing Zookeeper for Kafka that we found missing from the
official documentation, and passed them along to share. If you find them
useful, I'm thinking of putting them on
https://cwiki.apache.org/confluence/display/KAFKA/FAQ. Meanwhile any
feedback is appreciated.


-------
Operationalizing Zookeeper FAQ

The discussion below uses a 3-instance Zookeeper cluster as an example. The
findings apply to a larger cluster as well, but you’ll need to adjust the
numbers.

- Does it make sense to have a config with only 2 ZooKeeper instances?
I.e., the zookeeper.properties file has entries for server 1 and server 2
only. A: No. A setup with 2 ZooKeeper instances cannot tolerate even 1
failure. If one of the instances fails, the remaining one will not be
functional, since there is no quorum majority (1 out of 2 is not a
majority). If you send the “stat” command to that remaining instance, the
output will be “This ZooKeeper instance is not currently serving
requests”.

- What if you end up with only 2 running Zookeeper instances, e.g., you
started with 3 but one failed? Isn’t that the same as the case above? A: No,
it’s not the same scenario. The 3-instance setup is configured to tolerate
1 instance being down. The 2 remaining ZooKeeper instances will continue
to function because the quorum majority (2 out of 3) is still there.
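
The quorum arithmetic behind the two answers above can be sketched in a few
lines of Python. This only illustrates the majority rule; it is not ZooKeeper
code, and the function names are made up for the example.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of instances that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if the running instances are a majority of the configured ensemble."""
    return running >= quorum_size(ensemble_size)


def tolerated_failures(ensemble_size: int) -> int:
    """How many instances can fail while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)
```

With these definitions, has_quorum(2, 1) is False (a 2-instance config
tolerates 0 failures), while has_quorum(3, 2) is True (a 3-instance config
tolerates 1 failure). What matters is the number of configured instances,
not the number that happen to be running.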

- I had a 3 ZooKeeper instance setup and one instance just failed. How
should I recover? A: Restart the failed instance with the same
configuration it had before (i.e., same “myid” file and same IP
address). It is not important to recover the data volume of the failed
instance, but it is a bonus if you do so. Once the instance comes up, it
will sync with the other 2 ZooKeeper instances and get all the data.

- I had a 3 ZooKeeper instance setup and two instances failed. How should I
recover? Is my ZooKeeper cluster even running at that point? A: First of
all, ZooKeeper is now unavailable and the remaining instance will show
“This ZooKeeper instance is not currently serving requests” if probed.
Second, you should make sure this situation is extremely rare: it should be
possible to recover the first failed instance before a second instance
fails. Third, bring up the two failed instances one by one without
changing anything in their config. As soon as the first of them is back,
the quorum majority (2 out of 3) is restored. As in the case above, it is
not important to recover the data volumes of the failed instances, but it
is a bonus if you do so. Once an instance comes up, it will sync with the
rest of the cluster and get all the data.
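
For probing an instance as described above, here is a minimal Python sketch
that sends the “stat” four-letter command over a plain TCP socket. The host
and port are whatever your instance uses (2181 is only the commonly used
client port), and the helper names are made up for this example. Newer
ZooKeeper releases may also require the command to be allowed via the
4lw.commands.whitelist setting, so treat this as an illustration rather than
a finished health check.

```python
import socket

NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def four_letter_word(host: str, port: int, command: str = "stat",
                     timeout: float = 5.0) -> str:
    """Send a four-letter command and return the server's full reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_serving(reply: str) -> bool:
    """True unless the reply is empty or is the 'not serving' message."""
    return bool(reply.strip()) and NOT_SERVING not in reply
```

For example, is_serving(four_letter_word("zk1.example.com", 2181)) would
return False for an instance that has lost quorum (the hostname is a
placeholder).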

- I had a 3 ZooKeeper instance setup and two instances failed. I can’t
recover the failed instances for whatever reason. What should I do? A: You
will have to restart the remaining healthy ZooKeeper instance in
“standalone” mode, then point all the brokers to this standalone ZooKeeper
(instead of all 3 ZooKeepers) and restart them.
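
On the broker side, “pointing to the standalone ZooKeeper” means changing the
zookeeper.connect setting in each broker's properties file. A small Python
sketch of that edit follows; the helper name and the zk1/zk2/zk3 hostnames
are hypothetical, and it assumes a plain key=value properties file.

```python
def point_at_standalone(properties_text: str, standalone: str) -> str:
    """Return broker properties with zookeeper.connect set to one host:port.

    Any chroot suffix (e.g. "/kafka") on the old value is preserved.
    """
    out = []
    for line in properties_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "zookeeper.connect":
            slash = value.find("/")
            chroot = value[slash:].strip() if slash != -1 else ""
            line = f"zookeeper.connect={standalone}{chroot}"
        out.append(line)
    return "\n".join(out)
```

E.g., a line zookeeper.connect=zk1:2181,zk2:2181,zk3:2181/kafka becomes
zookeeper.connect=zk1:2181/kafka when called with "zk1:2181". The brokers
still need a restart to pick up the change.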

- The ZooKeeper cluster is unavailable (for any of the reasons mentioned
above, e.g., no quorum, all instances have failed). What is the impact on
Kafka applications producing/consuming? What is the impact on admin tools
to manage topics and cluster? What is the impact on brokers? A:
Applications will be able to continue producing and consuming, at least
for a while. This holds if the ZooKeeper cluster is temporarily
unavailable but becomes available again (after a few minutes). On the
other hand, if the ZooKeeper cluster is permanently unavailable, then
applications will slowly start to see problems with producing/consuming,
especially if some brokers fail, because partition leadership will not be
moved to other brokers. So taking one extreme, if the ZooKeeper cluster is
down for a month, it is very likely that applications will get
produce/consume errors. Admin tools (e.g., those that create topics, set
ACLs or change configs) will not work. Brokers will not be impacted by
ZooKeeper being unavailable. They will periodically try to reconnect to
the ZooKeeper cluster. If you take care to use the same IP address for a
recovered ZooKeeper instance as it had before it failed, brokers will not
need to be restarted.
------

Cheers,
Eno
