Re: Operationalizing Zookeeper and common gotchas

Ryanne Dolan Mon, 18 Mar 2019 10:36:10 -0700

Eno, I found this useful, thanks.

Ryanne


On Mon, Mar 18, 2019, 12:16 PM Eno Thereska <eno.there...@gmail.com> wrote:

> Hi folks,
>
> The team here has come up with a couple of clarifying tips for
> operationalizing Zookeeper for Kafka that we found missing from the
> official documentation, and passed them along to share. If you find them
> useful, I'm thinking of putting them on
> https://cwiki.apache.org/confluence/display/KAFKA/FAQ. Meanwhile any
> feedback is appreciated.
>
> -------
> Operationalizing Zookeeper FAQ
>
> The discussion below uses a 3-instance Zookeeper cluster as an example. The
> findings apply to a larger cluster as well, but you’ll need to adjust the
> numbers.
>
> - Does it make sense to have a config with only 2 Zookeeper instances,
> i.e., a zookeeper.properties file with entries for server 1 and server 2
> only?
> A: No. A setup with 2 Zookeeper instances cannot tolerate even 1 failure.
> If one of the Zookeeper instances fails, the remaining one will not be
> functional since there is no quorum majority (1 out of 2 is not a
> majority). If you run the “stat” command against that remaining instance,
> the output will be “This ZooKeeper instance is not currently serving
> requests”.
>
> - What if you end up with only 2 running Zookeeper instances, e.g., you
> started with 3 but one failed? Isn’t that the same as the case above?
> A: No, it’s not the same scenario. The 3-instance setup tolerates 1
> instance being down, so the 2 remaining Zookeeper instances will continue
> to function because the quorum majority (2 out of 3) is still there.
>
> - I had a 3-instance Zookeeper setup and one instance just failed. How
> should I recover?
> A: Restart the failed instance with the same configuration it had before
> (i.e., the same “myid” file and the same IP address). Recovering the data
> volume of the failed instance is not required, but it is a bonus if you
> can. Once the instance comes up, it will sync with the other 2 Zookeeper
> instances and get all the data.
>
> - I had a 3-instance Zookeeper setup and two instances failed. How should
> I recover? Is my Zookeeper cluster even running at that point?
> A: First, ZooKeeper is now unavailable, and the remaining instance will
> show “This ZooKeeper instance is not currently serving requests” if
> probed. Second, you should make sure this situation is extremely rare: it
> should be possible to recover the first failed instance before the second
> one fails. Third, bring up the two failed instances one by one without
> changing anything in their config. As in the case above, recovering the
> data volumes of the failed instances is not required, but it is a bonus
> if you can. Once an instance comes up, it will sync with the surviving
> ZooKeeper instance and get all the data.
>
> - I had a 3-instance Zookeeper setup and two instances failed. I can’t
> recover the failed instances for whatever reason. What should I do?
> A: You will have to restart the remaining healthy ZooKeeper instance in
> “standalone” mode, then restart all the brokers and point them to this
> standalone ZooKeeper (instead of all 3 ZooKeepers).
>
> - The Zookeeper cluster is unavailable (for any of the reasons mentioned
> above, e.g., no quorum, all instances have failed). What is the impact on
> Kafka applications producing/consuming? What is the impact on admin tools
> to manage topics and cluster? What is the impact on brokers?
> A: Applications will be able to continue producing and consuming, at
> least for a while. This is true if the ZooKeeper cluster is temporarily
> unavailable but eventually becomes available again (after a few minutes).
> On the other hand, if the ZooKeeper cluster is permanently unavailable,
> applications will slowly start to see problems with producing/consuming,
> especially if some brokers fail, because the partition leaders will not
> be redistributed to other brokers. Taking one extreme, if the ZooKeeper
> cluster is down for a month, it is very likely that applications will get
> produce/consume errors. Admin tools (e.g., those that create topics, set
> ACLs or change configs) will not work. Brokers will not be impacted by
> Zookeeper being unavailable. They will periodically try to reconnect to
> the ZooKeeper cluster. If you take care to use the same IP address for a
> recovered Zookeeper instance as it had before it failed, brokers will not
> need to be restarted.
> ------
>
> Cheers,
> Eno
>
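
The majority arithmetic behind the first two answers in the quoted FAQ can be sketched in a few lines of Python. This is only an illustration of the "more than half of the configured instances" rule the FAQ describes; the function names are made up for this example and are not part of ZooKeeper or Kafka.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of instances that is a strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many configured instances can be down while a majority remains."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if the running instances form a majority of the configured ensemble."""
    return running >= quorum_size(ensemble_size)


# Print the majority size and failure tolerance for a few ensemble sizes.
for n in (2, 3, 5):
    print(f"configured={n} quorum={quorum_size(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

Note that the ensemble size is the number of configured instances, not the number currently running: has_quorum(3, 2) is true while has_quorum(2, 1) is false, which is why "3 configured, 1 failed" and "2 configured, 1 failed" behave differently.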

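The “stat” probe mentioned in the quoted FAQ can also be scripted. The following is a minimal sketch, assuming the instance listens on the default client port 2181 and accepts the “stat” four-letter command over a plain TCP connection (newer ZooKeeper releases may require the command to be enabled via 4lw.commands.whitelist). The helper names and the host name in the comment are hypothetical.

```python
import socket

# Message quoted in the FAQ for an instance that has no quorum.
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def four_letter_word(host: str, port: int = 2181, command: str = "stat",
                     timeout: float = 5.0) -> str:
    """Send a four-letter command and return whatever the server replies."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def reports_not_serving(response: str) -> bool:
    """True if the reply contains the 'not currently serving requests' message."""
    return NOT_SERVING in response


# Example with a hypothetical host name:
#   reply = four_letter_word("zk1.example.com")
#   print("no quorum" if reports_not_serving(reply) else reply)
```

The check only looks for the one message quoted in the FAQ, so any other reply (including an error about a disabled command) should be read by a human rather than treated as healthy.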