I'm new to this community and I don't necessarily have all the background around the complaints on ZooKeeper, but I'd like to give my opinion, even if biased. To introduce myself, in case folks here don't know me: I'm one of the people who originally developed ZooKeeper and designed the Zab protocol.
> On 01 Dec 2015, at 19:58, Jay Kreps <j...@confluent.io> wrote:
>
> Hey Joe,
>
> Thanks for raising this. People really want to get rid of the ZK
> dependency, I agree it is among the most asked for things. Let me give a
> quick critique and a more radical plan.

Putting aside the cases in which people already use or want to use something else, what's the concern here exactly? People are always trying to get rid of dependencies, but they give up when they start looking at the alternatives or at the difficulty of doing it themselves. Also, during these past few months of interacting with this community, I noticed at least one issue that had been around for a long time, was causing some pain, and was due to a poor understanding of ZK semantics. I'm not convinced that rolling your own will make that go away, unless you have a group of people dedicated to it, but then you're doing ZK, etcd, or Consul all over again, with the difference that it is in your backyard.

> I don't think making ZK pluggable is the right thing to do. I have a lot of
> experience with this dynamic of introducing plugins for core functionality
> because I previously worked on a key-value store called Voldemort in which
> we made both the protocol and storage engine totally pluggable. I
> originally felt this was a good thing both philosophically and practically,
> but in retrospect came to believe it was a huge mistake--what people really
> wanted was one really excellent implementation with the kind of insane
> levels of in-production usage and test coverage that infrastructure
> demands. Pluggability is actually really at odds with this, and the ability
> to actually abstract over some really meaty dependency like a storage
> engine never quite works.

Adding another layer of abstraction is likely to cause even more pain. Right now there is ZkUtils -> ZkClient -> ZooKeeper client.
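To make the layering point concrete, here is a toy sketch (all class names hypothetical, not the actual Kafka classes) of how each wrapper blurs the semantics of the layer below it: by the time an error surfaces at the top, the caller can no longer tell what actually happened underneath.

```java
// Hypothetical three-layer stack, loosely mirroring utils -> client -> raw client.

// Bottom layer: the raw client reports the failure mode explicitly.
class RawClient {
    static String read(String path) throws Exception {
        if (!path.equals("/exists")) throw new Exception("NoNode: " + path);
        return "data";
    }
}

// Middle layer: wraps checked exceptions into unchecked ones,
// burying the original failure type one level deeper.
class ClientWrapper {
    static String read(String path) {
        try {
            return RawClient.read(path);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

// Top layer: swallows errors and returns null, losing the distinction
// between "node absent" and "connection lost" entirely.
class Utils {
    static String read(String path) {
        try {
            return ClientWrapper.read(path);
        } catch (RuntimeException e) {
            return null;
        }
    }
}

public class LayeringDemo {
    public static void main(String[] args) {
        System.out.println(Utils.read("/exists"));  // prints: data
        System.out.println(Utils.read("/missing")); // prints: null -- but the caller can't tell why
    }
}
```

A fourth, pluggable layer on top of this stack would have to abstract over whatever each backend reports, blurring things further still.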
> People dislike the ZK dependency because it effectively doubles the
> operational load of Kafka--it doubles the amount of configuration,
> monitoring, and understanding needed.

I agree, but it isn't necessarily a problem with the dependency itself so much as with how people think about the dependency. Taking the recent security work as an example, SASL was done separately for the brokers and for ZK. It would make configuration easier if we just said "SASL" and under the hood configured whatever is necessary. Let me stress that I'm not complaining about the security work, and it is even possible that the two were separated for good reasons, like a ZK ensemble being shared, but I feel this is an example of how developers in this community tend to think of the two separately. Actually, one interesting point is that some of the security work was apparently borrowed from ZK. Not a problem, it is an open-source project after all, but there could be better synergy here.

> Replacing ZK with a similar system
> won't fix this problem though--all the other consensus services are equally
> complex (and often less mature)--and it will cause two new problems. First
> there will be a layer of indirection that will make reasoning and improving
> the ZK implementation harder. For example, note that your plug-in api
> doesn't seem to cover multi-get and multi-write, when we added that we
> would end up breaking all plugins. Each new thing will be like that. Ops
> tools, config, documentation, etc will no longer be able to include any
> coverage of ZK because we can't assume ZK so all that becomes much harder.
> The second problem is that this introduces a combinatorial testing problem.
> People say they want to swap out ZK but they are assuming whatever they
> swap in will work equally well. How will we know that is true? The only way
> is to explode out the testing to run with every possible plugin.

That's possibly less of a problem if you have a single implementation that the community supports.
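The multi-get/multi-write point above is worth making concrete. A minimal sketch of a hypothetical metadata-store SPI (names invented here, not the KIP-30 interface): once version 2 adds a batch operation, every plugin written against version 1 either stops compiling or, if the new method is given a default body, compiles fine and then fails at runtime.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical metadata-store SPI, version 1: single-key operations only.
interface MetadataStore {
    String get(String key);
    void set(String key, String value);

    // Version 2 later adds a batch operation. Making it a default method
    // keeps old plugins compiling, but they fail at runtime the moment
    // the broker starts relying on atomic multi-writes.
    default void multi(Map<String, String> updates) {
        throw new UnsupportedOperationException("plugin predates multi()");
    }
}

// A third-party plugin written against version 1 of the SPI.
class LegacyPlugin implements MetadataStore {
    private final Map<String, String> data = new HashMap<>();
    public String get(String key) { return data.get(key); }
    public void set(String key, String value) { data.put(key, value); }
}

public class SpiDemo {
    public static void main(String[] args) {
        MetadataStore store = new LegacyPlugin();
        store.set("a", "1");
        System.out.println(store.get("a")); // single-key ops still work: prints 1
        try {
            store.multi(Map.of("b", "2", "c", "3"));
        } catch (UnsupportedOperationException e) {
            System.out.println("multi failed: " + e.getMessage());
        }
    }
}
```

Every new capability (multi ops, TTLs, containers, quotas) repeats this dilemma for every plugin, which is exactly the combinatorial testing problem.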
Others can implement their own option, but the community doesn't necessarily have to maintain them or make sure they all work prior to releases.

> If you want to see this in action take a look at ActiveMQ. ActiveMQ is less
> a system than a family of co-operating plugins and a configuration language
> for assembling them. Software engineers and open source communities are
> really prone to this kind of thing because "we can just make it pluggable"
> ends any argument. But the actual implementation is a mess, and later
> improvements in their threading, I/O, and other core models simply couldn't
> be made across all the plugins.
>
> This blog post on configurability in UI is a really good summary of a
> similar dynamic:
> http://ometer.com/free-software-ui.html
>
> Anyhow, not to go too far off on a rant. Clearly I have plugin PTSD :-)
>
> I think instead we should explore the idea of getting rid of the zookeeper
> dependency and replace it with an internal facility. Let me explain what I
> mean. In terms of API what Kafka and ZK do is super different, but
> internally it is actually quite similar--they are both trying to maintain a
> CP log.

Yes, but interestingly your CP log depends on ZK for consensus. It is true that, conversely, the correctness of ZK depends on correctly replicating its txn log, but it is naive to think that the step from there to a correct broadcast primitive is small.

> What would actually make the system significantly simpler would be to
> reimplement the facilities you describe on top of Kafka's existing
> infrastructure--using the same log implementation, network stack, config,
> monitoring, etc. If done correctly this would dramatically lower the
> operational load of the system versus the current Kafka+ZK or proposed
> Kafka+X.

Perhaps the general understanding of these protocols has advanced enough that it no longer makes sense to depend on something like ZK, etcd, or Consul.
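The "replicated config database fed off the log" idea above boils down to deterministic replay: a sketch, with an assumed toy record format, of how any replica that applies the same ordered log reaches the same state. This is the core of the state-machine-replication approach being discussed, not any actual Kafka code.

```java
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: the config "database" is just a map rebuilt by
// replaying an ordered log of records. Determinism of apply() is what
// makes every replica converge to the same state.
class ConfigStateMachine {
    private final SortedMap<String, String> state = new TreeMap<>();

    void apply(String record) {
        // Assumed record format: "set <key> <value>" or "del <key>".
        String[] parts = record.split(" ");
        if (parts[0].equals("set")) state.put(parts[1], parts[2]);
        else if (parts[0].equals("del")) state.remove(parts[1]);
    }

    SortedMap<String, String> snapshot() { return new TreeMap<>(state); }
}

public class ReplayDemo {
    public static void main(String[] args) {
        List<String> log = List.of(
            "set leader broker-1",
            "set isr 1,2,3",
            "del isr",
            "set leader broker-2");
        ConfigStateMachine replicaA = new ConfigStateMachine();
        ConfigStateMachine replicaB = new ConfigStateMachine();
        for (String r : log) { replicaA.apply(r); replicaB.apply(r); }
        System.out.println(replicaA.snapshot());                         // prints: {leader=broker-2}
        System.out.println(replicaA.snapshot().equals(replicaB.snapshot())); // prints: true
    }
}
```

The hard part, as the rest of the thread notes, is not this replay loop but getting the ordered, fault-tolerant log underneath it right.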
One of the reasons for doing ZK in the first place was exactly to free developers from the burden of dealing with the complexity of broadcast/consensus protocols. The main problem I've seen more than once is that the homegrown version ends up abandoned because it isn't the main focus of the project, and causes more pain. One of the things that worked with ZK was having a community and a group of people responsible for thinking about it and maintaining it.

> I don't have a proposal for how this would work and it's some effort to
> scope it out. The obvious thing to do would just be to keep the existing
> ISR/Controller setup and rebuild the controller etc on a RAFT/Paxos impl
> using the Kafka network/log/etc and have a replicated config database
> (maybe rocksdb) that was fed off the log and shared by all nodes.

In principle, you need mechanisms like watches, ephemerals, and just pure storage. Just implement a replicated state machine with those operations. Watches and ephemerals make it a bit hard to use a REST API, but it is fine if you don't mind doing long polls. Perhaps this is too much detail for the current state of this discussion.

> If done well this could have the advantage of potentially allowing us to
> scale the number of partitions quite significantly (the k/v store would not
> need to be all in memory), though you would likely still have limits on the
> number of partitions per machine. This would make the minimum Kafka cluster
> size be just your replication factor.

The idea of not keeping it all in memory to scale isn't a bad one, especially if you use SSDs to cache data and such.

> People tend to feel that implementing things like RAFT or Paxos is too hard
> for mere mortals. But I actually think it is within our capabilities, and
> our testing capabilities as well as experience with this type of thing have
> improved to the point where we should not be scared off if it is the right
> path.

It is hard, but certainly not impossible.
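The three primitives named above (pure storage, watches, ephemerals) are small enough to model in a few lines. A toy single-node sketch of their semantics, roughly mirroring ZooKeeper's one-shot watches and session-scoped ephemerals; names are hypothetical and there is no replication or fault tolerance here:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of the three primitives: plain storage, one-shot watches,
// and ephemerals tied to a session that vanish when the session dies.
class MiniStore {
    private final Map<String, String> data = new HashMap<>();
    private final Map<String, Long> owner = new HashMap<>();          // path -> owning session
    private final Map<String, List<Runnable>> watches = new HashMap<>();

    void create(String path, String value) { data.put(path, value); fire(path); }

    void createEphemeral(String path, String value, long sessionId) {
        data.put(path, value);
        owner.put(path, sessionId);
        fire(path);
    }

    String get(String path) { return data.get(path); }

    // One-shot watch, as in ZooKeeper: fires once on the next change, then is gone.
    void watch(String path, Runnable callback) {
        watches.computeIfAbsent(path, p -> new ArrayList<>()).add(callback);
    }

    // When a session expires, its ephemerals disappear and watchers are notified.
    void expireSession(long sessionId) {
        List<String> dead = new ArrayList<>();
        owner.forEach((path, sid) -> { if (sid == sessionId) dead.add(path); });
        for (String path : dead) { data.remove(path); owner.remove(path); fire(path); }
    }

    private void fire(String path) {
        List<Runnable> cbs = watches.remove(path);
        if (cbs != null) cbs.forEach(Runnable::run);
    }
}

public class MiniStoreDemo {
    public static void main(String[] args) {
        MiniStore store = new MiniStore();
        store.createEphemeral("/controller", "broker-1", 42L);
        store.watch("/controller", () -> System.out.println("controller changed"));
        store.expireSession(42L);                     // prints: controller changed
        System.out.println(store.get("/controller")); // prints: null
    }
}
```

The semantics are simple; the difficulty the thread is debating is making these operations a replicated state machine that survives crashes and partitions, which is where the consensus protocol comes in.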
I have implemented quite a few Paxos prototypes over the years for experiments, and getting off the ground isn't hard, but getting an implementation that really works and is maintainable is difficult. I've seen more than one instance where people got it wrong and caused pain. But again, maybe times have changed and that won't happen here.

-Flavio

> This approach is likely more work than plugins (though maybe not, once you
> factor in all the docs, testing, etc) but if done correctly it would be an
> unambiguous step forward--a simpler, more scalable implementation with no
> operational dependencies.
>
> Thoughts?
>
> -Jay
>
> On Tue, Dec 1, 2015 at 11:12 AM, Joe Stein <joe.st...@stealth.ly> wrote:
>
>> I would like to start a discussion around the work that has started in
>> regards to KIP-30
>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>>
>> The impetus for working on this came a lot from the community. For the
>> last year(~+) it has been the most asked question at any talk I have given
>> (personally speaking). It has also come up a bit on the mailing list
>> talking about zkclient vs Curator. A lot of folks want to use Kafka, but
>> introducing dependencies is hard for the enterprise, so the goal behind
>> this is making using Kafka as easy as possible for operations teams. If
>> they are already supporting ZooKeeper they can keep doing that, but if
>> not, they (the users) want to use something else they are already
>> supporting that can plug in to do the same things.
>>
>> For the core project I think we should leave in upstream what we have.
>> This gives a great baseline regression for folks and makes the work for
>> "making what we have plug-able" a well-defined task (carve out, layer in
>> API impl, push back, tests pass).
>> From there, when folks want their implementation to be something besides
>> ZooKeeper, they can develop, test, and support that if they choose.
>>
>> We would like to suggest that the plugin interface be Java based, to
>> minimize dependencies for JVM implementations. This could be in another
>> directory, something TBD /<name>.
>>
>> If you have a server you want to get working but you aren't on the JVM,
>> don't be afraid: just think about a REST impl, and if you can work inside
>> of that you have some light RPC layers (this was the first-pass prototype
>> we did to flesh out the public api presented on the KIP).
>>
>> There are a lot of parts to working on this, and the more implementations
>> we have the better we can flesh out the public interface. I will leave the
>> technical details and design to the JIRA tickets that are linked through
>> the confluence page as these decisions come about and code starts for
>> reviews; we can target the specific modules, and having the context
>> separate is helpful, especially if multiple folks are working on it.
>> https://issues.apache.org/jira/browse/KAFKA-2916
>>
>> Do other folks want to build implementations? Maybe we should start a
>> confluence page for those, or use an existing one and add to it, so we can
>> coordinate some there too.
>>
>> Thanks!
>>
>> ~ Joe Stein
>> - - - - - - - - - - - - - - - - - - -
>> http://www.elodina.net
>> http://www.stealth.ly
>> - - - - - - - - - - - - - - - - - - -