Hey Joel,

I agree this isn't the most pressing thing for the project (i.e. we
shouldn't start now), but I think that is a separate question from
which direction we would head in.
I think whether it is "worth doing" kind of depends on what you
optimize for. If you consider just the problem of "getting rid of
zookeeper" in a large existing environment like LinkedIn, I agree this
doesn't have a ton of value. LinkedIn has already done the legwork of
operationalizing Zookeeper (which as you'll recall took us several
years to actually master). Once you've mastered it, though, it isn't a
big problem. I think the bigger improvement from that proposal from
LinkedIn's point of view would be removing the hard limit on partition
count by co-locating metadata and moving it out of memory.

But if instead you consider the average user adopting Kafka, they are
often starting with very small three-node clusters. Zookeeper is
effectively doubling their footprint (at least). They have several
years of operational learning to do to master it, and it roughly
doubles what they need to learn to run. For these people it is a
significant penalty. This is why this is the first question at
virtually every talk and one of the most requested things. I don't
think we can deny that people want this, and when you think about why,
and how their usage differs from a large established environment, I'm
not sure you can say they are wrong to want that. I suspect correcting
this would not particularly help big already established users who
have already mastered ZK but would probably roughly double the rate of
adoption.

-Jay

On Thu, Dec 3, 2015 at 7:18 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> I’m on an extended vacation but "a native consensus implementation in
> Kafka" is one of my favorite topics so I’m compelled to chime in :) I have
> actually been thinking along the same lines as what Jay is saying since
> reading the Raft paper - a couple of us at LinkedIn have had casual
> discussions about this (i.e., a Raft-like implementation embedded in Kafka)
> but at this point I'm not sure it is a practical or
> necessary effort for the project per se: I have heard from a few people
> that operating ZooKeeper is difficult but haven’t quite understood why this
> is so. It does need some care in selecting hardware and configuration but
> after that it should just work, right? I really think the major concerns are
> caused by incorrect usage in the application (i.e., Kafka), which results in
> weird Kafka cluster behavior for which ZooKeeper then gets blamed.
> Furthermore, Raft-like implementations sound doable, but I did have the
> concern that it would be very difficult in practice to perfect the
> implementation, and Flavio is assuring us that it is indeed difficult :)
>
> Although it is unclear to me if it is worthwhile doing a native consensus
> implementation in Kafka and although I’m willing to take Flavio’s word for
> it given his experience with ZK and although I'm generally a proponent
> of McIlroy's
> views in such decisions
> <https://en.wikipedia.org/wiki/Unix_philosophy#Doug_McIlroy_on_Unix_programming>,
> I do think it is a super cool project to try out anyway given that the
> commit log is Kafka’s core abstraction and I think the network layer/other
> layers are really solid. It may just turn out to be really, really elegant
> and will be a "nice-to-have" sort of win. I.e., I think it will be nice to not
> have to run one more service for Kafka and neat to have a native consensus
> implementation that we understand and know really well. At this point I’m
> not sure it will be a very big win though since I don’t quite see the
> issues in running ZooKeeper alongside Kafka and it will mean bringing in a
> fair deal of new complexity to Kafka itself.
>
> On the topic of wrapper libraries: from a developer perspective it is easy
> to use ZooKeeper incorrectly and I’m guessing there are more than a few
> anti-patterns in our code especially since we use a wrapper library. So I
> agree with Neha’s viewpoint that using wrappers such as zkclient/curator
> without being well-versed in the wrapper code itself is another cause for
> subtle bugs in our usage. Removing that wrapper is an interesting option
> but would again be a fairly significant effort - this is in effect going to
> be like reimplementing our own wrapper library around ZK. Another option is
> to just have a few committers/contributors be the “experts/owners” of
> our zkclient/zookeeper usage. E.g., we have contributed a few fixes back to
> zkclient after diving into the code while debugging various issues. However
> these may take some time to get merged in and subsequently pulled into
> Kafka.
>
> On Tue, Dec 1, 2015 at 5:49 PM, Jay Kreps <j...@confluent.io> wrote:
>
>> Hey Flavio,
>>
>> Yeah, I think we are largely in agreement on virtually all points.
>>
>> Where I saw ZK shine was really in in-house infrastructure. LinkedIn had a
>> dozen in-house systems that all used it, and it wouldn't have made sense
>> for any of those systems to build their own. Likewise, when we started Kafka
>> there were really only 1-3 developers for a very long time, so doing anything
>> more custom would have been out of reach. I guess the characteristic of
>> in-house infrastructure is that it has to be cheap to build, and it often
>> ends up having lots and lots of other system dependencies which is fine so
>> long as they are things you already run.
>>
>> For an open source product, though, you are kind of optimizing with a
>> different objective function. You are trying to make the thing easy to get
>> going with and willing to spend more time to get there. That out-of-the-box
>> experience of how easy it is to adopt and operationalize is the big issue
>> in how successful the system is. I think using external consensus systems
>> ends up not being quite as good here because many people won't already have
>> the dependency as part of their stack and for them you effectively double
>> the operational footprint they have to master. I think this is why this is
>> such a loud and persistent complaint (Joe is right, it is the first
>> question asked at every talk I give)--they want to adopt one distributed
>> thingy (Kafka) which is hard enough to monitor, configure, understand etc,
>> but to get that we make them learn a second one too. Even when people
>> already have zk in their stack that doesn't really help because it turns
>> out that sharing the zk cluster safely is usually not easy for people.
>>
>> Actually ZK itself is really great in this way--I think a lot of its
>> success comes from being a totally simple standalone component. If it
>> required some other system to run (dunno what, but say HDFS or MySQL or
>> whatever) I think it would be far less successful.
>>
>> Obviously a lot depends on how much is reused and what the conceptual
>> "footprint" of the implementation is. If you end up with something in Kafka
>> that is basically all standalone code (basically a ZK-like service but less
>> tested and in Scala) then that would be totally stupid. I think that is the
>> outcome you are describing.
>>
>> On the other hand it might be possible to reuse much more. I have not done
>> a design for something like this, so all of this is totally half-baked. But
>> imagine you implement Raft on Kafka (Kraft!). Basically you have a magical
>> Kafka log "metadata", and appends, recovery, and leadership for this log are
>> not handled using the normal ISR logic but go through a special code path
>> that implements the raft algorithm. A subset of 3-5 brokers would be full
>> participants in maintaining this log and the rest of the cluster would just
>> passively consume it. All brokers have a callback plugin for processing
>> events appended to the log (this is what handles things like topic creates
>> and deletes), and all have a local rocksdb key-value store replicating the
>> contents of the log which is queryable. This k/v store holds the various
>> topic configs, metadata entries, etc. You would implement a new
>> RequestVotes RPC for doing leadership election. The leader of the Raft
>> election would also serve as the controller for the duration of its term.
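>>
>> To make the shape of that a little more concrete--all the names below are
>> made up and this is only a half-baked illustration of the callback plus
>> local k/v store idea, not a real API:
>>
>>   import java.util.Map;
>>   import java.util.Optional;
>>
>>   // Hypothetical callback each broker registers to apply entries from the
>>   // replicated "metadata" log (topic creates/deletes, config changes, ...).
>>   interface MetadataEventListener {
>>       void onMetadataAppend(long offset, String eventType,
>>                             Map<String, String> payload);
>>   }
>>
>>   // Hypothetical local, queryable view of that log, e.g. backed by RocksDB
>>   // and kept current by replaying the log through the listener above.
>>   interface MetadataStore {
>>       Optional<Map<String, String>> topicConfig(String topic);
>>       long lastAppliedOffset();
>>   }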
>>
>> I'm not claiming this is well thought out, it's obviously not. It also
>> might not be what you ended up with if you fully rethought the relationship
>> between the consensus log and algorithm used for that and the data logs and
>> algorithm used for them. Buuut, pretend for a moment this is possible and
>> the implementation was not buggy. I claim that this setup would be much
>> operationally simpler than operationalizing, securing, monitoring, and
>> configuring both Kafka and a separate zk/consul/etcd cluster. The impact is
>> some additional logic for processing the metadata appends, maybe some
>> changes to the log implementation, and a new RPC for leadership election.
>>
>> Thoughts?
>>
>> -Jay
>>
>> On Tue, Dec 1, 2015 at 2:25 PM, Flavio Junqueira <f...@apache.org> wrote:
>>
>> > I'm new to this community and I don't necessarily have all the background
>> > around the complaints on ZooKeeper, but I'd like to give my opinion, even
>> > if biased. To introduce myself in case folks here don't know me, I'm
>> > one of the guys who developed ZooKeeper originally and designed the Zab
>> > protocol.
>> >
>> > > On 01 Dec 2015, at 19:58, Jay Kreps <j...@confluent.io> wrote:
>> > >
>> > > Hey Joe,
>> > >
>> > > Thanks for raising this. People really want to get rid of the ZK
>> > > dependency, I agree it is among the most asked for things. Let me give a
>> > > quick critique and a more radical plan.
>> > >
>> >
>> > Putting aside the cases in which people already use or want to use
>> > something else, what's the concern here exactly? People are always trying
>> > to get rid of dependencies but they give up when they start looking at the
>> > alternatives or the difficulty of doing it themselves. Also, during these
>> > past few months that I have been interacting with this community, I noticed
>> > at least one issue that had been around for a long time, was causing some
>> > pain, and was due to a poor understanding of the ZK semantics. I'm not
>> > convinced that doing your own will make it go away, unless you have a group
>> > of people dedicated to doing it, but then you're doing ZK, etcd, Consul all
>> > over again, with the difference that it is in your backyard.
>> >
>> > > I don't think making ZK pluggable is the right thing to do. I have a lot
>> > > of experience with this dynamic of introducing plugins for core
>> > > functionality because I previously worked on a key-value store called
>> > > Voldemort in which we made both the protocol and storage engine totally
>> > > pluggable. I originally felt this was a good thing both philosophically
>> > > and practically, but in retrospect came to believe it was a huge
>> > > mistake--what people really wanted was one really excellent
>> > > implementation with the kind of insane levels of in-production usage and
>> > > test coverage that infrastructure demands. Pluggability is actually
>> > > really at odds with this, and the ability to actually abstract over some
>> > > really meaty dependency like a storage engine never quite works.
>> >
>> > Adding another layer of abstraction is likely to cause even more pain.
>> > Right now there is ZkUtils -> ZkClient -> ZooKeeper client.
>> >
>> > >
>> > > People dislike the ZK dependency because it effectively doubles the
>> > > operational load of Kafka--it doubles the amount of configuration,
>> > > monitoring, and understanding needed.
>> >
>> > I agree, but it isn't necessarily a problem with the dependency itself, but
>> > with how people think about the dependency. Taking as an example the recent
>> > security work, SASL was done separately for brokers and ZK. It would make
>> > configuration easier if we just said SASL and under the hood configured
>> > what is necessary. Let me stress that I'm not complaining about the
>> > security work, and it is even possible that they were separated for
>> > different reasons, like a zk ensemble being shared, but I feel this is an
>> > example in which the developers in this community tend to think of it
>> > separately. Actually, one interesting point is that some of the security
>> > apparently was borrowed from zk. Not a problem, it is an open-source
>> > project after all, but there could be better synergy here.
>> >
>> > > Replacing ZK with a similar system won't fix this problem though--all
>> > > the other consensus services are equally complex (and often less
>> > > mature)--and it will cause two new problems. First there will be a layer
>> > > of indirection that will make reasoning about and improving the ZK
>> > > implementation harder. For example, note that your plug-in api doesn't
>> > > seem to cover multi-get and multi-write; when we added that we would end
>> > > up breaking all plugins. Each new thing will be like that. Ops tools,
>> > > config, documentation, etc. will no longer be able to include any
>> > > coverage of ZK because we can't assume ZK, so all that becomes much
>> > > harder. The second problem is that this introduces a combinatorial
>> > > testing problem. People say they want to swap out ZK but they are
>> > > assuming whatever they swap in will work equally well. How will we know
>> > > that is true? The only way is to explode out the testing to run with
>> > > every possible plugin.
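>> > >
>> > > To make the multi-op point concrete, imagine a strawman plugin interface
>> > > (names made up, not the KIP-30 api) and what happens when we later add a
>> > > batched operation--every plugin written against the old interface stops
>> > > compiling:
>> > >
>> > >   import java.util.List;
>> > >   import java.util.Map;
>> > >
>> > >   // Strawman only, purely to illustrate the API-evolution problem.
>> > >   interface MetadataPlugin {
>> > >       byte[] get(String path);
>> > >       void set(String path, byte[] value);
>> > >       // Added in a later release: every existing third-party plugin
>> > >       // must now implement this before it works again.
>> > >       Map<String, byte[]> multiGet(List<String> paths);
>> > >   }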
>> >
>> > That's possibly less of a problem if you have a single instance that the
>> > community supports. Others can implement their option, but the community
>> > doesn't necessarily have to maintain or make sure they all work prior to
>> > releases.
>> >
>> > >
>> > > If you want to see this in action take a look at ActiveMQ. ActiveMQ is
>> > > less a system than a family of co-operating plugins and a configuration
>> > > language for assembling them. Software engineers and open source
>> > > communities are really prone to this kind of thing because "we can just
>> > > make it pluggable" ends any argument. But the actual implementation is a
>> > > mess, and later improvements in their threading, I/O, and other core
>> > > models simply couldn't be made across all the plugins.
>> > >
>> > > This blog post on configurability in UI is a really good summary of a
>> > > similar dynamic:
>> > > http://ometer.com/free-software-ui.html
>> > >
>> > > Anyhow, not to go too far off on a rant. Clearly I have plugin PTSD :-)
>> > >
>> > > I think instead we should explore the idea of getting rid of the
>> > > zookeeper dependency and replace it with an internal facility. Let me
>> > > explain what I mean. In terms of API what Kafka and ZK do is super
>> > > different, but internally it is actually quite similar--they are both
>> > > trying to maintain a CP log.
>> >
>> > Yes, but interestingly your CP log depends on ZK for consensus. It is
>> > accurate that the correctness of ZK depends on a correct replication of the
>> > txn log, but it is naive to think that the step to get a correct broadcast
>> > primitive is small.
>> >
>> > >
>> > > What would actually make the system significantly simpler would be to
>> > > reimplement the facilities you describe on top of Kafka's existing
>> > > infrastructure--using the same log implementation, network stack, config,
>> > > monitoring, etc. If done correctly this would dramatically lower the
>> > > operational load of the system versus the current Kafka+ZK or proposed
>> > > Kafka+X.
>> >
>> > Perhaps the general understanding of these protocols has advanced enough
>> > that it doesn't make sense any longer to depend on something like ZK,
>> > etcd, or Consul. One of the reasons for doing ZK in the first place was
>> > exactly to free developers from the burden of dealing with the complexity
>> > of broadcast/consensus protocols. The main problem that I've seen more than
>> > once is that the homegrown version ends up abandoned because it isn't the
>> > main focus of the project, and causes more pain. One of the things that
>> > worked with ZK was to have a community and a group of people responsible
>> > for thinking about it and maintaining it.
>> >
>> > >
>> > > I don't have a proposal for how this would work and it's some effort to
>> > > scope it out. The obvious thing to do would just be to keep the existing
>> > > ISR/Controller setup and rebuild the controller etc. on a RAFT/Paxos impl
>> > > using the Kafka network/log/etc and have a replicated config database
>> > > (maybe rocksdb) that was fed off the log and shared by all nodes.
>> > >
>> >
>> > In principle, you need mechanisms like watches, ephemerals, and just pure
>> > storage. Just implement a replicated state machine with those operations.
>> > Watches and ephemerals make it a bit hard to do using a REST API, but it is
>> > fine if you don't care about doing long polls. Perhaps this is too much
>> > detail for the status of this discussion.
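>> >
>> > Just to sketch what I mean (method names here are illustrative only, not a
>> > concrete proposal), the replicated state machine would expose something
>> > like:
>> >
>> >   import java.util.Optional;
>> >   import java.util.function.Consumer;
>> >
>> >   // Illustrative surface: plain storage plus watches and ephemeral
>> >   // (session-bound) entries that go away when the session expires.
>> >   interface CoordinationStateMachine {
>> >       void put(String path, byte[] value);
>> >       Optional<byte[]> get(String path);
>> >       void putEphemeral(long sessionId, String path, byte[] value);
>> >       void watch(String path, Consumer<String> onChange);
>> >       void expireSession(long sessionId); // drops the session's ephemerals
>> >   }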
>> >
>> >
>> > > If done well this could have the advantage of potentially allowing us to
>> > > scale the number of partitions quite significantly (the k/v store would
>> > > not need to be all in memory), though you would likely still have limits
>> > > on the number of partitions per machine. This would make the minimum
>> > > Kafka cluster size be just your replication factor.
>> > >
>> >
>> > The idea of not having it in memory to scale isn't a bad one, especially if
>> > you use SSDs to cache data and such.
>> >
>> >
>> > > People tend to feel that implementing things like RAFT or Paxos is too
>> > > hard for mere mortals. But I actually think it is within our
>> > > capabilities, and our testing capabilities as well as experience with
>> > > this type of thing have improved to the point where we should not be
>> > > scared off if it is the right path.
>> >
>> > It is hard, but certainly not impossible. I have implemented quite a few
>> > Paxos prototypes over the years for experiments, and getting off the ground
>> > isn't hard, but getting an implementation that really works and that is
>> > maintainable is difficult. I've seen more than one instance where people
>> > got it wrong and caused pain. But again, maybe times have changed and that
>> > won't happen here.
>> >
>> > -Flavio
>> >
>> > >
>> > > This approach is likely more work than plugins (though maybe not, once
>> > > you factor in all the docs, testing, etc) but if done correctly it would
>> > > be an unambiguous step forward--a simpler, more scalable implementation
>> > > with no operational dependencies.
>> > >
>> > > Thoughts?
>> > >
>> > > -Jay
>> >
>> > >
>> > > On Tue, Dec 1, 2015 at 11:12 AM, Joe Stein <joe.st...@stealth.ly> wrote:
>> > >
>> > >> I would like to start a discussion around the work that has started in
>> > >> regards to KIP-30
>> > >>
>> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-30+-+Allow+for+brokers+to+have+plug-able+consensus+and+meta+data+storage+sub+systems
>> > >>
>> > >> The impetus for working on this came a lot from the community. For the
>> > >> last year(~+) it has been the most asked question at any talk I have
>> > >> given (personally speaking). It has also come up a bit on the mailing
>> > >> list talking about zkclient vs Curator. A lot of folks want to use
>> > >> Kafka, but introducing dependencies is hard for the enterprise, so the
>> > >> goal behind this is to make adopting Kafka as easy as possible for the
>> > >> operations teams that have to run it. If they are already supporting
>> > >> ZooKeeper they can keep doing that, but if not, users want to use
>> > >> something else they are already supporting that can plug in to do the
>> > >> same things.
>> > >>
>> > >> For the core project I think we should leave upstream what we have.
>> > >> This gives a great baseline regression for folks and makes the work of
>> > >> "making what we have plug-able" a well-defined task (carve out, layer in
>> > >> the API impl, make sure the existing tests still pass). From there, when
>> > >> folks want their implementation to be something besides ZooKeeper they
>> > >> can develop, test and support that if they choose.
>> > >>
>> > >> We would like to suggest that the plugin interface be Java based to
>> > >> minimize dependencies for JVM implementations. This could be in another
>> > >> directory, something TBD /<name>.
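>> > >>
>> > >> Nothing below is final--method names are made up just to give a rough
>> > >> sense of the kind of surface such a plugin might expose; the real
>> > >> interface will be worked out in the JIRAs:
>> > >>
>> > >>   import java.util.Map;
>> > >>   import java.util.Optional;
>> > >>
>> > >>   // Made-up, illustrative surface for a consensus/metadata storage
>> > >>   // plugin; brokers would talk to ZooKeeper (or something else)
>> > >>   // only through an interface like this.
>> > >>   interface MetadataStoragePlugin extends AutoCloseable {
>> > >>       void configure(Map<String, String> config);  // plugin settings
>> > >>       void set(String path, byte[] value);          // create or update
>> > >>       Optional<byte[]> get(String path);            // read if present
>> > >>       void delete(String path);
>> > >>       void addChangeListener(String path, Runnable onChange);
>> > >>   }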
>> > >>
>> > >> If you have a server you want to try to get working but you aren't on
>> > >> the JVM, don't be afraid: just think about a REST impl, and if you can
>> > >> work inside of that you have some light RPC layers (this was the first
>> > >> pass prototype we did to flesh out the public api presented on the KIP).
>> > >>
>> > >> There are a lot of parts to working on this, and the more
>> > >> implementations we have the better we can flesh out the public
>> > >> interface. I will leave the technical details and design to JIRA tickets
>> > >> that are linked through the confluence page as these decisions come
>> > >> about and code starts going up for review; we can target the specific
>> > >> modules, and having the context separate is helpful, especially if
>> > >> multiple folks are working on it.
>> > >> https://issues.apache.org/jira/browse/KAFKA-2916
>> > >>
>> > >> Do other folks want to build implementations? Maybe we should start a
>> > >> confluence page for those, or use an existing one and add to it, so we
>> > >> can coordinate some there too.
>> > >>
>> > >> Thanks!
>> > >>
>> > >> ~ Joe Stein
>> > >> - - - - - - - - - - - - - - - - - - -
>> > >>  http://www.elodina.net
>> > >>    http://www.stealth.ly
>> > >> - - - - - - - - - - - - - - - - - - -
>> > >>
>> >
>> >
>>
