Re: [DISCUSS] Road to Kafka 4.0

Luke Chen Thu, 21 Dec 2023 19:00:58 -0800

For release 3.8, I think we should also include the unclean leader election
support in KRaft.
But we can discuss more details in the KIP.


Thank you, Josep!
And thank you all for the comments!

Luke

On Fri, Dec 22, 2023 at 1:14 AM Ismael Juma <m...@ismaeljuma.com> wrote:

> Thank you Josep!
>
> Ismael
>
> On Thu, Dec 21, 2023, 9:09 AM Josep Prat <josep.p...@aiven.io.invalid>
> wrote:
>
> > Hi Ismael,
> >
> > I can volunteer to write the KIP. Unless somebody else has any
> objections,
> > I'll get to write it by the end of this week.
> >
> > Best,
> >
> > Josep Prat
> > Open Source Engineering Director, Aiven | josep.p...@aiven.io |
> > +491715557497 | aiven.io
> > Aiven Deutschland GmbH
> > Alexanderufer 3-7, 10117 Berlin
> > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> > Amtsgericht Charlottenburg, HRB 209739 B
> >
> > On Thu, Dec 21, 2023, 17:58 Ismael Juma <m...@ismaeljuma.com> wrote:
> >
> > > Hi all,
> > >
> > > After understanding the use case Josep and Anton described in more
> > detail,
> > > I think it's fair to say that quorum reconfiguration is necessary for
> > > migration of Apache Kafka users who follow this pattern. Given that, I
> > > think we should have a 3.8 release before the 4.0 release.
> > >
> > > The next question is whether we should do something special when it
> comes
> > > to timeline, parallel releases, etc. After careful consideration, I
> think
> > > we should simply follow our usual approach: regular 3.8 release around
> > > early May 2024 and regular 4.0 release around early September 2024. The
> > > community will be able to start working on items specific to 4.0 after
> > 3.8
> > > is branched in late March/early April - I don't think we need to deal
> > with
> > > the overhead of maintaining multiple long-lived branches for
> > > feature development.
> > >
> > > If the proposal above sounds reasonable, I suggest we write a KIP and
> > vote
> > > on it. Any volunteers?
> > >
> > > Ismael
> > >
> > > On Tue, Nov 21, 2023 at 8:18 PM Ismael Juma <m...@ismaeljuma.com> wrote:
> > >
> > > > Hi Luke,
> > > >
> > > > I think we're conflating different things here. There are 3 separate
> > > > points in your email, but only 1 of them requires 3.8:
> > > >
> > > > 1. JBOD may have some bugs in 3.7.0. Whatever bugs exist can be fixed
> > in
> > > > 3.7.x. We have already said that we will backport critical fixes to
> > 3.7.x
> > > > for some time.
> > > > 2. Quorum reconfiguration is important to include in 4.0, the release
> > > > where ZK won't be supported. This doesn't need a 3.8 release either.
> > > > 3. Quorum reconfiguration is necessary for migration use cases and
> > hence
> > > > needs to be in a 3.x release. This one would require a 3.8 release if
> > > true.
> > > > But we should have a debate on whether it is indeed true. It's not
> > clear
> > > to
> > > > me yet.
> > > >
> > > > Ismael
> > > >
> > > > On Tue, Nov 21, 2023 at 7:30 PM Luke Chen <show...@gmail.com> wrote:
> > > >
> > > >> Hi Colin and Jose,
> > > >>
> > > >> I revisited the discussion of KIP-833 here
> > > >> <https://lists.apache.org/thread/90zkqvmmw3y8j6tkgbg3md78m7hs4yn6>,
> > and
> > > >> you
> > > >> can see I'm the first one to reply to the discussion thread to
> express
> > > my
> > > >> excitement at that time. Till now, I personally still think having
> > KRaft
> > > >> in
> > > >> Kafka is a good direction we have to move forward. But to move to
> this
> > > >> destination, we need to make our users comfortable with this
> decision.
> > > The
> > > >> worst scenario is, we said 4.0 is ready, and ZK is removed. Then,
> some
> > > >> users move to 4.0 and say, wait a minute, why does it not support
> xxx
> > > >> feature? And then start to search for other alternatives to replace
> > > Apache
> > > >> Kafka. We all don't want to see this, right? So, that's why some
> > > community
> > > >> users start to express their concern to move to 4.0 too quickly,
> > > including
> > > >> me.
> > > >>
> > > >>
> > > >> Quoting Colin:
> > > >> > While dynamic quorum reconfiguration is a nice feature, it doesn't
> > > block
> > > >> anything: not migration, not deployment.
> > > >>
> > > > >> Clearly, the Confluent team might deploy ZooKeeper in a particular way
> and
> > > >> didn’t depend on its ability to support reconfiguration. So KRaft is
> > > ready
> > > >> from your point of view. But users of Apache Kafka might have come
> to
> > > >> depend on some ZooKeeper functionality, such as the ability to
> > > reconfigure
> > > >> ZooKeeper quorums, that is not available in KRaft, yet. I don’t
> think
> > > the
> > > >> Apache Kafka documentation has ever said “do not depend on this
> > ability
> > > of
> > > >> Apache Kafka or Zookeeper”, so it doesn’t seem unreasonable for
> users
> > to
> > > >> have deployed ZooKeeper in this way. In KIP-833
> > > >> <
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-MissingFeatures
> > > >> >,
> > > >> we said: “Modifying certain dynamic configurations on the standalone
> > > KRaft
> > > >> controller” was an important missing feature. Unfortunately it
> wasn’t
> > as
> > > >> explicit as it could have been. While no one expects KRaft to
> support
> > > all
> > > >> the features of ZooKeeper, it looks to me that users might depend on
> > > this
> > > >> particular feature and it’s only recently that it’s become apparent
> > that
> > > >> you don’t consider it a blocker.
> > > >>
> > > >> Quoting José:
> > > >> > If we do a 3.8 release before 4.0 and we implement KIP-853 in 3.8,
> > the
> > > >> user will be able to migrate to a KRaft cluster that supports
> > > dynamically
> > > >> changing the set of voters and has better support for disk failures.
> > > >>
> > > >> Yes, KIP-853 and disk failure support are both very important
> missing
> > > >> features. For the disk failure support, I don't think this is a
> > > >> "good-to-have-feature", it should be a "must-have" IMO. We can't
> > > announce
> > > >> the 4.0 release without a good solution for disk failure in KRaft.
> > > >>
> > > >> It’s also worth thinking about how Apache Kafka users who depend on
> > JBOD
> > > >> might look at the risks of not having a 3.8 release. JBOD support on
> > > KRaft
> > > >> is planned to be added in 3.7, and is still in progress so far. So
> > it’s
> > > >> hard to say it’s a blocker or not. But in practice, even if the
> > feature
> > > is
> > > >> made into 3.7 in time, a lot of new code for this feature is
> unlikely
> > to
> > > >> be
> > > >> entirely bug free. We need to maintain the confidence of those
> users,
> > > and
> > > >> forcing them to migrate through 3.7 where this new code is hardly
> > > >> battle-tested doesn’t appear to do that.
> > > >>
> > > >> Our goal for 4.0 should be that all the “main” features in KRaft are
> > in
> > > >> production ready state. To reach the goal, I think having one more
> > > release
> > > >> makes sense. We can have different opinions about what the “main
> > > features”
> > > >> in KRaft are, but we should all agree, JBOD is one of them.
> > > >>
> > > >> Alternatively, like Josep proposed, we can choose to have 4.0 +
> 3.7.x
> > or
> > > >> 3.8 releases in parallel to maintain these 2 releases for a defined
> > > >> period.
> > > >> But I think this is not a small effort to do that, especially as in
> > > v4.0,
> > > >> much of ZK code will be removed, thus the diff between codebases
> will
> > be
> > > >> large. In other words the additional costs of the backporting
> required
> > > >> with
> > > >> this alternative are likely to be higher than doing a 3.8 in my
> > opinion.
> > > >>
> > > >> Quoting José again:
> > > >> > What are the disadvantages of adding the 3.8 release before 4.0?
> > This
> > > >> would push the 4.0 release by 3-4 months. From what we can tell, it
> > > would
> > > >> also delay when KIP-896 can be implemented and extend how long the
> > > >> community needs to maintain the code used by ZK mode. Is there
> > anything
> > > >> else?
> > > >>
> > > >> If we agree with previous points, I think the disadvantages will
> just
> > > >> disappear. The 3-4 months delay, the maintenance effort, KIP-896,
> and
> > > >> maybe
> > > > >> you can also raise Scala 2.12 and Java 8 removal, which are not that
> > > >> critical compared with what I mentioned earlier that the worst case
> > > might
> > > > >> be that the users lose their confidence in Apache Kafka.
> > > >>
> > > >>
> > > >> Quoting Colin:
> > > >> > I would not want to delay that because we want an additional
> > feature.
> > > >> And
> > > >> we will always want additional features. So I am concerned we will
> end
> > > up
> > > >> in an infinite loop of people asking for "just one more feature"
> > before
> > > >> they migrate.
> > > >>
> > > > >> I totally agree with you. We can't keep delaying the 4.0 release
> > forever.
> > > >> I'd
> > > >> also like to draw a line to it. So, in my opinion, the 3.8 release
> is
> > > the
> > > >> line. No 3.9, 3.10 releases after that. If this is the decision,
> will
> > > your
> > > >> concern about this infinite loop disappear?
> > > >>
> > > >> Final note: Speaking of the missing features, I can always cooperate
> > > with
> > > >> you and all other community contributors to make them happen, like
> we
> > > have
> > > >> discussed earlier. Just let me know.
> > > >>
> > > >> Thank you.
> > > >> Luke
> > > >>
> > > >> On Wed, Nov 22, 2023 at 2:54 AM Colin McCabe <cmcc...@apache.org>
> > > wrote:
> > > >>
> > > >> > On Tue, Nov 21, 2023, at 03:47, Josep Prat wrote:
> > > >> > > Hi Colin,
> > > >> > >
> > > >> > > I think it's great that Confluent runs KRaft clusters in
> > production,
> > > >> > > and it means that it is production ready for Confluent and it's
> > > users.
> > > >> > > But luckily for Kafka, the community is bigger than this (self
> > > managed
> > > > >> > > in the cloud or on-prem, or customers of other SaaS companies).
> > > >> >
> > > >> > Hi Josep,
> > > >> >
> > > >> > Confluent is not the only company using or developing KRaft. Most
> of
> > > the
> > > >> > big organizations developing Kafka are involved. I mentioned
> > > Confluent's
> > > >> > deployments because I wanted to be clear that KRaft mode is not
> > > >> > experimental or new. Talking about software in production is a
> good
> > > way
> > > >> to
> > > >> > clear up these misconceptions.
> > > >> >
> > > >> > Indeed, KRaft mode is many years old. It started around 2020, and
> > > became
> > > > >> > production-ready in AK 3.3 in 2022. ZK mode was deprecated in AK
> > 3.5,
> > > >> which
> > > >> > was released June 2023. If we release AK 4.0 around April (or
> maybe
> > a
> > > >> month
> > > >> > or two later) then that will be almost a full year between
> > deprecation
> > > >> and
> > > >> > removal of ZK mode. We've talked about this a lot, in KIPs, in
> > Apache
> > > >> blog
> > > >> > posts, at conferences, and so forth.
> > > >> >
> > > >> > > We've heard at least from 1 SaaS company, Aiven (disclaimer, it
> is
> > > my
> > > >> > > employer) where the current feature set makes it not trivial to
> > > >> > > migrate. This same issue might happen not only at Aiven but with
> > any
> > > >> > > user of Kafka who uses immutable infrastructure.
> > > >> >
> > > >> > Can you discuss why you feel it is "not trivial to migrate"? From
> > the
> > > >> > discussion above, the main gap is that we should improve the
> > > >> documentation
> > > >> > for handling failed disks.
> > > >> >
> > > >> > > Another case is for
> > > >> > > users that have hundreds (or more) of clusters and more than
> 100k
> > > >> nodes
> > > >> > > experience node failures multiple times during a single day. In
> > this
> > > >> > > situation, not having KIP 853 makes these power users unable to
> > join
> > > > >> > > the game as introducing a new error-prone manual (or needed to
> > > >> > > automate) operation is usually a huge no-go.
> > > >> >
> > > >> > We have thousands of KRaft clusters in production and haven't seen
> > > these
> > > >> > problems, as I described above.
> > > >> >
> > > >> > best,
> > > >> > Colin
> > > >> >
> > > >> > >
> > > >> > > But I hear the concerns of delaying 4.0 for another 3 to 4
> months.
> > > >> > > Would it help if we would aim at shortening the timeline for
> 3.8.0
> > > and
> > > > >> > > start with the 4.0.0 a bit earlier?
> > > >> > > Maybe we could work on 3.8.0 almost in parallel with 4.0.0:
> > > >> > > - Start with 3.8.0 release process
> > > >> > > - After a small time (let's say a week) create the release
> branch
> > > >> > > - Start with 4.0.0 release process as usual
> > > >> > > - Cherry pick KRaft related issues to 3.8.0
> > > >> > > - Release 3.8.0
> > > >> > > I suspect 4.0.0 will need a bit more time than usual to ensure
> the
> > > >> code
> > > >> > > is cleaned up of deprecated classes and methods on top of the
> > usual
> > > >> > > work we have. For this reason I think there would be enough time
> > > >> > > between releasing 3.8.0 and 4.0.0.
> > > >> > >
> > > >> > > What do you all think?
> > > >> > >
> > > >> > > Best,
> > > >> > > Josep Prat
> > > >> > >
> > > >> > > On 2023/11/20 20:03:18 Colin McCabe wrote:
> > > >> > >> Hi Josep,
> > > >> > >>
> > > >> > >> I think there is some confusion here. Quorum reconfiguration is
> > not
> > > >> > needed for KRaft to become production ready. Confluent runs
> > thousands
> > > of
> > > >> > KRaft clusters without quorum reconfiguration, and has for years.
> > > While
> > > >> > dynamic quorum reconfiguration is a nice feature, it doesn't block
> > > >> > anything: not migration, not deployment. As best as I understand
> it,
> > > the
> > > >> > use-case Aiven has isn't even reconfiguration per se, just wiping
> a
> > > >> disk.
> > > >> > There are ways to handle this -- I discussed some earlier in the
> > > >> thread. I
> > > >> > think it would be productive to continue that discussion --
> > especially
> > > >> the
> > > >> > part around documentation and testing of these cases.
> > > >> > >>
> > > >> > >> A lot of people have done a lot of work to get Kafka 4.0
> ready. I
> > > >> would
> > > >> > not want to delay that because we want an additional feature. And
> we
> > > >> will
> > > >> > always want additional features. So I am concerned we will end up
> in
> > > an
> > > >> > infinite loop of people asking for "just one more feature" before
> > they
> > > >> > migrate.
> > > >> > >>
> > > >> > >> best,
> > > >> > >> Colin
> > > >> > >>
> > > >> > >>
> > > >> > >> On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote:
> > > >> > >> > Hi all,
> > > >> > >> >
> > > >> > >> > I wanted to share my opinion regarding this topic. I know
> some
> > > >> > >> > discussions happened some time ago (over a year) but I
> believe
> > > it's
> > > >> > >> > wise to reflect and re-evaluate if those decisions are still
> > > valid.
> > > > >> > >> > KRaft, as of Kafka 3.6.x and 3.7.x, has not yet reached feature
> parity
> > > with
> > > >> > >> > Zookeeper. By dropping Zookeeper altogether before achieving
> > such
> > > >> > >> > parity, we are opening the door to leaving a chunk of Apache
> > > Kafka
> > > >> > >> > users without an easy way to upgrade to 4.0.
> > > > >> > >> > In favor of making upgrades as smooth as possible, I propose to
> > > have
> > > >> a
> > > >> > >> > Kafka version where KIP-853 is merged and Zookeeper still is
> > > >> > supported.
> > > >> > >> > This will enable community members who can't migrate yet to
> > KRaft
> > > >> to
> > > >> > do
> > > > >> > >> > so in a safe way (rolling back if something goes wrong).
> > > >> > Additionally,
> > > > >> > >> > this will give us more confidence in having KRaft successfully
> > > > >> > >> > replace ZooKeeper without any big problems by
> discovering
> > > and
> > > >> > >> > fixing bugs or by confirming that KRaft works as expected.
> > > >> > >> > For this I strongly believe we should have a 3.8.x version
> > before
> > > >> > 4.0.x.
> > > >> > >> >
> > > > >> > >> > What do others think in this regard?
> > > >> > >> >
> > > >> > >> > Best,
> > > >> > >> >
> > > >> > >> > On 2023/11/14 20:47:10 Colin McCabe wrote:
> > > >> > >> >> On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote:
> > > >> > >> >> > Hi Colin,
> > > >> > >> >> >
> > > >> > >> >> > Thank you for your thoughtful and comprehensive response.
> > > >> > >> >> >
> > > >> > >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We
> > discussed
> > > >> this
> > > >> > in
> > > >> > >> >> >> several KIPs that happened this year and last year. The
> > most
> > > >> > notable was
> > > >> > >> >> >> probably KIP-866, which was approved in May 2022.
> > > >> > >> >> >
> > > >> > >> >> > I understand this is the case, I'm raising my concern
> > because
> > > I
> > > >> was
> > > >> > >> >> > foreseeing some major pain points as a consequence of this
> > > >> > decision. Just
> > > >> > >> >> > to make it clear though: I am not asking for anyone to do
> > work
> > > >> for
> > > >> > me, and
> > > >> > >> >> > I understand the limitations of resources available to
> > > implement
> > > >> > features.
> > > >> > >> >> > What I was asking is rather to consider the implications
> of
> > > >> > _removing_
> > > >> > >> >> > features before there exists a replacement for them.
> > > >> > >> >> >
> > > >> > >> >> > I understand that the timeframe for 3.7 isn't feasible,
> and
> > > >> > because of that
> > > >> > >> >> > I think what I was asking is rather: can we make sure that
> > > there
> > > >> > are more
> > > >> > >> >> > 3.x releases until controller quorum online resizing is
> > > >> > implemented?
> > > >> > >> >> >
> > > >> > >> >> > From your response, I gather that your stance is that it's
> > > >> > important to
> > > >> > >> >> > drop ZK support sooner rather than later and that the
> > > necessary
> > > >> > pieces for
> > > >> > >> >> > doing so are already in place.
> > > >> > >> >>
> > > >> > >> >> Hi Anton,
> > > >> > >> >>
> > > >> > >> >> Yes. I'm basically just repeating what we agreed upon in
> 2022
> > as
> > > >> > part of KIP-833.
> > > >> > >> >>
> > > >> > >> >> >
> > > >> > >> >> > ---
> > > >> > >> >> >
> > > >> > >> >> > I want to make sure I've understood your suggested
> sequence
> > > for
> > > >> > controller
> > > >> > >> >> > node replacement. I hope the mentions of Kubernetes are
> > rather
> > > >> for
> > > >> > examples
> > > >> > >> >> > of how to carry things out, rather than saying "this is
> only
> > > >> > supported on
> > > >> > >> >> > Kubernetes"?
> > > >> > >> >>
> > > >> > >> >> Apache Kafka is supported in lots of environments, including
> > > >> non-k8s
> > > >> > ones. I was just pointing out that using k8s means that you
> control
> > > your
> > > >> > own DNS resolution, which simplifies matters. If you don't control
> > DNS
> > > >> > there are some extra steps for changing the quorum voters.
> > > >> > >> >>
> > > >> > >> >> >
> > > >> > >> >> > Given we have three existing nodes as such:
> > > >> > >> >> >
> > > >> > >> >> > - a.local -> 192.168.0.100
> > > >> > >> >> > - b.local -> 192.168.0.101
> > > >> > >> >> > - c.local -> 192.168.0.102
> > > >> > >> >> >
> > > >> > >> >> > As well as a candidate node 192.168.0.103 that we want to
> > > >> replace
> > > >> > for the
> > > >> > >> >> > role of c.local.
> > > >> > >> >> >
> > > >> > >> >> > 1. Shut down controller process on node .102 (to make sure
> > we
> > > >> > don't "go
> > > >> > >> >> > back in time").
> > > >> > >> >> > 2. rsync state from leader to .103.
> > > >> > >> >> > 3. Start controller process on .103.
> > > >> > >> >> > 4. Point the c.local entry at .103.
> > > >> > >> >> >
> > > >> > >> >> > I have a few questions about this sequence:
> > > >> > >> >> >
> > > >> > >> >> > 1. Would this sequence be safe against leadership changes?
> > > >> > >> >> >
> > > >> > >> >>
> > > >> > >> >> If the leader changes, the new leader should have all of the
> > > >> > committed entries that the old leader had.
> > > >> > >> >>
> > > >> > >> >> > 2. Does it work
> > > >> > >> >>
> > > >> > >> >> Probably the biggest issue is dealing with "torn writes"
> that
> > > >> happen
> > > >> > because you're copying the current log segment while it's being
> > > written
> > > >> to.
> > > >> > The system should be robust against this. However, we don't
> > regularly
> > > do
> > > >> > this, so there hasn't been a lot of testing.
> > > >> > >> >>
> > > >> > >> >> I think Jose had a PR for improving the handling of this
> which
> > > we
> > > >> > might want to dig up. We'd want the system to auto-truncate the
> > > partial
> > > >> > record at the end of the log, if there is one.
> > > >> > >> >>
> > > >> > >> >> > 3. By "state", do we mean `metadata.log.dir`? Something
> > else?
> > > >> > >> >>
> > > >> > >> >> Yes, the state of the metadata.log.dir. Keep in mind you
> will
> > > need
> > > >> > to change the node ID in meta.properties after copying, of course.
> > > >> > >> >>
> > > >> > >> >> > 4. What are the effects on cluster availability? (I think
> > this
> > > >> is
> > > >> > the same
> > > >> > >> >> > as asking what happens if a or b crashes during the
> process,
> > > or
> > > >> if
> > > >> > network
> > > >> > >> >> > partitions occur).
> > > >> > >> >>
> > > > >> > >> >> Cluster metadata state tends to be pretty small, typically a
> > > >> hundred
> > > >> > megabytes or so. Therefore, I do not think it will take more than
> a
> > > >> second
> > > >> > or two to copy from one node to another. However, if you do
> > > experience a
> > > >> > crash when one node out of three is down, then you will be
> > unavailable
> > > >> > until you can bring up a second node to regain a majority.
> > > >> > >> >>
> > > >> > >> >> >
> > > >> > >> >> > ---
> > > >> > >> >> >
> > > >> > >> >> > If this is considered the official way of handling
> > controller
> > > >> node
> > > >> > >> >> > replacements, does it make sense to improve documentation
> in
> > > >> this
> > > >> > area? Is
> > > > >> > >> >> > there already a plan for this documentation laid out in
> > some
> > > >> > KIPs? This is
> > > >> > >> >> > something I'd be happy to contribute to.
> > > >> > >> >> >
> > > >> > >> >>
> > > >> > >> >> Yes, I think we should have official documentation about
> this.
> > > >> We'd
> > > >> > be happy to review anything in that area.
> > > >> > >> >>
> > > >> > >> >> >> To circle back to KIP-853, I think it stands a good
> chance
> > of
> > > >> > making it
> > > >> > >> >> >> into AK 4.0.
> > > >> > >> >> >
> > > >> > >> >> > This sounds good, but the point I was making was if we
> could
> > > >> have
> > > >> > a release
> > > >> > >> >> > with both KRaft and ZK supporting this feature to ease the
> > > >> > migration out of
> > > >> > >> >> > ZK.
> > > >> > >> >> >
> > > >> > >> >>
> > > >> > >> >> The problem is, supporting multiple controller
> implementations
> > > is
> > > >> a
> > > >> > huge burden. So we don't want to extend the 3.x release past the
> > point
> > > >> > that's needed to complete all the must-dos (SCRAM, delegation
> > tokens,
> > > >> JBOD)
> > > >> > >> >>
> > > >> > >> >> best,
> > > >> > >> >> Colin
> > > >> > >> >>
> > > >> > >> >>
> > > >> > >> >> > BR,
> > > >> > >> >> > Anton
> > > >> > >> >> >
> > > >> > >> >> > Den tors 9 nov. 2023 kl 23:04 skrev Colin McCabe <
> > > >> > cmcc...@apache.org>:
> > > >> > >> >> >
> > > >> > >> >> >> Hi Anton,
> > > >> > >> >> >>
> > > >> > >> >> >> It rarely makes sense to scale up and down the number of
> > > >> > controller nodes
> > > >> > >> >> >> in the cluster. Only one controller node will be active
> at
> > > any
> > > >> > given time.
> > > >> > >> >> >> The main reason to use 5 nodes would be to be able to
> > > tolerate
> > > >> 2
> > > >> > failures
> > > >> > >> >> >> instead of 1.
> > > >> > >> >> >>
> > > >> > >> >> >> At Confluent, we generally run KRaft with 3 controllers.
> We
> > > >> have
> > > >> > not seen
> > > >> > >> >> >> problems with this setup, even with thousands of
> clusters.
> > We
> > > >> have
> > > >> > >> >> >> discussed using 5 node controller clusters on certain
> very
> > > big
> > > >> > clusters,
> > > >> > >> >> >> but we haven't done that yet. This is all very similar to
> > ZK,
> > > >> > where most
> > > >> > >> >> >> deployments were 3 nodes as well.
> > > >> > >> >> >>
> > > >> > >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We
> > discussed
> > > >> this
> > > >> > in
> > > >> > >> >> >> several KIPs that happened this year and last year. The
> > most
> > > >> > notable was
> > > >> > >> >> >> probably KIP-866, which was approved in May 2022.
> > > >> > >> >> >>
> > > >> > >> >> >> Many users these days run in a Kubernetes environment
> where
> > > >> > Kubernetes
> > > >> > >> >> >> actually controls the DNS. This makes changing the set of
> > > >> voters
> > > >> > less
> > > >> > >> >> >> important than it was historically.
> > > >> > >> >> >>
> > > >> > >> >> >> For example, in a world with static DNS, you might have
> to
> > > >> change
> > > >> > the
> > > >> > >> >> >> controller.quorum.voters setting from:
> > > >> > >> >> >>
> > > >> > >> >> >> 100@a.local:9073,101@b.local:9073,102@c.local:9073
> > > >> > >> >> >>
> > > >> > >> >> >> to:
> > > >> > >> >> >>
> > > >> > >> >> >> 100@a.local:9073,101@b.local:9073,102@d.local:9073
> > > >> > >> >> >>
> > > >> > >> >> >> In a world with k8s controlling the DNS, you simply remap
> > > >> c.local
> > > >> > to point
> > > > >> > >> >> >> to the IP address of your new pod for controller 102, and
> > > >> you're
> > > >> > done. No
> > > >> > >> >> >> need to update controller.quorum.voters.
> > > >> > >> >> >>
> > > >> > >> >> >> Another question is whether you re-create the pod data
> from
> > > >> > scratch every
> > > >> > >> >> >> time you add a new node. If you store the controller data
> > on
> > > an
> > > >> > EBS volume
> > > >> > >> >> >> (or cloud-specific equivalent), you really only have to
> > > detach
> > > >> it
> > > >> > from the
> > > >> > >> >> >> previous pod and re-attach it to the new pod. k8s also
> > > handles
> > > >> > this
> > > >> > >> >> >> automatically, of course.
> > > >> > >> >> >>
> > > >> > >> >> >> If you want to reconstruct the full controller pod state
> > each
> > > >> > time you
> > > >> > >> >> >> create a new pod (for example, so that you can use only
> > > >> instance
> > > >> > storage),
> > > >> > >> >> >> you should be able to rsync that state from the leader.
> In
> > > >> > general, the
> > > >> > >> >> >> invariant that we want to maintain is that the state
> should
> > > not
> > > >> > "go back in
> > > >> > >> >> >> time" -- if controller 102 promised to hold all log data
> up
> > > to
> > > >> > offset X, it
> > > >> > >> >> >> should come back with committed data at at least that
> > offset.
> > > >> > >> >> >>
> > > >> > >> >> >> There are lots of new features we'd like to implement for
> > > >> KRaft,
> > > >> > and Kafka
> > > >> > >> >> >> in general. If you have some you really would like to
> see,
> > I
> > > >> > think everyone
> > > >> > >> >> >> in the community would be happy to work with you. The
> flip
> > > >> side,
> > > >> > of course,
> > > >> > >> >> >> is that since there are an unlimited number of features
> we
> > > >> could
> > > >> > do, we
> > > >> > >> >> >> can't really block the release for any one feature.
> > > >> > >> >> >>
> > > >> > >> >> >> To circle back to KIP-853, I think it stands a good
> chance
> > of
> > > >> > making it
> > > >> > >> >> >> into AK 4.0. Jose, Alyssa, and some other people have
> > worked
> > > on
> > > >> > it. It
> > > >> > >> >> >> definitely won't make it into 3.7, since we have only a
> few
> > > >> weeks
> > > >> > left
> > > >> > >> >> >> before that release happens.
> > > >> > >> >> >>
> > > >> > >> >> >> best,
> > > >> > >> >> >> Colin
> > > >> > >> >> >>
> > > >> > >> >> >>
> > > >> > >> >> >> On Thu, Nov 9, 2023, at 00:20, Anton Agestam wrote:
> > > >> > >> >> >> > Hi Luke,
> > > >> > >> >> >> >
> > > >> > >> >> >> > We have been looking into what switching from ZK to
> KRaft
> > > >> will
> > > >> > mean for
> > > >> > >> >> >> > Aiven.
> > > >> > >> >> >> >
> > > >> > >> >> >> > We heavily depend on an “immutable infrastructure”
> model
> > > for
> > > >> > deployments.
> > > >> > >> >> >> > This means that, when we perform upgrades, we introduce
> > new
> > > >> > nodes to our
> > > >> > >> >> >> > clusters, scale the cluster up to incorporate the new
> > > nodes,
> > > >> > and then
> > > >> > >> >> >> phase
> > > >> > >> >> >> > the old ones out once all partitions are moved to the
> new
> > > >> > generation.
> > > >> > >> >> >> This
> > > >> > >> >> >> > allows us, and anyone else using a similar model, to do
> > > >> > upgrades as well
> > > >> > >> >> >> as
> > > >> > >> >> >> > cluster resizing with zero downtime.
> > > >> > >> >> >> >
> > > >> > >> >> >> > Reading up on KRaft and the ZK-to-KRaft migration path,
> > > this
> > > >> is
> > > >> > somewhat
> > > >> > >> >> >> > worrying for us. It seems like, if KIP-853 is not
> > included
> > > >> > prior to
> > > >> > >> >> >> > dropping support for ZK, we will essentially have no
> > > >> satisfying
> > > >> > upgrade
> > > >> > >> >> >> > path. Even if KIP-853 is included in 4.0, I’m unsure if
> > > that
> > > >> > would allow
> > > >> > >> >> >> a
> > > >> > >> >> >> > migration path for us, since a new cluster generation
> > would
> > > >> not
> > > >> > be able
> > > >> > >> >> >> to
> > > >> > >> >> >> > use ZK during the migration step.
> > > >> > >> >> >> > On the other hand, if KIP-853 was released in a version
> > > prior
> > > >> > to dropping
> > > >> > >> >> >> > ZK support, because it allows online resizing of KRaft
> > > >> > clusters, this
> > > >> > >> >> >> would
> > > >> > >> >> >> > allow us and others that use an immutable
> infrastructure
> > > >> > deployment
> > > >> > >> >> >> model,
> > > >> > >> >> >> > to provide a zero downtime migration path.
> > > >> > >> >> >> >
> > > >> > >> >> >> > For that reason, we’d like to raise awareness around
> this
> > > >> issue
> > > >> > and
> > > >> > >> >> >> > encourage considering the implementation of KIP-853 or
> > > >> > equivalent a
> > > >> > >> >> >> blocker
> > > >> > >> >> >> > not only for 4.0, but for the last version prior to
> 4.0.
> > > >> > >> >> >> >
> > > >> > >> >> >> > BR,
> > > >> > >> >> >> > Anton
> > > >> > >> >> >> >
> > > >> > >> >> >> > On 2023/10/11 12:17:23 Luke Chen wrote:
> > > >> > >> >> >> >> Hi all,
> > > >> > >> >> >> >>
> > > > >> > >> >> >> >> Now that Kafka 3.6.0 is released, I’d like to start the
> > > >> > discussion for the
> > > >> > >> >> >> >> “road to Kafka 4.0”. Based on the plan in KIP-833
> > > >> > >> >> >> >> <
> > > >> > >> >> >> >
> > > >> > >> >> >>
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7
> > > >> > >> >> >> >>,
> > > >> > >> >> >> >> the next release 3.7 will be the final release before
> > > moving
> > > >> > to Kafka
> > > >> > >> >> >> 4.0
> > > > >> > >> >> >> >> to remove ZooKeeper from Kafka. Before making this
> > > major
> > > >> > change, I'd
> > > >> > >> >> >> >> like to get consensus on the "must-have features/fixes
> > for
> > > >> > Kafka 4.0",
> > > >> > >> >> >> to
> > > >> > >> >> >> >> avoid some users being surprised when upgrading to
> Kafka
> > > >> 4.0.
> > > >> > The intent
> > > >> > >> >> >> > is
> > > >> > >> >> >> >> to have a clear communication about what to expect in
> > the
> > > >> > following
> > > >> > >> >> >> > months.
> > > >> > >> >> >> >> In particular we should be signaling what features and
> > > >> > configurations
> > > >> > >> >> >> are
> > > >> > >> >> >> >> not supported, or at risk (if no one is able to add
> > > support
> > > >> or
> > > >> > fix known
> > > >> > >> >> >> >> bugs).
> > > >> > >> >> >> >>
> > > > >> > >> >> >> >> Here is the list of JIRA tickets
> > > >> > >> >> >> >> <
> > > >> >
> > https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker
> > > >
> > > >> > >> >> >> I
> > > > >> > >> >> >> >> labeled as "4.0-blocker". The criteria for labeling a ticket
> > > >> > “4.0-blocker” are:
> > > >> > >> >> >> >> 1. The feature is supported in Zookeeper Mode, but not
> > > >> > supported in
> > > >> > >> >> >> KRaft
> > > >> > >> >> >> >> mode, yet (ex: KIP-858: JBOD in KRaft)
> > > >> > >> >> >> >> 2. Critical bugs in KRaft, (ex: KAFKA-15489 : split
> > brain
> > > in
> > > >> > KRaft
> > > >> > >> >> >> >> controller quorum)
> > > >> > >> >> >> >>
> > > > >> > >> >> >> >> If you disagree with my current list, feel free to open a
> > > >> > discussion in the
> > > >> > >> >> >> >> specific JIRA ticket. Or, if you think there are some
> > > >> tickets
> > > >> > I missed,
> > > > >> > >> >> >> >> feel free to start a discussion in the JIRA ticket and
> > ping
> > > me
> > > >> > or other
> > > >> > >> >> >> >> people. After we get the consensus, we can
> label/unlabel
> > > it
> > > >> > afterwards.
> > > >> > >> >> >> >> Again, the goal is to have an open communication with
> > the
> > > >> > community
> > > >> > >> >> >> about
> > > >> > >> >> >> >> what will be coming in 4.0.
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> Below is the high level category of the list content:
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> 1. Recovery from disk failure
> > > >> > >> >> >> >> KIP-856
> > > >> > >> >> >> >> <
> > > >> > >> >> >> >
> > > >> > >> >> >>
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery
> > > >> > >> >> >> >>:
> > > >> > >> >> >> >> KRaft Disk Failure Recovery
> > > >> > >> >> >> >>
> > > > >> > >> >> >> >> 2. Pre-vote to support more than 3 controllers
> > > >> > >> >> >> >> KIP-650
> > > >> > >> >> >> >> <
> > > >> > >> >> >> >
> > > >> > >> >> >>
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics
> > > >> > >> >> >> >>:
> > > >> > >> >> >> >> Enhance Kafkaesque Raft semantics
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> 3. JBOD support
> > > >> > >> >> >> >> KIP-858
> > > >> > >> >> >> >> <
> > > >> > >> >> >> >
> > > >> > >> >> >>
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft
> > > >> > >> >> >> >>:
> > > >> > >> >> >> >> Handle
> > > >> > >> >> >> >> JBOD broker disk failure in KRaft
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> 4. Scale up/down Controllers
> > > >> > >> >> >> >> KIP-853
> > > >> > >> >> >> >> <
> > > >> > >> >> >> >
> > > >> > >> >> >>
> > > >> >
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes
> > > >> > >> >> >> >>:
> > > >> > >> >> >> >> KRaft Controller Membership Changes
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> 5. Modifying dynamic configurations on the KRaft
> > > controller
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> 6. Critical bugs in KRaft
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> Does this make sense?
> > > >> > >> >> >> >> Any feedback is welcomed.
> > > >> > >> >> >> >>
> > > >> > >> >> >> >> Thank you.
> > > >> > >> >> >> >> Luke
> > > >> > >> >> >> >>
> > > >> > >> >> >>
> > > >> > >> >>
> > > >> > >>
> > > >> >
> > > >>
> > > >
> > >
> >
>
