Hi all, The discussion thread for KIP-1012 (The need for a Kafka 3.8 release) can be found at https://lists.apache.org/thread/kvdp2gmq5gd9txkvxh5vk3z2n55b04s5
Best,

On Fri, Dec 22, 2023 at 4:00 AM Luke Chen <show...@gmail.com> wrote:
> For release 3.8, I think we should also include the unclean leader
> election support in KRaft.
> But we can discuss more details in the KIP.
>
> Thank you, Josep!
> And thank you all for the comments!
>
> Luke
>
> On Fri, Dec 22, 2023 at 1:14 AM Ismael Juma <m...@ismaeljuma.com> wrote:
> >
> > Thank you Josep!
> >
> > Ismael
> >
> > On Thu, Dec 21, 2023, 9:09 AM Josep Prat <josep.p...@aiven.io.invalid>
> > wrote:
> >
> > > Hi Ismael,
> > >
> > > I can volunteer to write the KIP. Unless somebody else has any
> > > objections, I'll get to write it by the end of this week.
> > >
> > > Best,
> > >
> > > Josep Prat
> > > Open Source Engineering Director, Aiven
> > > josep.p...@aiven.io | +491715557497 | aiven.io
> > > Aiven Deutschland GmbH
> > > Alexanderufer 3-7, 10117 Berlin
> > > Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
> > > Amtsgericht Charlottenburg, HRB 209739 B
> > >
> > > On Thu, Dec 21, 2023, 17:58 Ismael Juma <m...@ismaeljuma.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > After understanding the use case Josep and Anton described in more
> > > > detail, I think it's fair to say that quorum reconfiguration is
> > > > necessary for migration of Apache Kafka users who follow this
> > > > pattern. Given that, I think we should have a 3.8 release before
> > > > the 4.0 release.
> > > >
> > > > The next question is whether we should do something special when
> > > > it comes to timeline, parallel releases, etc. After careful
> > > > consideration, I think we should simply follow our usual approach:
> > > > regular 3.8 release around early May 2024 and regular 4.0 release
> > > > around early September 2024.
> The > > > > community will be able to start working on items specific to 4.0 > after > > > 3.8 > > > > is branched in late March/early April - I don't think we need to deal > > > with > > > > the overhead of maintaining multiple long-lived branches for > > > > feature development. > > > > > > > > If the proposal above sounds reasonable, I suggest we write a KIP and > > > vote > > > > on it. Any volunteers? > > > > > > > > Ismael > > > > > > > > On Tue, Nov 21, 2023 at 8:18 PM Ismael Juma <m...@ismaeljuma.com> > wrote: > > > > > > > > > Hi Luke, > > > > > > > > > > I think we're conflating different things here. There are 3 > separate > > > > > points in your email, but only 1 of them requires 3.8: > > > > > > > > > > 1. JBOD may have some bugs in 3.7.0. Whatever bugs exist can be > fixed > > > in > > > > > 3.7.x. We have already said that we will backport critical fixes to > > > 3.7.x > > > > > for some time. > > > > > 2. Quorum reconfiguration is important to include in 4.0, the > release > > > > > where ZK won't be supported. This doesn't need a 3.8 release > either. > > > > > 3. Quorum reconfiguration is necessary for migration use cases and > > > hence > > > > > needs to be in a 3.x release. This one would require a 3.8 release > if > > > > true. > > > > > But we should have a debate on whether it is indeed true. It's not > > > clear > > > > to > > > > > me yet. > > > > > > > > > > Ismael > > > > > > > > > > On Tue, Nov 21, 2023 at 7:30 PM Luke Chen <show...@gmail.com> > wrote: > > > > > > > > > >> Hi Colin and Jose, > > > > >> > > > > >> I revisited the discussion of KIP-833 here > > > > >> <https://lists.apache.org/thread/90zkqvmmw3y8j6tkgbg3md78m7hs4yn6 > >, > > > and > > > > >> you > > > > >> can see I'm the first one to reply to the discussion thread to > > express > > > > my > > > > >> excitement at that time. Till now, I personally still think having > > > KRaft > > > > >> in > > > > >> Kafka is a good direction we have to move forward. 
But to move to > > this > > > > >> destination, we need to make our users comfortable with this > > decision. > > > > The > > > > >> worst scenario is, we said 4.0 is ready, and ZK is removed. Then, > > some > > > > >> users move to 4.0 and say, wait a minute, why does it not support > > xxx > > > > >> feature? And then start to search for other alternatives to > replace > > > > Apache > > > > >> Kafka. We all don't want to see this, right? So, that's why some > > > > community > > > > >> users start to express their concern to move to 4.0 too quickly, > > > > including > > > > >> me. > > > > >> > > > > >> > > > > >> Quoting Colin: > > > > >> > While dynamic quorum reconfiguration is a nice feature, it > doesn't > > > > block > > > > >> anything: not migration, not deployment. > > > > >> > > > > >> Clearly Confluent team might deploy ZooKeeper in a particular way > > and > > > > >> didn’t depend on its ability to support reconfiguration. So KRaft > is > > > > ready > > > > >> from your point of view. But users of Apache Kafka might have come > > to > > > > >> depend on some ZooKeeper functionality, such as the ability to > > > > reconfigure > > > > >> ZooKeeper quorums, that is not available in KRaft, yet. I don’t > > think > > > > the > > > > >> Apache Kafka documentation has ever said “do not depend on this > > > ability > > > > of > > > > >> Apache Kafka or Zookeeper”, so it doesn’t seem unreasonable for > > users > > > to > > > > >> have deployed ZooKeeper in this way. In KIP-833 > > > > >> < > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-MissingFeatures > > > > >> >, > > > > >> we said: “Modifying certain dynamic configurations on the > standalone > > > > KRaft > > > > >> controller” was an important missing feature. Unfortunately it > > wasn’t > > > as > > > > >> explicit as it could have been. 
While no one expects KRaft to > > support > > > > all > > > > >> the features of ZooKeeper, it looks to me that users might depend > on > > > > this > > > > >> particular feature and it’s only recently that it’s become > apparent > > > that > > > > >> you don’t consider it a blocker. > > > > >> > > > > >> Quoting José: > > > > >> > If we do a 3.8 release before 4.0 and we implement KIP-853 in > 3.8, > > > the > > > > >> user will be able to migrate to a KRaft cluster that supports > > > > dynamically > > > > >> changing the set of voters and has better support for disk > failures. > > > > >> > > > > >> Yes, KIP-853 and disk failure support are both very important > > missing > > > > >> features. For the disk failure support, I don't think this is a > > > > >> "good-to-have-feature", it should be a "must-have" IMO. We can't > > > > announce > > > > >> the 4.0 release without a good solution for disk failure in KRaft. > > > > >> > > > > >> It’s also worth thinking about how Apache Kafka users who depend > on > > > JBOD > > > > >> might look at the risks of not having a 3.8 release. JBOD support > on > > > > KRaft > > > > >> is planned to be added in 3.7, and is still in progress so far. So > > > it’s > > > > >> hard to say it’s a blocker or not. But in practice, even if the > > > feature > > > > is > > > > >> made into 3.7 in time, a lot of new code for this feature is > > unlikely > > > to > > > > >> be > > > > >> entirely bug free. We need to maintain the confidence of those > > users, > > > > and > > > > >> forcing them to migrate through 3.7 where this new code is hardly > > > > >> battle-tested doesn’t appear to do that. > > > > >> > > > > >> Our goal for 4.0 should be that all the “main” features in KRaft > are > > > in > > > > >> production ready state. To reach the goal, I think having one more > > > > release > > > > >> makes sense. 
> We can have different opinions about what the “main features” in KRaft
> are, but we should all agree JBOD is one of them.
>
> Alternatively, like Josep proposed, we can choose to have 4.0 + 3.7.x or
> 3.8 releases in parallel and maintain these 2 releases for a defined
> period. But this is not a small effort, especially as in v4.0 much of
> the ZK code will be removed, so the diff between the codebases will be
> large. In other words, the additional costs of the backporting required
> with this alternative are likely to be higher than doing a 3.8, in my
> opinion.
>
> Quoting José again:
> > What are the disadvantages of adding the 3.8 release before 4.0? This
> > would push the 4.0 release by 3-4 months. From what we can tell, it
> > would also delay when KIP-896 can be implemented and extend how long
> > the community needs to maintain the code used by ZK mode. Is there
> > anything else?
>
> If we agree on the previous points, I think these disadvantages mostly
> fall away. The 3-4 month delay, the maintenance effort, KIP-896, and
> maybe you can also raise Scala 2.12 and Java 8 removal: these are not
> that critical compared with what I mentioned earlier, that the worst
> case might be users losing their confidence in Apache Kafka.
We can keep delaying the 4.0 release > > > forever. > > > > >> I'd > > > > >> also like to draw a line to it. So, in my opinion, the 3.8 release > > is > > > > the > > > > >> line. No 3.9, 3.10 releases after that. If this is the decision, > > will > > > > your > > > > >> concern about this infinite loop disappear? > > > > >> > > > > >> Final note: Speaking of the missing features, I can always > cooperate > > > > with > > > > >> you and all other community contributors to make them happen, like > > we > > > > have > > > > >> discussed earlier. Just let me know. > > > > >> > > > > >> Thank you. > > > > >> Luke > > > > >> > > > > >> On Wed, Nov 22, 2023 at 2:54 AM Colin McCabe <cmcc...@apache.org> > > > > wrote: > > > > >> > > > > >> > On Tue, Nov 21, 2023, at 03:47, Josep Prat wrote: > > > > >> > > Hi Colin, > > > > >> > > > > > > >> > > I think it's great that Confluent runs KRaft clusters in > > > production, > > > > >> > > and it means that it is production ready for Confluent and > it's > > > > users. > > > > >> > > But luckily for Kafka, the community is bigger than this (self > > > > managed > > > > >> > > in the cloud or in-prem, or customers of other SaaS > companies). > > > > >> > > > > > >> > Hi Josep, > > > > >> > > > > > >> > Confluent is not the only company using or developing KRaft. > Most > > of > > > > the > > > > >> > big organizations developing Kafka are involved. I mentioned > > > > Confluent's > > > > >> > deployments because I wanted to be clear that KRaft mode is not > > > > >> > experimental or new. Talking about software in production is a > > good > > > > way > > > > >> to > > > > >> > clear up these misconceptions. > > > > >> > > > > > >> > Indeed, KRaft mode is many years old. It started around 2020, > and > > > > became > > > > >> > production-ready in AK 3.5 in 2022. ZK mode was deprecated in AK > > > 3.5, > > > > >> which > > > > >> > was released June 2023. 
> If we release AK 4.0 around April (or maybe a month or two later) then
> that will be almost a full year between deprecation and removal of ZK
> mode. We've talked about this a lot, in KIPs, in Apache blog posts, at
> conferences, and so forth.
>
> > We've heard from at least one SaaS company, Aiven (disclaimer, it is
> > my employer), where the current feature set makes it not trivial to
> > migrate. This same issue might happen not only at Aiven but with any
> > user of Kafka who uses immutable infrastructure.
>
> Can you discuss why you feel it is "not trivial to migrate"? From the
> discussion above, the main gap is that we should improve the
> documentation for handling failed disks.
>
> > Another case is users that have hundreds (or more) of clusters and
> > more than 100k nodes, who experience node failures multiple times
> > during a single day. In this situation, not having KIP-853 makes
> > these power users unable to join the game, as introducing a new
> > error-prone manual (or to-be-automated) operation is usually a huge
> > no-go.
>
> We have thousands of KRaft clusters in production and haven't seen
> these problems, as I described above.
>
> best,
> Colin
>
> > But I hear the concerns of delaying 4.0 for another 3 to 4 months.
> > Would it help if we aimed at shortening the timeline for 3.8.0 and
> > started with 4.0.0 a bit earlier?
> > > > >> > > Maybe we could work on 3.8.0 almost in parallel with 4.0.0: > > > > >> > > - Start with 3.8.0 release process > > > > >> > > - After a small time (let's say a week) create the release > > branch > > > > >> > > - Start with 4.0.0 release process as usual > > > > >> > > - Cherry pick KRaft related issues to 3.8.0 > > > > >> > > - Release 3.8.0 > > > > >> > > I suspect 4.0.0 will need a bit more time than usual to ensure > > the > > > > >> code > > > > >> > > is cleaned up of deprecated classes and methods on top of the > > > usual > > > > >> > > work we have. For this reason I think there would be enough > time > > > > >> > > between releasing 3.8.0 and 4.0.0. > > > > >> > > > > > > >> > > What do you all think? > > > > >> > > > > > > >> > > Best, > > > > >> > > Josep Prat > > > > >> > > > > > > >> > > On 2023/11/20 20:03:18 Colin McCabe wrote: > > > > >> > >> Hi Josep, > > > > >> > >> > > > > >> > >> I think there is some confusion here. Quorum reconfiguration > is > > > not > > > > >> > needed for KRaft to become production ready. Confluent runs > > > thousands > > > > of > > > > >> > KRaft clusters without quorum reconfiguration, and has for > years. > > > > While > > > > >> > dynamic quorum reconfiguration is a nice feature, it doesn't > block > > > > >> > anything: not migration, not deployment. As best as I understand > > it, > > > > the > > > > >> > use-case Aiven has isn't even reconfiguration per se, just > wiping > > a > > > > >> disk. > > > > >> > There are ways to handle this -- I discussed some earlier in the > > > > >> thread. I > > > > >> > think it would be productive to continue that discussion -- > > > especially > > > > >> the > > > > >> > part around documentation and testing of these cases. > > > > >> > >> > > > > >> > >> A lot of people have done a lot of work to get Kafka 4.0 > > ready. I > > > > >> would > > > > >> > not want to delay that because we want an additional feature. 
> And > > we > > > > >> will > > > > >> > always want additional features. So I am concerned we will end > up > > in > > > > an > > > > >> > infinite loop of people asking for "just one more feature" > before > > > they > > > > >> > migrate. > > > > >> > >> > > > > >> > >> best, > > > > >> > >> Colin > > > > >> > >> > > > > >> > >> > > > > >> > >> On Mon, Nov 20, 2023, at 04:15, Josep Prat wrote: > > > > >> > >> > Hi all, > > > > >> > >> > > > > > >> > >> > I wanted to share my opinion regarding this topic. I know > > some > > > > >> > >> > discussions happened some time ago (over a year) but I > > believe > > > > it's > > > > >> > >> > wise to reflect and re-evaluate if those decisions are > still > > > > valid. > > > > >> > >> > KRaft, as of Kafka 3.6.x and 3.7.x, has not yet feature > > parity > > > > with > > > > >> > >> > Zookeeper. By dropping Zookeeper altogether before > achieving > > > such > > > > >> > >> > parity, we are opening the door to leaving a chunk of > Apache > > > > Kafka > > > > >> > >> > users without an easy way to upgrade to 4.0. > > > > >> > >> > In pro of making upgrades as smooth as possible, I propose > to > > > > have > > > > >> a > > > > >> > >> > Kafka version where KIP-853 is merged and Zookeeper still > is > > > > >> > supported. > > > > >> > >> > This will enable community members who can't migrate yet to > > > KRaft > > > > >> to > > > > >> > do > > > > >> > >> > so in a safe way (rolling back is something goes wrong). > > > > >> > Additionally, > > > > >> > >> > this will give us more confidence on having KRaft replacing > > > > >> > >> > successfully Zookeeper without any big problems by > > discovering > > > > and > > > > >> > >> > fixing bugs or by confirming that KRaft works as expected. > > > > >> > >> > For this I strongly believe we should have a 3.8.x version > > > before > > > > >> > 4.0.x. > > > > >> > >> > > > > > >> > >> > What do other think in this regard? 
> > > > >> > >> > > > > > >> > >> > Best, > > > > >> > >> > > > > > >> > >> > On 2023/11/14 20:47:10 Colin McCabe wrote: > > > > >> > >> >> On Tue, Nov 14, 2023, at 04:37, Anton Agestam wrote: > > > > >> > >> >> > Hi Colin, > > > > >> > >> >> > > > > > >> > >> >> > Thank you for your thoughtful and comprehensive > response. > > > > >> > >> >> > > > > > >> > >> >> >> KIP-853 is not a blocker for either 3.7 or 4.0. We > > > discussed > > > > >> this > > > > >> > in > > > > >> > >> >> >> several KIPs that happened this year and last year. The > > > most > > > > >> > notable was > > > > >> > >> >> >> probably KIP-866, which was approved in May 2022. > > > > >> > >> >> > > > > > >> > >> >> > I understand this is the case, I'm raising my concern > > > because > > > > I > > > > >> was > > > > >> > >> >> > foreseeing some major pain points as a consequence of > this > > > > >> > decision. Just > > > > >> > >> >> > to make it clear though: I am not asking for anyone to > do > > > work > > > > >> for > > > > >> > me, and > > > > >> > >> >> > I understand the limitations of resources available to > > > > implement > > > > >> > features. > > > > >> > >> >> > What I was asking is rather to consider the implications > > of > > > > >> > _removing_ > > > > >> > >> >> > features before there exists a replacement for them. > > > > >> > >> >> > > > > > >> > >> >> > I understand that the timeframe for 3.7 isn't feasible, > > and > > > > >> > because of that > > > > >> > >> >> > I think what I was asking is rather: can we make sure > that > > > > there > > > > >> > are more > > > > >> > >> >> > 3.x releases until controller quorum online resizing is > > > > >> > implemented? > > > > >> > >> >> > > > > > >> > >> >> > From your response, I gather that your stance is that > it's > > > > >> > important to > > > > >> > >> >> > drop ZK support sooner rather than later and that the > > > > necessary > > > > >> > pieces for > > > > >> > >> >> > doing so are already in place. 
> > > > >> > >> >> > > > > >> > >> >> Hi Anton, > > > > >> > >> >> > > > > >> > >> >> Yes. I'm basically just repeating what we agreed upon in > > 2022 > > > as > > > > >> > part of KIP-833. > > > > >> > >> >> > > > > >> > >> >> > > > > > >> > >> >> > --- > > > > >> > >> >> > > > > > >> > >> >> > I want to make sure I've understood your suggested > > sequence > > > > for > > > > >> > controller > > > > >> > >> >> > node replacement. I hope the mentions of Kubernetes are > > > rather > > > > >> for > > > > >> > examples > > > > >> > >> >> > of how to carry things out, rather than saying "this is > > only > > > > >> > supported on > > > > >> > >> >> > Kubernetes"? > > > > >> > >> >> > > > > >> > >> >> Apache Kafka is supported in lots of environments, > including > > > > >> non-k8s > > > > >> > ones. I was just pointing out that using k8s means that you > > control > > > > your > > > > >> > own DNS resolution, which simplifies matters. If you don't > control > > > DNS > > > > >> > there are some extra steps for changing the quorum voters. > > > > >> > >> >> > > > > >> > >> >> > > > > > >> > >> >> > Given we have three existing nodes as such: > > > > >> > >> >> > > > > > >> > >> >> > - a.local -> 192.168.0.100 > > > > >> > >> >> > - b.local -> 192.168.0.101 > > > > >> > >> >> > - c.local -> 192.168.0.102 > > > > >> > >> >> > > > > > >> > >> >> > As well as a candidate node 192.168.0.103 that we want > to > > > > >> replace > > > > >> > for the > > > > >> > >> >> > role of c.local. > > > > >> > >> >> > > > > > >> > >> >> > 1. Shut down controller process on node .102 (to make > sure > > > we > > > > >> > don't "go > > > > >> > >> >> > back in time"). > > > > >> > >> >> > 2. rsync state from leader to .103. > > > > >> > >> >> > 3. Start controller process on .103. > > > > >> > >> >> > 4. Point the c.local entry at .103. > > > > >> > >> >> > > > > > >> > >> >> > I have a few questions about this sequence: > > > > >> > >> >> > > > > > >> > >> >> > 1. 
Would this sequence be safe against leadership > changes? > > > > >> > >> >> > > > > > >> > >> >> > > > > >> > >> >> If the leader changes, the new leader should have all of > the > > > > >> > committed entries that the old leader had. > > > > >> > >> >> > > > > >> > >> >> > 2. Does it work > > > > >> > >> >> > > > > >> > >> >> Probably the biggest issue is dealing with "torn writes" > > that > > > > >> happen > > > > >> > because you're copying the current log segment while it's being > > > > written > > > > >> to. > > > > >> > The system should be robust against this. However, we don't > > > regularly > > > > do > > > > >> > this, so there hasn't been a lot of testing. > > > > >> > >> >> > > > > >> > >> >> I think Jose had a PR for improving the handling of this > > which > > > > we > > > > >> > might want to dig up. We'd want the system to auto-truncate the > > > > partial > > > > >> > record at the end of the log, if there is one. > > > > >> > >> >> > > > > >> > >> >> > 3. By "state", do we mean `metadata.log.dir`? Something > > > else? > > > > >> > >> >> > > > > >> > >> >> Yes, the state of the metadata.log.dir. Keep in mind you > > will > > > > need > > > > >> > to change the node ID in meta.properties after copying, of > course. > > > > >> > >> >> > > > > >> > >> >> > 4. What are the effects on cluster availability? (I > think > > > this > > > > >> is > > > > >> > the same > > > > >> > >> >> > as asking what happens if a or b crashes during the > > process, > > > > or > > > > >> if > > > > >> > network > > > > >> > >> >> > partitions occur). > > > > >> > >> >> > > > > >> > >> >> Cluster metadata state tends to be pretty small. > typically a > > > > >> hundred > > > > >> > megabytes or so. Therefore, I do not think it will take more > than > > a > > > > >> second > > > > >> > or two to copy from one node to another. 
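The copy-then-relabel step described above (copy the metadata.log.dir from the leader, then change the node ID in meta.properties before starting the controller) can be simulated locally. This is a sketch only: the directory paths and the meta.properties contents below are hypothetical stand-ins, and in a real replacement the copy would be an rsync from the leader over the network rather than a local `cp`.

```shell
# Simulate the state copy locally; in production the source would be the
# leader's metadata.log.dir, reached e.g. via rsync over SSH.
LEADER_DIR=$(mktemp -d)    # stands in for the leader's metadata.log.dir
NEW_NODE_DIR=$(mktemp -d)  # stands in for the new node (.103)

# Hypothetical meta.properties as it might exist on the leader (node 100).
printf 'node.id=100\ncluster.id=some-cluster-id\n' > "$LEADER_DIR/meta.properties"

# Step 2 of the sequence: copy the state wholesale.
cp -a "$LEADER_DIR/." "$NEW_NODE_DIR/"

# Per the note above: rewrite the node ID to the replaced controller's ID
# (102, i.e. c.local) before starting the controller process on the new node.
sed -i 's/^node\.id=.*/node.id=102/' "$NEW_NODE_DIR/meta.properties"

cat "$NEW_NODE_DIR/meta.properties"
```

The cluster.id must stay unchanged, since the new node is joining the same cluster; only node.id is rewritten.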
> However, if you do experience a crash when one node out of three is
> down, then you will be unavailable until you can bring up a second node
> to regain a majority.
>
> > ---
> >
> > If this is considered the official way of handling controller node
> > replacements, does it make sense to improve documentation in this
> > area? Is there already a plan for this documentation laid out in some
> > KIPs? This is something I'd be happy to contribute to.
>
> Yes, I think we should have official documentation about this. We'd be
> happy to review anything in that area.
>
> > > To circle back to KIP-853, I think it stands a good chance of
> > > making it into AK 4.0.
> >
> > This sounds good, but the point I was making was whether we could
> > have a release with both KRaft and ZK supporting this feature to ease
> > the migration out of ZK.
>
> The problem is, supporting multiple controller implementations is a
> huge burden. So we don't want to extend the 3.x release past the point
> that's needed to complete all the must-dos (SCRAM, delegation tokens,
> JBOD).
>
> best,
> Colin
>
> > BR,
> > Anton
> >
> > On Thu, Nov 9, 2023 at 23:04 Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi Anton,
> > >
> > > It rarely makes sense to scale up and down the number of controller
> > > nodes in the cluster. Only one controller node will be active at
> > > any given time. The main reason to use 5 nodes would be to be able
> > > to tolerate 2 failures instead of 1.
> > >
> > > At Confluent, we generally run KRaft with 3 controllers. We have
> > > not seen problems with this setup, even with thousands of clusters.
> > > We have discussed using 5-node controller clusters on certain very
> > > big clusters, but we haven't done that yet. This is all very
> > > similar to ZK, where most deployments were 3 nodes as well.
> > >
> > > KIP-853 is not a blocker for either 3.7 or 4.0. We discussed this
> > > in several KIPs that happened this year and last year. The most
> > > notable was probably KIP-866, which was approved in May 2022.
> > >
> > > Many users these days run in a Kubernetes environment where
> > > Kubernetes actually controls the DNS. This makes changing the set
> > > of voters less important than it was historically.
> > > For example, in a world with static DNS, you might have to change
> > > the controller.quorum.voters setting from:
> > >
> > > 100@a.local:9073,101@b.local:9073,102@c.local:9073
> > >
> > > to:
> > >
> > > 100@a.local:9073,101@b.local:9073,102@d.local:9073
> > >
> > > In a world with k8s controlling the DNS, you simply remap c.local
> > > to point to the IP address of your new pod for controller 102, and
> > > you're done. No need to update controller.quorum.voters.
> > >
> > > Another question is whether you re-create the pod data from scratch
> > > every time you add a new node. If you store the controller data on
> > > an EBS volume (or cloud-specific equivalent), you really only have
> > > to detach it from the previous pod and re-attach it to the new pod.
> > > k8s also handles this automatically, of course.
> > >
> > > If you want to reconstruct the full controller pod state each time
> > > you create a new pod (for example, so that you can use only
> > > instance storage), you should be able to rsync that state from the
> > > leader.
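For reference, the static-DNS change described above, written out as it might appear in the properties file of each broker and controller (hostnames and port taken from the example; whether this lives in server.properties or a controller-specific file depends on the deployment):

```properties
# Before replacing c.local (controller 102) -- static-DNS world:
controller.quorum.voters=100@a.local:9073,101@b.local:9073,102@c.local:9073

# After moving controller 102 to the new host d.local, every node needs:
controller.quorum.voters=100@a.local:9073,101@b.local:9073,102@d.local:9073
```

Note that without dynamic reconfiguration this value must be updated consistently on all nodes, which is part of why KIP-853 matters for the immutable-infrastructure case discussed in this thread.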
> > In > > > > >> > general, the > > > > >> > >> >> >> invariant that we want to maintain is that the state > > should > > > > not > > > > >> > "go back in > > > > >> > >> >> >> time" -- if controller 102 promised to hold all log > data > > up > > > > to > > > > >> > offset X, it > > > > >> > >> >> >> should come back with committed data at at least that > > > offset. > > > > >> > >> >> >> > > > > >> > >> >> >> There are lots of new features we'd like to implement > for > > > > >> KRaft, > > > > >> > and Kafka > > > > >> > >> >> >> in general. If you have some you really would like to > > see, > > > I > > > > >> > think everyone > > > > >> > >> >> >> in the community would be happy to work with you. The > > flip > > > > >> side, > > > > >> > of course, > > > > >> > >> >> >> is that since there are an unlimited number of features > > we > > > > >> could > > > > >> > do, we > > > > >> > >> >> >> can't really block the release for any one feature. > > > > >> > >> >> >> > > > > >> > >> >> >> To circle back to KIP-853, I think it stands a good > > chance > > > of > > > > >> > making it > > > > >> > >> >> >> into AK 4.0. Jose, Alyssa, and some other people have > > > worked > > > > on > > > > >> > it. It > > > > >> > >> >> >> definitely won't make it into 3.7, since we have only a > > few > > > > >> weeks > > > > >> > left > > > > >> > >> >> >> before that release happens. > > > > >> > >> >> >> > > > > >> > >> >> >> best, > > > > >> > >> >> >> Colin > > > > >> > >> >> >> > > > > >> > >> >> >> > > > > >> > >> >> >> On Thu, Nov 9, 2023, at 00:20, Anton Agestam wrote: > > > > >> > >> >> >> > Hi Luke, > > > > >> > >> >> >> > > > > > >> > >> >> >> > We have been looking into what switching from ZK to > > KRaft > > > > >> will > > > > >> > mean for > > > > >> > >> >> >> > Aiven. > > > > >> > >> >> >> > > > > > >> > >> >> >> > We heavily depend on an “immutable infrastructure” > > model > > > > for > > > > >> > deployments. 
> > > > >> > >> >> >> > This means that, when we perform upgrades, we > introduce > > > new > > > > >> > nodes to our > > > > >> > >> >> >> > clusters, scale the cluster up to incorporate the new > > > > nodes, > > > > >> > and then > > > > >> > >> >> >> phase > > > > >> > >> >> >> > the old ones out once all partitions are moved to the > > new > > > > >> > generation. > > > > >> > >> >> >> This > > > > >> > >> >> >> > allows us, and anyone else using a similar model, to > do > > > > >> > upgrades as well > > > > >> > >> >> >> as > > > > >> > >> >> >> > cluster resizing with zero downtime. > > > > >> > >> >> >> > > > > > >> > >> >> >> > Reading up on KRaft and the ZK-to-KRaft migration > path, > > > > this > > > > >> is > > > > >> > somewhat > > > > >> > >> >> >> > worrying for us. It seems like, if KIP-853 is not > > > included > > > > >> > prior to > > > > >> > >> >> >> > dropping support for ZK, we will essentially have no > > > > >> satisfying > > > > >> > upgrade > > > > >> > >> >> >> > path. Even if KIP-853 is included in 4.0, I’m unsure > if > > > > that > > > > >> > would allow > > > > >> > >> >> >> a > > > > >> > >> >> >> > migration path for us, since a new cluster generation > > > would > > > > >> not > > > > >> > be able > > > > >> > >> >> >> to > > > > >> > >> >> >> > use ZK during the migration step. > > > > >> > >> >> >> > On the other hand, if KIP-853 was released in a > version > > > > prior > > > > >> > to dropping > > > > >> > >> >> >> > ZK support, because it allows online resizing of > KRaft > > > > >> > clusters, this > > > > >> > >> >> >> would > > > > >> > >> >> >> > allow us and others that use an immutable > > infrastructure > > > > >> > deployment > > > > >> > >> >> >> model, > > > > >> > >> >> >> > to provide a zero downtime migration path. 
> > For that reason, we'd like to raise awareness around this issue and encourage considering the implementation of KIP-853, or an equivalent, a blocker not only for 4.0, but also for the last version prior to 4.0.
> >
> > BR,
> > Anton
> >
> > On 2023/10/11 12:17:23 Luke Chen wrote:
> > > Hi all,
> > >
> > > Now that Kafka 3.6.0 is released, I'd like to start the discussion for the "road to Kafka 4.0". Based on the plan in KIP-833
> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7>,
> > > the next release, 3.7, will be the final release before moving to Kafka 4.0 and removing ZooKeeper from Kafka. Before making this major change, I'd like to get consensus on the "must-have features/fixes for Kafka 4.0", to avoid some users being surprised when upgrading to Kafka 4.0. The intent is to have clear communication about what to expect in the following months.
> > > In particular, we should be signaling which features and configurations are not supported, or are at risk (if no one is able to add support or fix known bugs).
> > >
> > > Here is the JIRA ticket list
> > > <https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker>
> > > I labeled as "4.0-blocker". The criteria I used for the "4.0-blocker" label are:
> > > 1. The feature is supported in ZooKeeper mode, but not yet supported in KRaft mode (ex: KIP-858: JBOD in KRaft)
> > > 2. Critical bugs in KRaft (ex: KAFKA-15489: split brain in the KRaft controller quorum)
> > >
> > > If you disagree with my current list, you are welcome to discuss it in the specific JIRA ticket. Or, if you think there are some tickets I missed, you are welcome to start a discussion in the JIRA ticket and ping me or other people. After we reach consensus, we can label/unlabel it accordingly. Again, the goal is to have open communication with the community about what will be coming in 4.0.
> > >
> > > Below are the high-level categories of the list content:
> > >
> > > 1. Recovery from disk failure
> > > KIP-856 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery>: KRaft Disk Failure Recovery
> > >
> > > 2. Pre-vote, to support more than 3 controllers
> > > KIP-650 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics>: Enhance Kafkaesque Raft semantics
> > >
> > > 3. JBOD support
> > > KIP-858 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft>: Handle JBOD broker disk failure in KRaft
> > >
> > > 4. Scale up/down controllers
> > > KIP-853 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes>: KRaft Controller Membership Changes
> > >
> > > 5. Modifying dynamic configurations on the KRaft controller
> > >
> > > 6. Critical bugs in KRaft
> > >
> > > Does this make sense?
> > > Any feedback is welcome.
> > >
> > > Thank you.
> > > Luke

--
Josep Prat
Open Source Engineering Director, Aiven
josep.p...@aiven.io | +491715557497
aiven.io <https://www.aiven.io>

Aiven Deutschland GmbH
Alexanderufer 3-7, 10117 Berlin
Geschäftsführer: Oskari Saarenmaa & Hannu Valtonen
Amtsgericht Charlottenburg, HRB 209739 B
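As an aside, the "4.0-blocker" list Luke links above can also be pulled programmatically instead of through the web UI. A minimal sketch using the generic JIRA REST search endpoint with the same JQL (nothing here is Kafka-specific beyond the label name taken from the thread; the helper name and field selection are illustrative):

```python
# Build a JIRA REST search URL for tickets carrying the "4.0-blocker" label.
# The /rest/api/2/search endpoint and its jql/fields/maxResults parameters
# are the standard JIRA REST API; only the label comes from the thread.
from urllib.parse import urlencode

JIRA_SEARCH = "https://issues.apache.org/jira/rest/api/2/search"

def blocker_query_url(label: str = "4.0-blocker", max_results: int = 50) -> str:
    """Return the search URL whose JSON response lists the labeled tickets."""
    params = {
        "jql": f"labels = {label}",      # same JQL as the web UI link above
        "fields": "key,summary,status",  # keep the response small
        "maxResults": max_results,
    }
    return f"{JIRA_SEARCH}?{urlencode(params)}"

print(blocker_query_url())
```

Fetching that URL (e.g. with `curl` or `urllib.request`) returns the tickets as JSON, which makes it easy to track how the blocker list shrinks as 4.0 approaches.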