Hi Luke,

We have been looking into what switching from ZK to KRaft will mean for Aiven.

We depend heavily on an “immutable infrastructure” model for deployments. This means that, when we perform upgrades, we introduce new nodes to our clusters, scale the cluster up to incorporate the new nodes, and then phase the old ones out once all partitions have been moved to the new generation. This allows us, and anyone else using a similar model, to do upgrades as well as cluster resizing with zero downtime.

What we have read about KRaft and the ZK-to-KRaft migration path is somewhat worrying for us. It seems that, if KIP-853 is not included prior to dropping support for ZK, we will essentially have no satisfactory upgrade path. Even if KIP-853 is included in 4.0, I’m unsure whether that would give us a migration path, since a new cluster generation would not be able to use ZK during the migration step.

On the other hand, if KIP-853 were released in a version prior to dropping ZK support, then, because it allows online resizing of KRaft clusters, it would allow us and others who use an immutable infrastructure deployment model to provide a zero-downtime migration path.

For that reason, we’d like to raise awareness of this issue and encourage treating the implementation of KIP-853, or an equivalent, as a blocker not only for 4.0 but for the last version prior to 4.0.

BR,
Anton

On 2023/10/11 12:17:23 Luke Chen wrote:
> Hi all,
>
> While Kafka 3.6.0 is released, I’d like to start the discussion for the
> “road to Kafka 4.0”. Based on the plan in KIP-833
> < https://cwiki.apache.org/confluence/display/KAFKA/KIP-833%3A+Mark+KRaft+as+Production+Ready#KIP833:MarkKRaftasProductionReady-Kafka3.7 >,
> the next release 3.7 will be the final release before moving to Kafka 4.0
> to remove the Zookeeper from Kafka. Before making this major change, I'd
> like to get consensus on the "must-have features/fixes for Kafka 4.0", to
> avoid some users being surprised when upgrading to Kafka 4.0. The intent is
> to have a clear communication about what to expect in the following months.
> In particular we should be signaling what features and configurations are
> not supported, or at risk (if no one is able to add support or fix known
> bugs).
>
> Here is the JIRA tickets list
> <https://issues.apache.org/jira/issues/?jql=labels%20%3D%204.0-blocker> I
> labeled for "4.0-blocker". The criteria I labeled as “4.0-blocker” are:
> 1. The feature is supported in Zookeeper Mode, but not supported in KRaft
> mode, yet (ex: KIP-858: JBOD in KRaft)
> 2. Critical bugs in KRaft, (ex: KAFKA-15489 : split brain in KRaft
> controller quorum)
>
> If you disagree with my current list, welcome to have discussion in the
> specific JIRA ticket. Or, if you think there are some tickets I missed,
> welcome to start a discussion in the JIRA ticket and ping me or other
> people. After we get the consensus, we can label/unlabel it afterwards.
> Again, the goal is to have an open communication with the community about
> what will be coming in 4.0.
>
> Below is the high level category of the list content:
>
> 1. Recovery from disk failure
> KIP-856
> < https://cwiki.apache.org/confluence/display/KAFKA/KIP-856:+KRaft+Disk+Failure+Recovery >:
> KRaft Disk Failure Recovery
>
> 2. Prevote to support controllers more than 3
> KIP-650
> < https://cwiki.apache.org/confluence/display/KAFKA/KIP-650%3A+Enhance+Kafkaesque+Raft+semantics >:
> Enhance Kafkaesque Raft semantics
>
> 3. JBOD support
> KIP-858
> < https://cwiki.apache.org/confluence/display/KAFKA/KIP-858%3A+Handle+JBOD+broker+disk+failure+in+KRaft >:
> Handle JBOD broker disk failure in KRaft
>
> 4. Scale up/down Controllers
> KIP-853
> < https://cwiki.apache.org/confluence/display/KAFKA/KIP-853%3A+KRaft+Controller+Membership+Changes >:
> KRaft Controller Membership Changes
>
> 5. Modifying dynamic configurations on the KRaft controller
>
> 6. Critical bugs in KRaft
>
> Does this make sense?
> Any feedback is welcomed.
>
> Thank you.
> Luke
>