Hi Ivan,

Thanks for the detailed reply.

On the discussion of using KIP-1123 to address the producer->broker across-AZ traffic cost: I feel KIP-1123 would remove most of the across-AZ traffic between producer and broker. Topics that don't have enough partitions to distribute are usually low-volume topics, so the traffic cost on those topics doesn't matter that much. The topics where partitioning matters are the keyed topics, and for those you would need to design something that keeps the routing from message key to partition or broker sticky. The sequence numbers in the produce requests of an idempotent producer are also something you need to worry about.
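To make the stickiness point concrete, something along these lines could work on the producer side (a rough sketch only; the class name and the "client.rack" producer setting are made up and not part of KIP-1123, while Partitioner, Cluster, PartitionInfo and Utils are the existing client APIs):

import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.utils.Utils;

// Rough sketch, not a real KIP-1123 class: keyed records keep the usual
// hash-based mapping (so per-key ordering and idempotent-producer sequence
// numbers stay tied to one partition), while unkeyed records prefer a
// partition whose leader is in the producer's own rack/AZ.
public class RackPreferringPartitioner implements Partitioner {

    private String clientRack;

    @Override
    public void configure(Map<String, ?> configs) {
        // "client.rack" here is a made-up producer-side setting for illustration.
        Object rack = configs.get("client.rack");
        this.clientRack = rack == null ? null : rack.toString();
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);
        if (keyBytes != null) {
            // Sticky key -> partition mapping, same idea as the default partitioner.
            return Utils.toPositive(Utils.murmur2(keyBytes)) % partitions.size();
        }
        if (clientRack != null) {
            // Unkeyed records: pick any partition whose leader sits in the local rack.
            List<PartitionInfo> local = partitions.stream()
                    .filter(p -> p.leader() != null && clientRack.equals(p.leader().rack()))
                    .collect(Collectors.toList());
            if (!local.isEmpty()) {
                return local.get(ThreadLocalRandom.current().nextInt(local.size())).partition();
            }
        }
        return partitions.get(ThreadLocalRandom.current().nextInt(partitions.size())).partition();
    }

    @Override
    public void close() {
    }
}

The point is only that keyed traffic cannot simply be redirected to the local rack without breaking the key-to-partition stickiness, and the broker-side sequence number checks for idempotent producers are per partition, so any broker-level routing scheme would have to account for them separately.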
Kafka is historically a very leader-centric system; many features are implemented around a leader or a coordinator. Switching to a leaderless design means many of those subsystems need to be reworked, which increases the scope and complexity of the project. If you reduce the scope to just saving the across-AZ traffic cost of inter-broker replication, the problem is simplified a lot: the gist would be to use S3 as the intermediary between brokers, and the leader broker can handle that without a central batch coordinator.
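For what it's worth, that scope-reduced idea could look roughly like this (pure illustration; ObjectStore, the key layout and the method names are all made up and are not part of the current KIP text): the leader keeps assigning offsets and appending locally, but replication ships only a small object reference, while the batch bytes travel through the object store in each follower's own AZ.

// Illustration only: "S3 as the intermediary" for inter-broker replication,
// with the existing partition leader still in charge (no batch coordinator).
interface ObjectStore {
    void put(String key, byte[] data);
    byte[] get(String key);
}

final class LeaderReplicationPath {
    private final ObjectStore store;

    LeaderReplicationPath(ObjectStore store) {
        this.store = store;
    }

    /** Leader appends locally, uploads the batch once, and only the key is replicated. */
    String onProduce(String topicPartition, long baseOffset, byte[] recordBatch) {
        appendToLocalLog(topicPartition, baseOffset, recordBatch);
        String objectKey = topicPartition + "/" + baseOffset;   // e.g. "orders-0/1000"
        store.put(objectKey, recordBatch);                      // single upload from the leader's AZ
        return objectKey;                                       // shipped to followers as metadata
    }

    private void appendToLocalLog(String tp, long baseOffset, byte[] batch) { /* local append */ }
}

final class FollowerReplicationPath {
    private final ObjectStore store;

    FollowerReplicationPath(ObjectStore store) {
        this.store = store;
    }

    /** Follower pulls the bytes from the bucket in its own AZ instead of from the leader. */
    void onReplicateMetadata(String topicPartition, long baseOffset, String objectKey) {
        byte[] recordBatch = store.get(objectKey);
        appendToLocalLog(topicPartition, baseOffset, recordBatch);
    }

    private void appendToLocalLog(String tp, long baseOffset, byte[] batch) { /* local append */ }
}

The cross-AZ transfer of the batch bytes is replaced by one upload plus AZ-local downloads, while offset assignment, the high watermark and the rest of the leader-side protocol stay where they are today.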
On 2025/09/08 15:43:31 Ivan Yurchenko wrote:
> Hi Haiying!
>
> Here's our answer to your questions:
>
> > HC1: The design is aiming for both diskless and leaderless. It is probably better to focus on one problem in the KIP. I think with KIP-392 (Fetch from follower) and KIP-1123 (Rack aware partitioning for Kafka Producer), both producer and consumer can read/write to the broker in the same AZ to avoid the across-AZ cost. The leader broker is no longer the blocker component in the read/write paths (The client can now read/write to a different broker). By removing the leader broker in the revised KIP-1150 design, you would need to move some logic (e.g. offset assignment) which was originally in the leader to the batch coordinator, this adds to the latency and complicated the logic (those logic are easier to implement in a single leader broker). In the current Kafka design, the control is distributed in many leader brokers with no hotspots for the cluster. But KIP-1150 is moving the distributed control into a central component: batch coordinator which adds a hotspot/bottleneck for the cluster;
>
> I’d like to clarify on KIP-1123: unfortunately, it doesn’t allow writing to any broker. Classic partitions still have leaders which are the only point where write could be done. KIP-1123 allows the producer to pick a partition in the rack-aware manner, but this comes with obvious limitations: 1) it’s suitable only for use cases where the partition doesn’t matter and could be picked arbitrarily; 2) there must be enough partitions (i.e. single partition topics are not covered, for example).
>
> Having said that, we do acknowledge that leaderless design comes with increased latency and some extra complexity. However, we see this truly leaderless approach as the only way to eliminate producer inter-AZ traffic for all use cases.
>
> > HC2: Although the revised KIP-1150 is reusing page cache and local log segments in the follower broker to avoid designing another caching system, the page cache and local log segment files are built much later in the follower where in the current Kafka the page cache was built on the leader broker when the produce data arrives. This affects when the consumer will be able to read the data if the consumer was connecting to the original leader broker;
>
> That’s true that performance characteristics of the proposed design won’t be able to match those of the classic Kafka topics. Sometimes, consumers will have to incur extra latency for reading from the remote storage. Given the benefits of KIP-1150, we think this is an acceptable trade-off.
>
> Additionally, the Kafka protocol keeps consumers behind the High Water Mark, so data which is present on disk/in-cache but not stored durably (e.g. on the leader node prior to replication/upload) is not permitted to be delivered to consumers and cannot affect the end-to-end latency.
>
> > HC3: The latency on acks=1 produce is longer than current Kafka since the producer needs to wait longer. Acks=1 performs much better (comparing to acks=all) for use cases which can tolerate occasional message loss;
>
> We’re looking for ways to improve the acks=1 performance and we have some ideas which we’ll publish as a separate KIP. However, for now we only acknowledge this and would like to emphasize that KIP-1150 doesn’t have a goal of matching the classic Kafka topic performance characteristics with the leaderless design, we generally don’t see this possible.
>
> > HC4: Although the revised design is leaderless, but there is still a leader concept when it comes to uploading closed log segment to tiered storage, how was that leader elected and issues surrounding leadership switch?
>
> This is correct, leadership for tasks like this will exist. This is not a strict requirement, but it’ll prevent excessive resource utilization without using complex coordination. It could have been a job queue with locks or a similar mechanism, but Kafka already has the leader election mechanism, which we piggyback on. We think having leaders for background tasks still keeps the design effectively leaderless from the client point of view.
>
> > HC5: The messaging order is only maintained with the same broker. When the producer cares about the message ordering and use message key to order messages on the same key, I guess the producer client needs to always send the messages with the same key to the same broker. How is that consistent routing implemented given there is no leader broker concept anymore. And even if those messages are routed to the same broker but at different times, how is the ordering maintained?
>
> Please correct me if I misunderstood, but as I read your question: no, the messaging order is global, and the producer technically could send every Produce request to an arbitrary broker without violating ordering, idempotence, or transaction guarantees. No forwarding required either.
>
> > HC6: For the topic-based batch coordinator, does the read-only coordinator live on each broker? If so, there will be a big fan-out read from that metadata topic.
>
> We’re still finalizing the topic-based batch coordinator design, but the partition count and replication factor of the underlying topic will be configurable, and we expect that the load and fan-out on any particular batch coordinator instance will be manageable.
>
> > HC7: For the topic-based batch coordinator, is the embedded SQLLite engine always needed if the size of the metadata topic is contained?
>
> If this state is stored as Java data structures in memory, everything is relatively easy to program. However, having gigabytes of this metadata wouldn’t be improbable even on moderate workloads, so for this to be practical it must be stored and operated on disk. Programming this by hand is challenging and effectively will result in re-implementation on something like SQLite or other embedded DBMS. We generally don’t see any issue with SQLite as it’s widely adopted across the industry, well-known for its reliability and hard-to-beat performance characteristics.
> Having said this, we still see SQLite as an implementation detail and could consider other options if the community is strongly against using it.
>
> Best,
> Ivan
>
>
> On Thu, Sep 4, 2025, at 08:40, Haiying Cai wrote:
> > Greg,
> >
> > Thanks for the revisions on KIP-1150 and KIP-1163. I like the idea of reusing KIP-405 tiered storage and Kafka’s strength on using page cache and local log segment file which greatly simplifies the design and implementation.
> >
> > I have a few questions:
> >
> > HC1: The design is aiming for both diskless and leaderless. It is probably better to focus on one problem in the KIP. I think with KIP-392 (Fetch from follower) and KIP-1123 (Rack aware partitioning for Kafka Producer), both producer and consumer can read/write to the broker in the same AZ to avoid the across-AZ cost. The leader broker is no longer the blocker component in the read/write paths (The client can now read/write to a different broker). By removing the leader broker in the revised KIP-1150 design, you would need to move some logic (e.g. offset assignment) which was originally in the leader to the batch coordinator, this adds to the latency and complicated the logic (those logic are easier to implement in a single leader broker). In the current Kafka design, the control is distributed in many leader brokers with no hotspots for the cluster. But KIP-1150 is moving the distributed control into a central component: batch coordinator which adds a hotspot/bottleneck for the cluster;
> >
> > HC2: Although the revised KIP-1150 is reusing page cache and local log segments in the follower broker to avoid designing another caching system, the page cache and local log segment files are built much later in the follower where in the current Kafka the page cache was built on the leader broker when the produce data arrives. This affects when the consumer will be able to read the data if the consumer was connecting to the original leader broker;
> >
> > HC3: The latency on acks=1 produce is longer than current Kafka since the producer needs to wait longer. Acks=1 performs much better (comparing to acks=all) for use cases which can tolerate occasional message loss;
> >
> > HC4: Although the revised design is leaderless, but there is still a leader concept when it comes to uploading closed log segment to tiered storage, how was that leader elected and issues surrounding leadership switch?
> >
> > HC5: The messaging order is only maintained with the same broker. When the producer cares about the message ordering and use message key to order messages on the same key, I guess the producer client needs to always send the messages with the same key to the same broker. How is that consistent routing implemented given there is no leader broker concept anymore. And even if those messages are routed to the same broker but at different times, how is the ordering maintained?
> >
> > HC6: For the topic-based batch coordinator, does the read-only coordinator live on each broker? If so, there will be a big fan-out read from that metadata topic.
> >
> > HC7: For the topic-based batch coordinator, is the embedded SQLLite engine always needed if the size of the metadata topic is contained?
> >
> > On 2025/09/03 19:59:48 Greg Harris wrote:
> > > Hi all,
> > >
> > > Thank you all for your questions and design input on KIP-1150.
> > >
> > > We have just updated KIP-1150 and KIP-1163 with a new design. To summarize the changes:
> > >
> > > 1. The design prioritizes integrating with the existing KIP-405 Tiered Storage interfaces, permitting data produced to a Diskless topic to be moved to tiered storage. This lowers the scalability requirements for the Batch Coordinator component, and allows Diskless to compose with Tiered Storage plugin features such as encryption and alternative data formats.
> > >
> > > 2. Consumer fetches are now served from local segments, making use of the indexes, page cache, request purgatory, and zero-copy functionality already built into classic topics. However, local segments are now considered cache elements, do not need to be durably stored, and can be built without contacting any other replicas.
> > >
> > > 3. The design has been simplified substantially, by removing the previous Diskless consume flow, distributed cache component, and "object compaction/merging" step.
> > >
> > > The design maintains leaderless produces as enabled by the Batch Coordinator, and the same latency profiles as the earlier design, while being simpler and integrating better into the existing ecosystem.
> > >
> > > Thanks, and we are eager to hear your feedback on the new design.
> > > Greg Harris
> > >
> > > On Mon, Jul 21, 2025 at 3:30 PM Jun Rao <[email protected]> wrote:
> > >
> > > > Hi, Jan,
> > > >
> > > > For me, the main gap of KIP-1150 is the support of all existing client APIs. Currently, there is no design for supporting APIs like transactions and queues.
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski <[email protected]> wrote:
> > > >
> > > > > Would it be a good time to ask for the current status of this KIP? I haven't seen much activity here for the past 2 months, the vote got vetoed but I think the pending questions have been answered since then. KIP-1183 (AutoMQ's proposal) also didn't have any activity since May.
> > > > >
> > > > > In my eyes KIP-1150 and KIP-1183 are two real choices that can be made, with a coordinator-based approach being by far the dominant one when it comes to market adoption - but all these are standalone products.
> > > > >
> > > > > I'm a big fan of both approaches, but would hate to see a stall. So the question is: can we get an update?
> > > > >
> > > > > Maybe it's time to start another vote? Colin McCabe - have your questions been answered? If not, is there anything I can do to help? I'm deeply familiar with both architectures and have written about both?
> > > > >
> > > > > Kind regards,
> > > > > Jan
> > > > >
> > > > > On Tue, Jun 24, 2025 at 10:42 AM Stanislav Kozlovski <[email protected]> wrote:
> > > > >
> > > > > > I have some nits - it may be useful to
> > > > > >
> > > > > > a) group all the KIP email threads in the main one (just a bunch of links to everything)
> > > > > > b) create the email threads
> > > > > >
> > > > > > It's a bit hard to track it all - for example, I was searching for a discuss thread for KIP-1165 for a while; As far as I can tell, it doesn't exist yet.
> > > > > >
> > > > > > Since the KIPs are published (by virtue of having the root KIP be published, having a DISCUSS thread and links to sub-KIPs where were aimed to move the discussion towards), I think it would be good to create DISCUSS threads for them all.
> > > > > >
> > > > > > Best,
> > > > > > Stan
> > > > > >
> > > > > > On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > > > > Hi Kafka Devs!
> > > > > > >
> > > > > > > We want to start a new KIP discussion about introducing a new type of topics that would make use of Object Storage as the primary source of storage. However, as this KIP is big we decided to split it into multiple related KIPs.
> > > > > > > We have the motivational KIP-1150 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics) that aims to discuss if Apache Kafka should aim to have this type of feature at all. This KIP doesn't go onto details on how to implement it. This follows the same approach used when we discussed KRaft.
> > > > > > >
> > > > > > > But as we know that it is sometimes really hard to discuss on that meta level, we also created several sub-kips (linked in KIP-1150) that offer an implementation of this feature.
> > > > > > >
> > > > > > > We kindly ask you to use the proper DISCUSS threads for each type of concern and keep this one to discuss whether Apache Kafka wants to have this feature or not.
> > > > > > >
> > > > > > > Thanks in advance on behalf of all the authors of this KIP.
> > > > > > >
> > > > > > > ------------------
> > > > > > > Josep Prat
> > > > > > > Open Source Engineering Director, Aiven
> > > > > > > [email protected] | +491715557497 | aiven.io
> > > > > > > Aiven Deutschland GmbH
> > > > > > > Alexanderufer 3-7, 10117 Berlin
> > > > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen, Anna Richardson, Kenneth Chen
> > > > > > > Amtsgericht Charlottenburg, HRB 209739 B
