Hi all,

We have also thought in a bit more detail about transactions and queues. Here's the plan.

*Transactions*

The support for transactions in *classic topics* is based on precise interactions between three actors: clients (mostly producers, but also consumers), brokers (ReplicaManager and other classes), and transaction coordinators. Brokers also run partition leaders with their local state (ProducerStateManager and others).

The high-level workflow (some details skipped) is the following. When a transactional Produce request is received by the broker:

1. For each partition, the partition leader checks if a non-empty transaction is running for this partition. This is done using its local state derived from the log metadata (ProducerStateManager, VerificationStateEntry, VerificationGuard).
2. The transaction coordinator is informed about all the partitions that aren’t part of the transaction to include them.
3. The partition leaders do additional transactional checks.
4. The partition leaders append the transactional data to their logs and update some of their state (for example, log the fact that the transaction is running for the partition and its first offset).

When the transaction is committed or aborted:

1. The producer contacts the transaction coordinator directly with EndTxnRequest.
2. The transaction coordinator writes PREPARE_COMMIT or PREPARE_ABORT to its log and responds to the producer.
3. The transaction coordinator sends WriteTxnMarkersRequest to the leaders of the involved partitions.
4. The partition leaders write the transaction markers to their logs and respond to the coordinator.
5. The coordinator writes the final transaction state COMPLETE_COMMIT or COMPLETE_ABORT.

In classic topics, partitions have leaders, and much of the important state necessary for supporting this workflow is local. The main challenge in mapping this to Diskless comes from the fact that there are no partition leaders, so the corresponding pieces of state need to be globalized in the batch coordinator. We are already doing this to support idempotent produce.

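To make the classic commit/abort sequence above concrete, here is a minimal sketch of the coordinator-side state progression. It is an illustration only, not Kafka's actual code: the class `EndTxnSketch`, the `ONGOING` state name, and the use of plain strings for partitions are assumptions made for this example; only the PREPARE_* and COMPLETE_* state names come from the description above.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the coordinator-side end-of-transaction sequence. */
final class EndTxnSketch {
    // ONGOING is an assumed name for "transaction in progress" in this sketch.
    enum State { ONGOING, PREPARE_COMMIT, PREPARE_ABORT, COMPLETE_COMMIT, COMPLETE_ABORT }

    private State state = State.ONGOING;
    // Partitions whose leaders have not yet acknowledged the transaction marker.
    private final Set<String> pendingMarkers;

    EndTxnSketch(Set<String> partitions) {
        this.pendingMarkers = new HashSet<>(partitions);
    }

    /** Steps 1-2: EndTxnRequest arrives and the PREPARE_* state is recorded. */
    void endTxn(boolean commit) {
        if (state != State.ONGOING) {
            throw new IllegalStateException("transaction is already ending: " + state);
        }
        state = commit ? State.PREPARE_COMMIT : State.PREPARE_ABORT;
        completeIfAllMarkersWritten();
    }

    /** Steps 3-4: a partition leader reports that it wrote the marker. */
    void markerWritten(String partition) {
        if (state != State.PREPARE_COMMIT && state != State.PREPARE_ABORT) {
            throw new IllegalStateException("no markers expected in state " + state);
        }
        pendingMarkers.remove(partition);
        completeIfAllMarkersWritten();
    }

    State state() {
        return state;
    }

    /** Step 5: once every marker is acknowledged, record the final state. */
    private void completeIfAllMarkersWritten() {
        if (!pendingMarkers.isEmpty()) {
            return;
        }
        state = (state == State.PREPARE_COMMIT) ? State.COMPLETE_COMMIT : State.COMPLETE_ABORT;
    }
}
```

The point of the sketch is that the final COMPLETE_* state depends on acknowledgements from every involved partition leader, which is exactly the part that has no direct equivalent when there are no leaders.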
The high-level workflow for *diskless topics* would look very similar:

1. For each partition, the broker checks if a non-empty transaction is running for this partition. In contrast to classic topics, this is checked against the batch coordinator with a single RPC. Since a transaction can be uniquely identified by producer ID and epoch, a positive result of this check could be cached locally (for double the configured duration of a transaction, for example).
2. The same as for classic topics: the transaction coordinator is informed about all the partitions that aren’t part of the transaction to include them.
3. No transactional checks are done on the broker side.
4. The broker appends the transactional data to the current shared WAL segment. It doesn’t update any transaction-related state for Diskless topics, because it doesn’t have any.
5. The WAL segment is committed to the batch coordinator like in the normal produce flow.
6. The batch coordinator does the final transactional checks of the batches. This procedure would return the same errors as the partition leader in classic topics would, i.e. some batches could be rejected. This means there will potentially be garbage in the WAL segment file in case of transactional errors. This is preferable to doing more network round trips, especially considering the WAL segments will be relatively short-lived (see Greg's update above).

When the transaction is committed or aborted:

1. The producer contacts the transaction coordinator directly with EndTxnRequest.
2. The transaction coordinator writes PREPARE_COMMIT or PREPARE_ABORT to its log and responds to the producer.
3. *[NEW]* The transaction coordinator informs the batch coordinator that the transaction is finished.
4. *[NEW]* The batch coordinator saves that the transaction is finished and also inserts the control batches in the corresponding logs of the involved Diskless topics. This happens only on the metadata level; no actual control batches are written to any file.
They will be dynamically created on Fetch and other read operations. We could technically write these control batches for real, but this would mean extra produce latency, so it's better just to mark them in the batch coordinator and save these milliseconds.
5. The transaction coordinator sends WriteTxnMarkersRequest to the leaders of the involved partitions (now only for classic topics).
6. The partition leaders of classic topics write the transaction markers to their logs and respond to the coordinator.
7. The coordinator writes the final transaction state COMPLETE_COMMIT or COMPLETE_ABORT.

Compared to the non-transactional produce flow, we get:

1. An extra network round trip between brokers and the batch coordinator when a new partition appears in the transaction. To mitigate the impact of these round trips:
   - The results will be cached.
   - The calls for multiple partitions in one Produce request will be grouped.
   - The batch coordinator should be optimized for fast response to these RPCs.
   - The fact that a single producer normally will communicate with a single broker for the duration of the transaction further reduces the expected number of round trips.
2. An extra round trip between the transaction coordinator and batch coordinator when a transaction is finished.

With this proposal, transactions will also be able to span both classic and Diskless topics.

*Queues*

The share group coordination and management is a side job that doesn't interfere with the topic itself (leadership, replicas, physical storage of records, etc.) or with non-queue producers and consumers (Fetch and Produce RPCs and consumer group-related RPCs are not affected). We don't see any reason why we can't make Diskless topics compatible with share groups the same way as classic topics are. Even on the code level, we don't expect any serious refactoring: share groups read through the same routines that are used for fetching (e.g. ReplicaManager.readFromLog).

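Returning to the transactional produce path for Diskless topics: the caching and grouping of the broker-to-batch-coordinator check could look roughly like the sketch below. Everything named here is hypothetical (`TxnVerificationCacheSketch`, the `BatchCoordinatorClient` interface and its method, partitions as plain strings); the only parts taken from the plan are that positive results are cached per producer ID and epoch, that the cache lifetime is double the transaction duration, and that all uncached partitions of one request go out in a single grouped call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical broker-side cache of positive transaction checks. */
final class TxnVerificationCacheSketch {
    /** Stand-in for the grouped RPC to the batch coordinator (assumed shape). */
    interface BatchCoordinatorClient {
        /** Returns the partitions that have a running non-empty transaction. */
        Set<String> partitionsWithRunningTxn(long producerId, short epoch, List<String> partitions);
    }

    private final BatchCoordinatorClient client;
    private final long ttlMs;
    // Cache key -> expiry timestamp of a positive check result.
    private final Map<String, Long> expiresAtMs = new HashMap<>();

    TxnVerificationCacheSketch(BatchCoordinatorClient client, long transactionTimeoutMs) {
        this.client = client;
        this.ttlMs = 2 * transactionTimeoutMs; // double the configured transaction duration
    }

    /** Returns the partitions not confirmed as part of the running transaction. */
    List<String> unverifiedPartitions(long producerId, short epoch, List<String> partitions, long nowMs) {
        List<String> misses = new ArrayList<>();
        for (String partition : partitions) {
            Long expiry = expiresAtMs.get(key(producerId, epoch, partition));
            if (expiry == null || expiry <= nowMs) {
                misses.add(partition);
            }
        }
        if (misses.isEmpty()) {
            return misses; // fully served from the cache, no round trip
        }
        // One grouped call for all cache misses of this Produce request.
        Set<String> running = client.partitionsWithRunningTxn(producerId, epoch, misses);
        List<String> unverified = new ArrayList<>();
        for (String partition : misses) {
            if (running.contains(partition)) {
                // Only positive results are cached.
                expiresAtMs.put(key(producerId, epoch, partition), nowMs + ttlMs);
            } else {
                unverified.add(partition);
            }
        }
        return unverified;
    }

    private static String key(long producerId, short epoch, String partition) {
        return producerId + ":" + epoch + ":" + partition;
    }
}
```

In this sketch the partitions returned by `unverifiedPartitions` are the ones that would then be reported to the transaction coordinator in step 2. Negative results are deliberately not cached, since a partition can join the transaction at any moment.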
Should the KIPs be modified to include this, or is it too implementation-focused?

Best regards,
Ivan

On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
> Hi all,
>
> Thank you all for your questions and design input on KIP-1150.
>
> We have just updated KIP-1150 and KIP-1163 with a new design. To summarize
> the changes:
>
> 1. The design prioritizes integrating with the existing KIP-405 Tiered
> Storage interfaces, permitting data produced to a Diskless topic to be
> moved to tiered storage.
> This lowers the scalability requirements for the Batch Coordinator
> component, and allows Diskless to compose with Tiered Storage plugin
> features such as encryption and alternative data formats.
>
> 2. Consumer fetches are now served from local segments, making use of the
> indexes, page cache, request purgatory, and zero-copy functionality already
> built into classic topics.
> However, local segments are now considered cache elements, do not need to
> be durably stored, and can be built without contacting any other replicas.
>
> 3. The design has been simplified substantially, by removing the previous
> Diskless consume flow, distributed cache component, and "object
> compaction/merging" step.
>
> The design maintains leaderless produces as enabled by the Batch
> Coordinator, and the same latency profiles as the earlier design, while
> being simpler and integrating better into the existing ecosystem.
>
> Thanks, and we are eager to hear your feedback on the new design.
> Greg Harris
>
> On Mon, Jul 21, 2025 at 3:30 PM Jun Rao <j...@confluent.io.invalid> wrote:
>
> > Hi, Jan,
> >
> > For me, the main gap of KIP-1150 is the support of all existing client
> > APIs. Currently, there is no design for supporting APIs like transactions
> > and queues.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
> > <jan.siekier...@kentra.io.invalid> wrote:
> >
> > > Would it be a good time to ask for the current status of this KIP?
> > > I haven't seen much activity here for the past 2 months, the vote got
> > > vetoed but I think the pending questions have been answered since then.
> > > KIP-1183 (AutoMQ's proposal) also didn't have any activity since May.
> > >
> > > In my eyes KIP-1150 and KIP-1183 are two real choices that can be
> > > made, with a coordinator-based approach being by far the dominant one
> > > when it comes to market adoption - but all these are standalone products.
> > >
> > > I'm a big fan of both approaches, but would hate to see a stall. So the
> > > question is: can we get an update?
> > >
> > > Maybe it's time to start another vote? Colin McCabe - have your questions
> > > been answered? If not, is there anything I can do to help? I'm deeply
> > > familiar with both architectures and have written about both?
> > >
> > > Kind regards,
> > > Jan
> > >
> > > On Tue, Jun 24, 2025 at 10:42 AM Stanislav Kozlovski <
> > > stanislavkozlov...@apache.org> wrote:
> > >
> > > > I have some nits - it may be useful to
> > > >
> > > > a) group all the KIP email threads in the main one (just a bunch of
> > > > links to everything)
> > > > b) create the email threads
> > > >
> > > > It's a bit hard to track it all - for example, I was searching for a
> > > > discuss thread for KIP-1165 for a while; As far as I can tell, it
> > > > doesn't exist yet.
> > > >
> > > > Since the KIPs are published (by virtue of having the root KIP be
> > > > published, having a DISCUSS thread and links to sub-KIPs where were
> > > > aimed to move the discussion towards), I think it would be good to
> > > > create DISCUSS threads for them all.
> > > >
> > > > Best,
> > > > Stan
> > > >
> > > > On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > > Hi Kafka Devs!
> > > > >
> > > > > We want to start a new KIP discussion about introducing a new type of
> > > > > topics that would make use of Object Storage as the primary source of
> > > > > storage.
> > > > > However, as this KIP is big we decided to split it into multiple
> > > > > related KIPs.
> > > > > We have the motivational KIP-1150 (
> > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > )
> > > > > that aims to discuss if Apache Kafka should aim to have this type of
> > > > > feature at all. This KIP doesn't go onto details on how to implement
> > > > > it.
> > > > > This follows the same approach used when we discussed KRaft.
> > > > >
> > > > > But as we know that it is sometimes really hard to discuss on that
> > > > > meta level, we also created several sub-kips (linked in KIP-1150) that
> > > > > offer an implementation of this feature.
> > > > >
> > > > > We kindly ask you to use the proper DISCUSS threads for each type of
> > > > > concern and keep this one to discuss whether Apache Kafka wants to
> > > > > have this feature or not.
> > > > >
> > > > > Thanks in advance on behalf of all the authors of this KIP.
> > > > >
> > > > > ------------------
> > > > > Josep Prat
> > > > > Open Source Engineering Director, Aiven
> > > > > josep.p...@aiven.io | +491715557497 | aiven.io
> > > > > Aiven Deutschland GmbH
> > > > > Alexanderufer 3-7, 10117 Berlin
> > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > > > > Anna Richardson, Kenneth Chen
> > > > > Amtsgericht Charlottenburg, HRB 209739 B