Re: [DISCUSS] KIP-1150 Diskless Topics

Ivan Yurchenko Thu, 04 Sep 2025 00:00:14 -0700

Hi all,

We have also thought in a bit more detail about transactions and queues; 
here's the plan.

*Transactions*

The support for transactions in *classic topics* is based on precise 
interactions between three actors: clients (mostly producers, but also 
consumers), brokers (ReplicaManager and other classes), and transaction 
coordinators. Brokers also run partition leaders with their local state 
(ProducerStateManager and others). 

The high-level workflow (some details skipped) is the following. When a 
transactional Produce request is received by the broker:
1. For each partition, the partition leader checks if a non-empty transaction 
is running for this partition. This is done using its local state derived from 
the log metadata (ProducerStateManager, VerificationStateEntry, 
VerificationGuard).
2. The transaction coordinator is informed about all the partitions that aren’t 
yet part of the transaction, so that it can include them.
3. The partition leaders do additional transactional checks.
4. The partition leaders append the transactional data to their logs and update 
some of their state (for example, log the fact that the transaction is running 
for the partition and its first offset).

When the transaction is committed or aborted:
1. The producer contacts the transaction coordinator directly with 
EndTxnRequest.
2. The transaction coordinator writes PREPARE_COMMIT or PREPARE_ABORT to its 
log and responds to the producer.
3. The transaction coordinator sends WriteTxnMarkersRequest to the leaders of 
the involved partitions.
4. The partition leaders write the transaction markers to their logs and 
respond to the coordinator.
5. The coordinator writes the final transaction state COMPLETE_COMMIT or 
COMPLETE_ABORT.

In classic topics, partitions have leaders, and much of the important state 
necessary for supporting this workflow is local to them. The main challenge in 
mapping this to Diskless comes from the fact that there are no partition 
leaders, so the corresponding pieces of state need to be globalized in the 
batch coordinator. We are already doing this to support idempotent produce.

The high-level workflow for *diskless topics* would look very similar:
1. For each partition, the broker checks if a non-empty transaction is running 
for this partition. In contrast to classic topics, this is checked against the 
batch coordinator with a single RPC. Since a transaction can be uniquely 
identified by its producer ID and epoch, a positive result of this check could 
be cached locally (for example, for twice the configured duration of a 
transaction).
2. The same: The transaction coordinator is informed about all the partitions 
that aren’t part of the transaction to include them.
3. No transactional checks are done on the broker side.
4. The broker appends the transactional data to the current shared WAL segment. 
It doesn’t update any transaction-related state for Diskless topics, because it 
doesn’t have any.
5. The WAL segment is committed to the batch coordinator like in the normal 
produce flow.
6. The batch coordinator does the final transactional checks of the batches. 
This procedure would return the same errors as the partition leader would in 
classic topics, i.e. some batches could be rejected. This means that there 
will potentially be garbage in the WAL segment file in case of transactional 
errors. This is preferable to doing more network round trips, especially 
considering that the WAL segments will be relatively short-lived (see Greg's 
earlier update).
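
To illustrate the caching idea from step 1, here is a minimal sketch in 
Python. It is not taken from the KIP or from any Kafka code: the names 
VerificationCache and partitions_to_verify, the cache key, and the use of 
seconds are all assumptions made for the example. It only shows how a broker 
could skip the batch coordinator RPC for partitions it has already verified 
for a given producer ID and epoch.

```python
import time


class VerificationCache:
    """Hypothetical broker-local cache of positive 'transaction is running' checks."""

    def __init__(self, txn_duration_s, clock=time.monotonic):
        # Keep entries for twice the configured transaction duration,
        # as suggested in step 1 (an illustrative choice, not a spec).
        self._ttl = 2 * txn_duration_s
        self._clock = clock
        self._expiry = {}  # (producer_id, epoch, partition) -> expiry time

    def is_verified(self, producer_id, epoch, partition):
        key = (producer_id, epoch, partition)
        deadline = self._expiry.get(key)
        if deadline is None:
            return False
        if self._clock() >= deadline:
            # Entry is too old; drop it and force a fresh check.
            del self._expiry[key]
            return False
        return True

    def mark_verified(self, producer_id, epoch, partition):
        key = (producer_id, epoch, partition)
        self._expiry[key] = self._clock() + self._ttl


def partitions_to_verify(cache, producer_id, epoch, partitions):
    """Return the partitions that still need the batch coordinator RPC."""
    return [p for p in partitions
            if not cache.is_verified(producer_id, epoch, p)]
```

Only positive results are cached here, and a new producer epoch never hits an 
old entry because the epoch is part of the key.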

When the transaction is committed or aborted:
1. The producer contacts the transaction coordinator directly with 
EndTxnRequest.
2. The transaction coordinator writes PREPARE_COMMIT or PREPARE_ABORT to its 
log and responds to the producer.
3. *[NEW]* The transaction coordinator informs the batch coordinator that the 
transaction is finished.
4. *[NEW]* The batch coordinator saves that the transaction is finished and 
also inserts the control batches in the corresponding logs of the involved 
Diskless topics. This happens only on the metadata level, no actual control 
batches are written to any file. They will be dynamically created on Fetch and 
other read operations. We could technically write these control batches for 
real, but this would mean extra produce latency, so it's better just to mark 
them in the batch coordinator and save these milliseconds.
5. The transaction coordinator sends WriteTxnMarkersRequest to the leaders of 
the involved partitions, but now only for classic topics.
6. The partition leaders of classic topics write the transaction markers to 
their logs and respond to the coordinator.
7. The coordinator writes the final transaction state COMPLETE_COMMIT or 
COMPLETE_ABORT.
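
The metadata-only control batches from step 4 can be sketched as follows. 
This is a toy model in Python under stated assumptions, not the proposed 
implementation: PartitionMetadata, append_marker, fetch, and the 
one-offset-per-batch layout are hypothetical and chosen only to keep the 
example short. The point is that the marker occupies an offset in the 
partition's metadata while nothing is written to object storage, and the 
control batch is materialized when the log is read.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionMetadata:
    """Hypothetical batch coordinator view of one Diskless partition."""
    next_offset: int = 0
    # offset -> ("data", object_key) or ("marker", producer_id, outcome)
    entries: dict = field(default_factory=dict)

    def append_data(self, object_key):
        offset = self.next_offset
        self.entries[offset] = ("data", object_key)
        self.next_offset += 1
        return offset

    def append_marker(self, producer_id, outcome):
        # Only metadata is recorded; no control batch is written to any file.
        offset = self.next_offset
        self.entries[offset] = ("marker", producer_id, outcome)
        self.next_offset += 1
        return offset


def fetch(meta, start_offset, read_object):
    """Yield (offset, payload) pairs, creating control batches on the fly."""
    for offset in range(start_offset, meta.next_offset):
        entry = meta.entries[offset]
        if entry[0] == "data":
            # Real data is read from storage via the supplied callback.
            yield offset, read_object(entry[1])
        else:
            _, producer_id, outcome = entry
            # The marker is synthesized from metadata alone.
            yield offset, {"control": outcome, "producer_id": producer_id}
```

A reader calling fetch sees the marker at its offset exactly as if it had 
been appended to the log, which is what lets the commit path skip a write.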

Compared to the non-transactional produce flow, we get:
1. An extra network round trip between brokers and the batch coordinator when a 
new partition appears in the transaction. To mitigate its impact:
  - The results will be cached.
  - The calls for multiple partitions in one Produce request will be grouped.
  - The batch coordinator should be optimized for fast response to these RPCs.
  - The fact that a single producer normally will communicate with a single 
broker for the duration of the transaction further reduces the expected number 
of round trips.
2. An extra round trip between the transaction coordinator and batch 
coordinator when a transaction is finished.
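
As a rough illustration of how caching and grouping bound the number of extra 
round trips in item 1, here is a small Python sketch. The function name 
coordinator_round_trips and the model (one producer, one broker, one grouped 
RPC per Produce request that contains unseen partitions) are assumptions made 
for the example.

```python
def coordinator_round_trips(produce_requests):
    """Count batch coordinator verification RPCs for one transaction.

    `produce_requests` is a list of partition lists, one list per
    transactional Produce request handled by a single broker.
    """
    verified = set()
    rpcs = 0
    for partitions in produce_requests:
        new_partitions = set(partitions) - verified
        if new_partitions:
            # All new partitions of this request share one grouped RPC.
            rpcs += 1
            verified |= new_partitions
    return rpcs
```

For example, four Produce requests touching ["a-0", "a-1"], ["a-0"], 
["a-1", "b-0"] and ["b-0"] need two verification RPCs in this model, plus the 
single round trip from item 2 when the transaction ends.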

With this proposal, transactions will also be able to span both classic and 
Diskless topics.

*Queues*

Share group coordination and management is a side job that doesn't interfere 
with the topic itself (leadership, replicas, physical storage of records, 
etc.) or with non-queue producers and consumers (Fetch and Produce RPCs and 
consumer group-related RPCs are not affected). We don't see any reason why we 
can't make Diskless topics compatible with share groups the same way as classic 
topics are. Even on the code level, we don't expect any serious refactoring: 
the same reading routines are used as for regular fetching (e.g. 
ReplicaManager.readFromLog).

Should the KIPs be modified to include this, or is it too 
implementation-focused?

Best regards,
Ivan

On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
> Hi all,
> 
> Thank you all for your questions and design input on KIP-1150.
> 
> We have just updated KIP-1150 and KIP-1163 with a new design. To summarize
> the changes:
> 
> 1. The design prioritizes integrating with the existing KIP-405 Tiered
> Storage interfaces, permitting data produced to a Diskless topic to be
> moved to tiered storage.
> This lowers the scalability requirements for the Batch Coordinator
> component, and allows Diskless to compose with Tiered Storage plugin
> features such as encryption and alternative data formats.
> 
> 2. Consumer fetches are now served from local segments, making use of the
> indexes, page cache, request purgatory, and zero-copy functionality already
> built into classic topics.
> However, local segments are now considered cache elements, do not need to
> be durably stored, and can be built without contacting any other replicas.
> 
> 3. The design has been simplified substantially, by removing the previous
> Diskless consume flow, distributed cache component, and "object
> compaction/merging" step.
> 
> The design maintains leaderless produces as enabled by the Batch
> Coordinator, and the same latency profiles as the earlier design, while
> being simpler and integrating better into the existing ecosystem.
> 
> Thanks, and we are eager to hear your feedback on the new design.
> Greg Harris
> 
> On Mon, Jul 21, 2025 at 3:30 PM Jun Rao <[email protected]> wrote:
> 
> > Hi, Jan,
> >
> > For me, the main gap of KIP-1150 is the support of all existing client
> > APIs. Currently, there is no design for supporting APIs like transactions
> > and queues.
> >
> > Thanks,
> >
> > Jun
> >
> > On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
> > <[email protected]> wrote:
> >
> > > > Would it be a good time to ask for the current status of this KIP? I
> > > > haven't seen much activity here for the past 2 months, the vote got
> > > > vetoed but I think the pending questions have been answered since then.
> > > > KIP-1183 (AutoMQ's proposal) also didn't have any activity since May.
> > >
> > > In my eyes KIP-1150 and KIP-1183 are two real choices that can be
> > > made, with a coordinator-based approach being by far the dominant one
> > when
> > > it comes to market adoption - but all these are standalone products.
> > >
> > > I'm a big fan of both approaches, but would hate to see a stall. So the
> > > question is: can we get an update?
> > >
> > > Maybe it's time to start another vote? Colin McCabe - have your questions
> > > been answered? If not, is there anything I can do to help? I'm deeply
> > > familiar with both architectures and have written about both?
> > >
> > > Kind regards,
> > > Jan
> > >
> > > On Tue, Jun 24, 2025 at 10:42 AM Stanislav Kozlovski <
> > > [email protected]> wrote:
> > >
> > > > I have some nits - it may be useful to
> > > >
> > > > a) group all the KIP email threads in the main one (just a bunch of
> > links
> > > > to everything)
> > > > b) create the email threads
> > > >
> > > > It's a bit hard to track it all - for example, I was searching for a
> > > > discuss thread for KIP-1165 for a while; As far as I can tell, it
> > doesn't
> > > > exist yet.
> > > >
> > > > > Since the KIPs are published (by virtue of having the root KIP be
> > > > > published, having a DISCUSS thread and links to the sub-KIPs that the
> > > > > discussion was meant to move towards), I think it would be good to
> > > > > create DISCUSS threads for them all.
> > > >
> > > > Best,
> > > > Stan
> > > >
> > > > On 2025/04/16 11:58:22 Josep Prat wrote:
> > > > > Hi Kafka Devs!
> > > > >
> > > > > We want to start a new KIP discussion about introducing a new type of
> > > > > topics that would make use of Object Storage as the primary source of
> > > > > storage. However, as this KIP is big we decided to split it into
> > > multiple
> > > > > related KIPs.
> > > > > > We have the motivational KIP-1150 (
> > > > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
> > > > > > )
> > > > > > that aims to discuss if Apache Kafka should aim to have this type of
> > > > > > feature at all. This KIP doesn't go into details on how to implement
> > > > > > it. This follows the same approach used when we discussed KRaft.
> > > > >
> > > > > But as we know that it is sometimes really hard to discuss on that
> > meta
> > > > > level, we also created several sub-kips (linked in KIP-1150) that
> > offer
> > > > an
> > > > > implementation of this feature.
> > > > >
> > > > > We kindly ask you to use the proper DISCUSS threads for each type of
> > > > > concern and keep this one to discuss whether Apache Kafka wants to
> > have
> > > > > this feature or not.
> > > > >
> > > > > Thanks in advance on behalf of all the authors of this KIP.
> > > > >
> > > > > ------------------
> > > > > Josep Prat
> > > > > Open Source Engineering Director, Aiven
> > > > > [email protected]   |   +491715557497 | aiven.io
> > > > > Aiven Deutschland GmbH
> > > > > Alexanderufer 3-7, 10117 Berlin
> > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
> > > > > Anna Richardson, Kenneth Chen
> > > > > Amtsgericht Charlottenburg, HRB 209739 B
> > > > >
> > > >
> > >
> >
>
