I'm very sorry. It seems the mailing list is stripping the attachments.
I'll post the two links:
https://drive.google.com/file/d/1El1Kl2x8JYt3CxdwD0cZ-n6flzZ5s0Gc/view
https://drive.google.com/file/d/1SxfdZIDwimM9OTGYMHCrshFclCkRNqJm/view

Sorry for the noise on the list. I'll do better next time.

- Ivan


On Thu, Oct 2, 2025, at 14:11, Ivan Yurchenko wrote:
> Apologies, it seems the images didn't attach...
> There were only two, I'm attaching them to this message.
> Sorry for the inconvenience!
> 
> - Ivan
> 
> On Thu, Oct 2, 2025, at 14:06, Ivan Yurchenko wrote:
>> Hi dear Kafka community,
>> 
>> In the initial Diskless proposal, we proposed a separate component, the 
>> batch/diskless coordinator, whose role would be to centrally manage the 
>> batch and WAL file metadata for diskless topics. This component drew many 
>> reasonable comments from the community about how it would support various 
>> Kafka features (transactions, queues) and about its scalability. While we 
>> believe we have good answers to all the expressed concerns, we took a step 
>> back and looked at the problem from a different perspective.
>> 
>> We would like to propose an alternative Diskless design *without a 
>> centralized coordinator*. We believe this approach has potential and propose 
>> to discuss it, as it may be more appealing to the community.
>> 
>> Let us explain the idea. Most of the complications with the original 
>> Diskless approach come from one necessary architecture change: globalizing 
>> the local state of the partition leader in the batch coordinator. This 
>> causes deviations from the established workflows of various features like 
>> idempotent produce, transactions, queues, retention, etc. These deviations 
>> need to be carefully considered, designed, and later implemented and tested. 
>> In the new approach we avoid this by making partition leaders once again 
>> responsible for managing their partitions, even in diskless topics.
>> 
>> In classic Kafka topics, batch data and metadata are blended together in 
>> the partition log. The crux of the Diskless idea is to decouple them and 
>> move the data to remote storage, while keeping the metadata somewhere else. 
>> Using a central batch coordinator to manage batch metadata is one way to do 
>> this, but not the only one.
>> 
>> Let’s now think about managing metadata for each user partition 
>> independently. Generally, partitions are independent and share nothing 
>> apart from the fact that their data are mixed in WAL files. If we figure 
>> out how to commit and later delete WAL files safely, we will achieve the 
>> autonomy necessary to get rid of the central batch coordinator. 
>> Instead, *each diskless user partition will be managed by its leader*, as in 
>> classic Kafka topics. Also as in classic topics, the leader uses the 
>> partition log to persist batch metadata, i.e. the regular batch header plus 
>> the information about how to find the batch on remote storage. In contrast 
>> to classic topics, the batch data itself is on remote storage.
>> 
>> For clarity, let’s compare the three designs:
>> • Classic topics:
>>    • Data and metadata are co-located in the partition log.
>>    • The partition log content: [Batch header (metadata)|Batch data].
>>    • The partition log is replicated to the followers.
>>    • The replicas and leader have local state built from metadata.
>> • Original Diskless:
>>    • Metadata is in the batch coordinator, data is on remote storage.
>>    • The partition state is global in the batch coordinator.
>> • New Diskless:
>>    • Metadata is in the partition log, data is on remote storage.
>>    • Partition log content: [Batch header (metadata)|Batch coordinates on 
>> remote storage].
>>    • The partition log is replicated to the followers.
>>    • The replicas and leader have local state built from metadata.
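>> 
>> To make this layout concrete, below is a minimal sketch in Java of what a 
>> New Diskless partition log entry could carry. The names (BatchCoordinates, 
>> DisklessBatchEntry) are purely illustrative and not defined in the KIP.
>> 
>> // A minimal sketch, with illustrative names not defined in the KIP: what a
>> // diskless partition log entry could carry instead of the batch payload.
>> record BatchCoordinates(String walFileName, long byteOffset, int sizeBytes) {}
>> 
>> record DisklessBatchEntry(
>>     // the regular batch header (metadata) fields, as in classic topics
>>     long baseOffset,
>>     int recordCount,
>>     long producerId,
>>     short producerEpoch,
>>     int baseSequence,
>>     // plus where the actual batch bytes live on remote storage
>>     BatchCoordinates coordinates
>> ) {}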
>> 
>> Let’s consider the produce path. Here’s a reminder of the original 
>> Diskless design:
>> 
>> 
>> The new approach can be depicted as follows:
>> 
>> 
>> As you can see, the main difference is that instead of a single commit 
>> request to the batch coordinator, we now send multiple parallel commit 
>> requests to the leaders of all partitions involved in the WAL file. Each of 
>> them will commit its batches independently, without coordinating with other 
>> leaders or any other component. Batch data is addressed by the WAL file 
>> name, the byte offset, and the size, so a partition needs to know nothing 
>> about other partitions to access its own data in shared WAL files.
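>> 
>> To illustrate this addressing scheme, here is a minimal sketch of how a 
>> broker could read a single batch from a shared WAL file using only its own 
>> coordinates. It assumes S3 as the remote storage and the AWS SDK v2; actual 
>> Diskless code would go through the storage plugin interface instead.
>> 
>> // A minimal sketch, assuming S3 and the AWS SDK v2.
>> import software.amazon.awssdk.services.s3.S3Client;
>> import software.amazon.awssdk.services.s3.model.GetObjectRequest;
>> 
>> public class WalBatchReader {
>>     private final S3Client s3 = S3Client.create();
>> 
>>     // Reads exactly one batch out of a shared WAL file: only the file name,
>>     // byte offset, and size are needed, nothing about other partitions.
>>     public byte[] readBatch(String bucket, String walFileName,
>>                             long byteOffset, int sizeBytes) {
>>         GetObjectRequest request = GetObjectRequest.builder()
>>             .bucket(bucket)
>>             .key(walFileName)
>>             .range("bytes=" + byteOffset + "-" + (byteOffset + sizeBytes - 1))
>>             .build();
>>         return s3.getObjectAsBytes(request).asByteArray();
>>     }
>> }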
>> 
>> The number of partitions involved in a single WAL file may be quite large, 
>> e.g. a hundred. A hundred network requests to commit one WAL file would be 
>> very impractical. However, there are ways to reduce this number:
>> 1. Partition leaders are located on brokers. Requests to leaders on one 
>> broker could be grouped together into a single physical network request 
>> (resembling the normal Produce request, which may carry batches for many 
>> partitions inside); see the sketch after this list. This caps the number of 
>> network requests at the number of brokers in the cluster.
>> 2. If we craft the cluster metadata to make producers send their requests to 
>> the right brokers (with respect to AZs), we may achieve a higher 
>> concentration of logical commit requests in physical network requests, 
>> reducing the number of the latter even further, ideally to one.
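>> 
>> Here is a minimal sketch of the grouping from point 1. All names 
>> (PartitionCommit, leaderBrokerId, sendCommitRequest) are illustrative and 
>> not part of the KIP or the existing Kafka code base.
>> 
>> // A minimal sketch of grouping per-partition commits by leader broker,
>> // so that one physical request per broker carries all logical commits.
>> import java.util.List;
>> import java.util.Map;
>> import java.util.stream.Collectors;
>> 
>> public class WalCommitFanout {
>>     record PartitionCommit(String topic, int partition, long byteOffset, int sizeBytes) {}
>> 
>>     // Hypothetical lookup of the broker currently hosting the partition leader.
>>     int leaderBrokerId(String topic, int partition) { return 0; }
>> 
>>     // Hypothetical RPC carrying all logical commits destined for one broker.
>>     void sendCommitRequest(int brokerId, String walFileName, List<PartitionCommit> commits) {}
>> 
>>     // One physical request per broker instead of one per partition.
>>     public void commitWalFile(String walFileName, List<PartitionCommit> commits) {
>>         Map<Integer, List<PartitionCommit>> byBroker = commits.stream()
>>             .collect(Collectors.groupingBy(c -> leaderBrokerId(c.topic(), c.partition())));
>>         byBroker.forEach((brokerId, group) -> sendCommitRequest(brokerId, walFileName, group));
>>     }
>> }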
>> 
>> Obviously, some of these commit requests may fail or time out for a 
>> variety of reasons. This is fine. Some producers will receive totally or 
>> partially failed responses to their Produce requests, similar to what they 
>> would receive when an append to a classic topic fails or times out. If 
>> a partition experiences problems, other partitions will not be affected 
>> (again, as in classic topics). Of course, the uncommitted data will be 
>> garbage in WAL files. But WAL files are short-lived (batches are constantly 
>> assembled into segments and offloaded to tiered storage), so this garbage 
>> will eventually be deleted.
>> 
>> For safely deleting WAL files, we now need to manage them centrally, as this 
>> is the only state and logic that spans multiple partitions. On the diagram, 
>> you can see another commit request called “Commit file (best effort)” going 
>> to the WAL File Manager. This manager will be responsible for the following:
>> 1. Collecting (by requests from brokers) and persisting information about 
>> committed WAL files.
>> 2. Periodically doing a prefix scan on the remote storage to find and 
>> register unknown files, in order to handle potential failures in file 
>> information delivery. The period of this scan will be configurable and 
>> ideally should be quite long.
>> 3. Checking with the relevant partition leaders (after a grace period) 
>> whether they still have batches in a particular file.
>> 4. Physically deleting files when they are no longer referenced by any 
>> partition; a sketch of this lifecycle follows the list.
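>> 
>> Below is a minimal sketch of that deletion lifecycle. The names 
>> (WalFileInfo, leaderStillReferences, deleteFromRemoteStorage) are 
>> illustrative and not from the KIP.
>> 
>> // A minimal sketch of the WAL File Manager deletion check,
>> // with illustrative names that are not defined in the KIP.
>> import java.time.Duration;
>> import java.time.Instant;
>> import java.util.List;
>> 
>> public class WalFileManager {
>>     record WalFileInfo(String walFileName, Instant committedAt, List<String> referencedPartitions) {}
>> 
>>     private final Duration gracePeriod = Duration.ofHours(1); // would be configurable
>> 
>>     // Hypothetical RPC: does the partition leader still have batches in the file?
>>     boolean leaderStillReferences(String partition, String walFileName) { return false; }
>> 
>>     // Hypothetical physical deletion on the remote storage.
>>     void deleteFromRemoteStorage(String walFileName) {}
>> 
>>     // Delete a WAL file only after the grace period has passed and no
>>     // partition leader references any batch in it anymore.
>>     public void maybeDelete(WalFileInfo file) {
>>         if (Instant.now().isBefore(file.committedAt().plus(gracePeriod))) {
>>             return; // still within the grace period
>>         }
>>         boolean stillReferenced = file.referencedPartitions().stream()
>>             .anyMatch(p -> leaderStillReferences(p, file.walFileName()));
>>         if (!stillReferenced) {
>>             deleteFromRemoteStorage(file.walFileName());
>>         }
>>     }
>> }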
>> 
>> This new design offers the following advantages:
>> 1. It simplifies the implementation of many Kafka features such as 
>> idempotence, transactions, queues, tiered storage, and retention. We no 
>> longer need to abstract away and reuse the code from partition leaders in 
>> the batch coordinator. Instead, we will literally use the same code paths in 
>> leaders, with little adaptation. Workflows from classic topics mostly remain 
>> unchanged.
>> For example, it seems that 
>> ReplicaManager.maybeSendPartitionsToTransactionCoordinator and 
>> KafkaApis.handleWriteTxnMarkersRequest, used for transaction support on the 
>> partition leader side, could be used for diskless topics with little 
>> adaptation. ProducerStateManager, needed for both idempotent produce and 
>> transactions, would be reused.
>> Another example is share group support, where the share partition leader, 
>> being co-located with the partition leader, would execute the same logic for 
>> both diskless and classic topics.
>> 2. It returns to the familiar partition-based scaling model, where 
>> partitions are independent.
>> 3. It makes the operation and failure patterns closer to the familiar ones 
>> from classic topics.
>> 4. It opens a straightforward path to seamlessly switching topics between 
>> the diskless and classic modes.
>> 
>> Everything else remains unchanged compared to the previous Diskless design 
>> (after all previous discussions): local segment materialization by replicas, 
>> the consume path, tiered storage integration, etc.
>> 
>> If the community finds this design more suitable, we will update the KIP(s) 
>> accordingly and continue working on it. Please let us know what you think.
>> 
>> Best regards,
>> Ivan and Diskless team
>> 
>> On Mon, Sep 29, 2025, at 15:06, Ivan Yurchenko wrote:
>> > Hi Justine,
>> > 
>> > Yes, you're right. We need to track the aborted transactions in the 
>> > diskless coordinator for as long as the corresponding offsets are there. 
>> > With the tiered storage unification Greg mentioned earlier, this will be 
>> > finite time even for infinite data retention.
>> > 
>> > Best,
>> > Ivan
>> > 
>> > On Wed, Sep 17, 2025, at 19:41, Justine Olshan wrote:
>> > > Hey Ivan,
>> > > 
>> > > Thanks for the response. I think most of what you said made sense, but I
>> > > did have some questions about this part:
>> > > 
>> > > > As we understand this, the partition leader in classic topics forgets
>> > > about a transaction once it’s replicated (HWM overpasses it). The
>> > > transaction coordinator acts like the main guardian, allowing partition
>> > > leaders to do this safely. Please correct me if this is wrong. We think
>> > > about relying on this with the batch coordinator and delete the 
>> > > information
>> > > about a transaction once it’s finished (as there’s no replication and HWM
>> > > advances immediately).
>> > > 
>> > > I didn't quite understand this. In classic topics, we have maps for 
>> > > ongoing
>> > > transactions which remove state when the transaction is completed and an
>> > > aborted transactions index which is retained for much longer. Once the
>> > > transaction is completed, the coordinator is no longer involved in
>> > > maintaining this partition side state, and it is subject to compaction 
>> > > etc.
>> > > Looking back at the outline provided above, I didn't see much about the
>> > > fetch path, so maybe that could be expanded a bit further. I saw the
>> > > following in a response:
>> > > > When the broker constructs a fully valid local segment, all the 
>> > > > necessary
>> > > control batches will be inserted and indices, including the transaction
>> > > index will be built to serve FetchRequests exactly as they are today.
>> > > 
>> > > Based on this, it seems like we need to retain the information about
>> > > aborted txns for longer.
>> > > 
>> > > Thanks,
>> > > Justine
>> > > 
>> > > On Mon, Sep 15, 2025 at 9:43 AM Ivan Yurchenko <[email protected]> wrote:
>> > > 
>> > > > Hi Justine and all,
>> > > >
>> > > > Thank you for your questions!
>> > > >
>> > > > > JO 1. >Since a transaction could be uniquely identified with 
>> > > > > producer ID
>> > > > > and epoch, the positive result of this check could be cached locally
>> > > > > Are we saying that only new transaction version 2 transactions can be
>> > > > used
>> > > > > here? If not, we can't uniquely identify transactions with producer 
>> > > > > id +
>> > > > > epoch
>> > > >
>> > > > You’re right that we (probably unintentionally) focused only on 
>> > > > version 2.
>> > > > We can either limit the support to version 2 or consider using some
>> > > > surrogates to support version 1.
>> > > >
>> > > > > JO 2. >The batch coordinator does the final transactional checks of 
>> > > > > the
>> > > > > batches. This procedure would output the same errors like the 
>> > > > > partition
>> > > > > leader in classic topics would do.
>> > > > > Can you expand on what these checks are? Would you be checking if the
>> > > > > transaction was still ongoing for example?* *
>> > > >
>> > > > Yes, the producer epoch, that the transaction is ongoing, and of course
>> > > > the normal idempotence checks. What the partition leader in the classic
>> > > > topics does before appending a batch to the local log (e.g. in
>> > > > UnifiedLog.maybeStartTransactionVerification and
>> > > > UnifiedLog.analyzeAndValidateProducerState). In Diskless, we 
>> > > > unfortunately
>> > > > cannot do these checks before appending the data to the WAL segment and
>> > > > uploading it, but we can “tombstone” these batches in the batch 
>> > > > coordinator
>> > > > during the final commit.
>> > > >
>> > > > > Is there state about ongoing
>> > > > > transactions in the batch coordinator? I see some other state 
>> > > > > mentioned
>> > > > in
>> > > > > the End transaction section, but it's not super clear what state is
>> > > > stored
>> > > > > and when it is stored.
>> > > >
>> > > > Right, this should have been more explicit. As the partition leader 
>> > > > tracks
>> > > > ongoing transactions for classic topics, the batch coordinator has to 
>> > > > as
>> > > > well. So when a transaction starts and ends, the transaction 
>> > > > coordinator
>> > > > must inform the batch coordinator about this.
>> > > >
>> > > > > JO 3. I didn't see anything about maintaining LSO -- perhaps that 
>> > > > > would
>> > > > be
>> > > > > stored in the batch coordinator?
>> > > >
>> > > > Yes. This could be deduced from the committed batches and other
>> > > > information, but for the sake of performance we’d better store it
>> > > > explicitly.
>> > > >
>> > > > > JO 4. Are there any thoughts about how long transactional state is
>> > > > > maintained in the batch coordinator and how it will be cleaned up?
>> > > >
>> > > > As we understand this, the partition leader in classic topics forgets
>> > > > about a transaction once it’s replicated (HWM overpasses it). The
>> > > > transaction coordinator acts like the main guardian, allowing partition
>> > > > leaders to do this safely. Please correct me if this is wrong. We think
>> > > > about relying on this with the batch coordinator and delete the 
>> > > > information
>> > > > about a transaction once it’s finished (as there’s no replication and 
>> > > > HWM
>> > > > advances immediately).
>> > > >
>> > > > Best,
>> > > > Ivan
>> > > >
>> > > > On Tue, Sep 9, 2025, at 00:38, Justine Olshan wrote:
>> > > > > Hey folks,
>> > > > >
>> > > > > Excited to see some updates related to transactions!
>> > > > >
>> > > > > I had a few questions.
>> > > > >
>> > > > > JO 1. >Since a transaction could be uniquely identified with 
>> > > > > producer ID
>> > > > > and epoch, the positive result of this check could be cached locally
>> > > > > Are we saying that only new transaction version 2 transactions can be
>> > > > used
>> > > > > here? If not, we can't uniquely identify transactions with producer 
>> > > > > id +
>> > > > > epoch
>> > > > >
>> > > > > JO 2. >The batch coordinator does the final transactional checks of 
>> > > > > the
>> > > > > batches. This procedure would output the same errors like the 
>> > > > > partition
>> > > > > leader in classic topics would do.
>> > > > > Can you expand on what these checks are? Would you be checking if the
>> > > > > transaction was still ongoing for example? Is there state about 
>> > > > > ongoing
>> > > > > transactions in the batch coordinator? I see some other state 
>> > > > > mentioned
>> > > > in
>> > > > > the End transaction section, but it's not super clear what state is
>> > > > stored
>> > > > > and when it is stored.
>> > > > >
>> > > > > JO 3. I didn't see anything about maintaining LSO -- perhaps that 
>> > > > > would
>> > > > be
>> > > > > stored in the batch coordinator?
>> > > > >
>> > > > > JO 4. Are there any thoughts about how long transactional state is
>> > > > > maintained in the batch coordinator and how it will be cleaned up?
>> > > > >
>> > > > > On Mon, Sep 8, 2025 at 10:38 AM Jun Rao <[email protected]>
>> > > > wrote:
>> > > > >
>> > > > > > Hi, Greg and Ivan,
>> > > > > >
>> > > > > > Thanks for the update. A few comments.
>> > > > > >
>> > > > > > JR 10. "Consumer fetches are now served from local segments, making
>> > > > use of
>> > > > > > the
>> > > > > > indexes, page cache, request purgatory, and zero-copy functionality
>> > > > already
>> > > > > > built into classic topics."
>> > > > > > JR 10.1 Does the broker build the producer state for each 
>> > > > > > partition in
>> > > > > > diskless topics?
>> > > > > > JR 10.2 For transactional data, the consumer fetches need to know
>> > > > aborted
>> > > > > > records. How is that achieved?
>> > > > > >
>> > > > > > JR 11. "The batch coordinator saves that the transaction is 
>> > > > > > finished
>> > > > and
>> > > > > > also inserts the control batches in the corresponding logs of the
>> > > > involved
>> > > > > > Diskless topics. This happens only on the metadata level, no actual
>> > > > control
>> > > > > > batches are written to any file. "
>> > > > > > A fetch response could include multiple transactional batches. How
>> > > > does the
>> > > > > > broker obtain the information about the ending control batch for 
>> > > > > > each
>> > > > > > batch? Does that mean that a fetch response needs to be built by
>> > > > > > stitching record batches and generated control batches together?
>> > > > > >
>> > > > > > JR 12. Queues: Is there still a share partition leader that all
>> > > > consumers
>> > > > > > are routed to?
>> > > > > >
>> > > > > > JR 13. "Should the KIPs be modified to include this or it's too
>> > > > > > implementation-focused?" It would be useful to include enough 
>> > > > > > details
>> > > > to
>> > > > > > understand correctness and performance impact.
>> > > > > >
>> > > > > > HC5. Henry has a valid point. Requests from a given producer 
>> > > > > > contain a
>> > > > > > sequence number, which is ordered. If a producer sends every 
>> > > > > > Produce
>> > > > > > request to an arbitrary broker, those requests could reach the 
>> > > > > > batch
>> > > > > > coordinator in different order and lead to rejection of the produce
>> > > > > > requests.
>> > > > > >
>> > > > > > Jun
>> > > > > >
>> > > > > > On Thu, Sep 4, 2025 at 12:00 AM Ivan Yurchenko <[email protected]> 
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Hi all,
>> > > > > > >
>> > > > > > > We have also thought in a bit more details about transactions and
>> > > > queues,
>> > > > > > > here's the plan.
>> > > > > > >
>> > > > > > > *Transactions*
>> > > > > > >
>> > > > > > > The support for transactions in *classic topics* is based on 
>> > > > > > > precise
>> > > > > > > interactions between three actors: clients (mostly producers, but
>> > > > also
>> > > > > > > consumers), brokers (ReplicaManager and other classes), and
>> > > > transaction
>> > > > > > > coordinators. Brokers also run partition leaders with their local
>> > > > state
>> > > > > > > (ProducerStateManager and others).
>> > > > > > >
>> > > > > > > The high level (some details skipped) workflow is the following.
>> > > > When a
>> > > > > > > transactional Produce request is received by the broker:
>> > > > > > > 1. For each partition, the partition leader checks if a non-empty
>> > > > > > > transaction is running for this partition. This is done using its
>> > > > local
>> > > > > > > state derived from the log metadata (ProducerStateManager,
>> > > > > > > VerificationStateEntry, VerificationGuard).
>> > > > > > > 2. The transaction coordinator is informed about all the 
>> > > > > > > partitions
>> > > > that
>> > > > > > > aren’t part of the transaction to include them.
>> > > > > > > 3. The partition leaders do additional transactional checks.
>> > > > > > > 4. The partition leaders append the transactional data to their 
>> > > > > > > logs
>> > > > and
>> > > > > > > update some of their state (for example, log the fact that the
>> > > > > > transaction
>> > > > > > > is running for the partition and its first offset).
>> > > > > > >
>> > > > > > > When the transaction is committed or aborted:
>> > > > > > > 1. The producer contacts the transaction coordinator directly 
>> > > > > > > with
>> > > > > > > EndTxnRequest.
>> > > > > > > 2. The transaction coordinator writes PREPARE_COMMIT or
>> > > > PREPARE_ABORT to
>> > > > > > > its log and responds to the producer.
>> > > > > > > 3. The transaction coordinator sends WriteTxnMarkersRequest to 
>> > > > > > > the
>> > > > > > leaders
>> > > > > > > of the involved partitions.
>> > > > > > > 4. The partition leaders write the transaction markers to their 
>> > > > > > > logs
>> > > > and
>> > > > > > > respond to the coordinator.
>> > > > > > > 5. The coordinator writes the final transaction state
>> > > > COMPLETE_COMMIT or
>> > > > > > > COMPLETE_ABORT.
>> > > > > > >
>> > > > > > > In classic topics, partitions have leaders and lots of important
>> > > > state
>> > > > > > > necessary for supporting this workflow is local. The main 
>> > > > > > > challenge
>> > > > in
>> > > > > > > mapping this to Diskless comes from the fact there are no 
>> > > > > > > partition
>> > > > > > > leaders, so the corresponding pieces of state need to be 
>> > > > > > > globalized
>> > > > in
>> > > > > > the
>> > > > > > > batch coordinator. We are already doing this to support 
>> > > > > > > idempotent
>> > > > > > produce.
>> > > > > > >
>> > > > > > > The high level workflow for *diskless topics* would look very
>> > > > similar:
>> > > > > > > 1. For each partition, the broker checks if a non-empty 
>> > > > > > > transaction
>> > > > is
>> > > > > > > running for this partition. In contrast to classic topics, this 
>> > > > > > > is
>> > > > > > checked
>> > > > > > > against the batch coordinator with a single RPC. Since a 
>> > > > > > > transaction
>> > > > > > could
>> > > > > > > be uniquely identified with producer ID and epoch, the positive
>> > > > result of
>> > > > > > > this check could be cached locally (for the double configured
>> > > > duration
>> > > > > > of a
>> > > > > > > transaction, for example).
>> > > > > > > 2. The same: The transaction coordinator is informed about all 
>> > > > > > > the
>> > > > > > > partitions that aren’t part of the transaction to include them.
>> > > > > > > 3. No transactional checks are done on the broker side.
>> > > > > > > 4. The broker appends the transactional data to the current 
>> > > > > > > shared
>> > > > WAL
>> > > > > > > segment. It doesn’t update any transaction-related state for 
>> > > > > > > Diskless
>> > > > > > > topics, because it doesn’t have any.
>> > > > > > > 5. The WAL segment is committed to the batch coordinator like in 
>> > > > > > > the
>> > > > > > > normal produce flow.
>> > > > > > > 6. The batch coordinator does the final transactional checks of 
>> > > > > > > the
>> > > > > > > batches. This procedure would output the same errors like the
>> > > > partition
>> > > > > > > leader in classic topics would do. I.e. some batches could be
>> > > > rejected.
>> > > > > > > This means, there will potentially be garbage in the WAL segment
>> > > > file in
>> > > > > > > case of transactional errors. This is preferable to doing more
>> > > > network
>> > > > > > > round trips, especially considering the WAL segments will be
>> > > > relatively
>> > > > > > > short-living (see the Greg's update above).
>> > > > > > >
>> > > > > > > When the transaction is committed or aborted:
>> > > > > > > 1. The producer contacts the transaction coordinator directly 
>> > > > > > > with
>> > > > > > > EndTxnRequest.
>> > > > > > > 2. The transaction coordinator writes PREPARE_COMMIT or
>> > > > PREPARE_ABORT to
>> > > > > > > its log and responds to the producer.
>> > > > > > > 3. *[NEW]* The transaction coordinator informs the batch 
>> > > > > > > coordinator
>> > > > that
>> > > > > > > the transaction is finished.
>> > > > > > > 4. *[NEW]* The batch coordinator saves that the transaction is
>> > > > finished
>> > > > > > > and also inserts the control batches in the corresponding logs 
>> > > > > > > of the
>> > > > > > > involved Diskless topics. This happens only on the metadata 
>> > > > > > > level, no
>> > > > > > > actual control batches are written to any file. They will be
>> > > > dynamically
>> > > > > > > created on Fetch and other read operations. We could technically
>> > > > write
>> > > > > > > these control batches for real, but this would mean extra produce
>> > > > > > latency,
>> > > > > > > so it's better just to mark them in the batch coordinator and 
>> > > > > > > save
>> > > > these
>> > > > > > > milliseconds.
>> > > > > > > 5. The transaction coordinator sends WriteTxnMarkersRequest to 
>> > > > > > > the
>> > > > > > leaders
>> > > > > > > of the involved partitions – now only to classic topics.
>> > > > > > > 6. The partition leaders of classic topics write the transaction
>> > > > markers
>> > > > > > > to their logs and respond to the coordinator.
>> > > > > > > 7. The coordinator writes the final transaction state
>> > > > COMPLETE_COMMIT or
>> > > > > > > COMPLETE_ABORT.
>> > > > > > >
>> > > > > > > Compared to the non-transactional produce flow, we get:
>> > > > > > > 1. An extra network round trip between brokers and the batch
>> > > > coordinator
>> > > > > > > when a new partition appear in the transaction. To mitigate the
>> > > > impact of
>> > > > > > > them:
>> > > > > > >   - The results will be cached.
>> > > > > > >   - The calls for multiple partitions in one Produce request 
>> > > > > > > will be
>> > > > > > > grouped.
>> > > > > > >   - The batch coordinator should be optimized for fast response 
>> > > > > > > to
>> > > > these
>> > > > > > > RPCs.
>> > > > > > >   - The fact that a single producer normally will communicate 
>> > > > > > > with a
>> > > > > > > single broker for the duration of the transaction further 
>> > > > > > > reduces the
>> > > > > > > expected number of round trips.
>> > > > > > > 2. An extra round trip between the transaction coordinator and 
>> > > > > > > batch
>> > > > > > > coordinator when a transaction is finished.
>> > > > > > >
>> > > > > > > With this proposal, transactions will also be able to span both
>> > > > classic
>> > > > > > > and Diskless topics.
>> > > > > > >
>> > > > > > > *Queues*
>> > > > > > >
>> > > > > > > The share group coordination and management is a side job that
>> > > > doesn't
>> > > > > > > interfere with the topic itself (leadership, replicas, physical
>> > > > storage
>> > > > > > of
>> > > > > > > records, etc.) and non-queue producers and consumers (Fetch and
>> > > > Produce
>> > > > > > > RPCs, consumer group-related RPCs are not affected.) We don't 
>> > > > > > > see any
>> > > > > > > reason why we can't make Diskless topics compatible with share
>> > > > groups the
>> > > > > > > same way as classic topics are. Even on the code level, we don't
>> > > > expect
>> > > > > > any
>> > > > > > > serious refactoring: the same reading routines are used that are
>> > > > used for
>> > > > > > > fetching (e.g. ReplicaManager.readFromLog).
>> > > > > > >
>> > > > > > >
>> > > > > > > Should the KIPs be modified to include this or it's too
>> > > > > > > implementation-focused?
>> > > > > > >
>> > > > > > > Best regards,
>> > > > > > > Ivan
>> > > > > > >
>> > > > > > > On Wed, Sep 3, 2025, at 21:59, Greg Harris wrote:
>> > > > > > > > Hi all,
>> > > > > > > >
>> > > > > > > > Thank you all for your questions and design input on KIP-1150.
>> > > > > > > >
>> > > > > > > > We have just updated KIP-1150 and KIP-1163 with a new design. 
>> > > > > > > > To
>> > > > > > > summarize
>> > > > > > > > the changes:
>> > > > > > > >
>> > > > > > > > 1. The design prioritizes integrating with the existing KIP-405
>> > > > Tiered
>> > > > > > > > Storage interfaces, permitting data produced to a Diskless 
>> > > > > > > > topic
>> > > > to be
>> > > > > > > > moved to tiered storage.
>> > > > > > > > This lowers the scalability requirements for the Batch 
>> > > > > > > > Coordinator
>> > > > > > > > component, and allows Diskless to compose with Tiered Storage
>> > > > plugin
>> > > > > > > > features such as encryption and alternative data formats.
>> > > > > > > >
>> > > > > > > > 2. Consumer fetches are now served from local segments, making 
>> > > > > > > > use
>> > > > of
>> > > > > > the
>> > > > > > > > indexes, page cache, request purgatory, and zero-copy 
>> > > > > > > > functionality
>> > > > > > > already
>> > > > > > > > built into classic topics.
>> > > > > > > > However, local segments are now considered cache elements, do 
>> > > > > > > > not
>> > > > need
>> > > > > > to
>> > > > > > > > be durably stored, and can be built without contacting any 
>> > > > > > > > other
>> > > > > > > replicas.
>> > > > > > > >
>> > > > > > > > 3. The design has been simplified substantially, by removing 
>> > > > > > > > the
>> > > > > > previous
>> > > > > > > > Diskless consume flow, distributed cache component, and "object
>> > > > > > > > compaction/merging" step.
>> > > > > > > >
>> > > > > > > > The design maintains leaderless produces as enabled by the 
>> > > > > > > > Batch
>> > > > > > > > Coordinator, and the same latency profiles as the earlier 
>> > > > > > > > design,
>> > > > while
>> > > > > > > > being simpler and integrating better into the existing 
>> > > > > > > > ecosystem.
>> > > > > > > >
>> > > > > > > > Thanks, and we are eager to hear your feedback on the new 
>> > > > > > > > design.
>> > > > > > > > Greg Harris
>> > > > > > > >
>> > > > > > > > On Mon, Jul 21, 2025 at 3:30 PM Jun Rao 
>> > > > > > > > <[email protected]>
>> > > > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Hi, Jan,
>> > > > > > > > >
>> > > > > > > > > For me, the main gap of KIP-1150 is the support of all 
>> > > > > > > > > existing
>> > > > > > client
>> > > > > > > > > APIs. Currently, there is no design for supporting APIs like
>> > > > > > > transactions
>> > > > > > > > > and queues.
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > >
>> > > > > > > > > Jun
>> > > > > > > > >
>> > > > > > > > > On Mon, Jul 21, 2025 at 3:53 AM Jan Siekierski
>> > > > > > > > > <[email protected]> wrote:
>> > > > > > > > >
>> > > > > > > > > > Would it be a good time to ask for the current status of 
>> > > > > > > > > > this
>> > > > KIP?
>> > > > > > I
>> > > > > > > > > > haven't seen much activity here for the past 2 months, the
>> > > > vote got
>> > > > > > > > > vetoed
>> > > > > > > > > > but I think the pending questions have been answered since
>> > > > then.
>> > > > > > > KIP-1183
>> > > > > > > > > > (AutoMQ's proposal) also didn't have any activity since 
>> > > > > > > > > > May.
>> > > > > > > > > >
>> > > > > > > > > > In my eyes KIP-1150 and KIP-1183 are two real choices that 
>> > > > > > > > > > can
>> > > > be
>> > > > > > > > > > made, with a coordinator-based approach being by far the
>> > > > dominant
>> > > > > > one
>> > > > > > > > > when
>> > > > > > > > > > it comes to market adoption - but all these are standalone
>> > > > > > products.
>> > > > > > > > > >
>> > > > > > > > > > I'm a big fan of both approaches, but would hate to see a
>> > > > stall. So
>> > > > > > > the
>> > > > > > > > > > question is: can we get an update?
>> > > > > > > > > >
>> > > > > > > > > > Maybe it's time to start another vote? Colin McCabe - have 
>> > > > > > > > > > your
>> > > > > > > questions
>> > > > > > > > > > been answered? If not, is there anything I can do to help? 
>> > > > > > > > > > I'm
>> > > > > > deeply
>> > > > > > > > > > familiar with both architectures and have written about 
>> > > > > > > > > > both?
>> > > > > > > > > >
>> > > > > > > > > > Kind regards,
>> > > > > > > > > > Jan
>> > > > > > > > > >
>> > > > > > > > > > On Tue, Jun 24, 2025 at 10:42 AM Stanislav Kozlovski <
>> > > > > > > > > > [email protected]> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > I have some nits - it may be useful to
>> > > > > > > > > > >
>> > > > > > > > > > > a) group all the KIP email threads in the main one (just 
>> > > > > > > > > > > a
>> > > > bunch
>> > > > > > of
>> > > > > > > > > links
>> > > > > > > > > > > to everything)
>> > > > > > > > > > > b) create the email threads
>> > > > > > > > > > >
>> > > > > > > > > > > It's a bit hard to track it all - for example, I was
>> > > > searching
>> > > > > > for
>> > > > > > > a
>> > > > > > > > > > > discuss thread for KIP-1165 for a while; As far as I can
>> > > > tell, it
>> > > > > > > > > doesn't
>> > > > > > > > > > > exist yet.
>> > > > > > > > > > >
>> > > > > > > > > > > Since the KIPs are published (by virtue of having the 
>> > > > > > > > > > > root
>> > > > KIP be
>> > > > > > > > > > > published, having a DISCUSS thread and links to sub-KIPs
>> > > > where
>> > > > > > were
>> > > > > > > > > aimed
>> > > > > > > > > > > to move the discussion towards), I think it would be 
>> > > > > > > > > > > good to
>> > > > > > create
>> > > > > > > > > > DISCUSS
>> > > > > > > > > > > threads for them all.
>> > > > > > > > > > >
>> > > > > > > > > > > Best,
>> > > > > > > > > > > Stan
>> > > > > > > > > > >
>> > > > > > > > > > > On 2025/04/16 11:58:22 Josep Prat wrote:
>> > > > > > > > > > > > Hi Kafka Devs!
>> > > > > > > > > > > >
>> > > > > > > > > > > > We want to start a new KIP discussion about 
>> > > > > > > > > > > > introducing a
>> > > > new
>> > > > > > > type of
>> > > > > > > > > > > > topics that would make use of Object Storage as the 
>> > > > > > > > > > > > primary
>> > > > > > > source of
>> > > > > > > > > > > > storage. However, as this KIP is big we decided to 
>> > > > > > > > > > > > split it
>> > > > > > into
>> > > > > > > > > > multiple
>> > > > > > > > > > > > related KIPs.
>> > > > > > > > > > > > We have the motivational KIP-1150 (
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
>> > > > > > > > > > > )
>> > > > > > > > > > > > that aims to discuss if Apache Kafka should aim to have
>> > > > this
>> > > > > > > type of
>> > > > > > > > > > > > feature at all. This KIP doesn't go onto details on 
>> > > > > > > > > > > > how to
>> > > > > > > implement
>> > > > > > > > > > it.
>> > > > > > > > > > > > This follows the same approach used when we discussed
>> > > > KRaft.
>> > > > > > > > > > > >
>> > > > > > > > > > > > But as we know that it is sometimes really hard to 
>> > > > > > > > > > > > discuss
>> > > > on
>> > > > > > > that
>> > > > > > > > > meta
>> > > > > > > > > > > > level, we also created several sub-kips (linked in
>> > > > KIP-1150)
>> > > > > > that
>> > > > > > > > > offer
>> > > > > > > > > > > an
>> > > > > > > > > > > > implementation of this feature.
>> > > > > > > > > > > >
>> > > > > > > > > > > > We kindly ask you to use the proper DISCUSS threads for
>> > > > each
>> > > > > > > type of
>> > > > > > > > > > > > concern and keep this one to discuss whether Apache 
>> > > > > > > > > > > > Kafka
>> > > > wants
>> > > > > > > to
>> > > > > > > > > have
>> > > > > > > > > > > > this feature or not.
>> > > > > > > > > > > >
>> > > > > > > > > > > > Thanks in advance on behalf of all the authors of this 
>> > > > > > > > > > > > KIP.
>> > > > > > > > > > > >
>> > > > > > > > > > > > ------------------
>> > > > > > > > > > > > Josep Prat
>> > > > > > > > > > > > Open Source Engineering Director, Aiven
>> > > > > > > > > > > > [email protected]   |   +491715557497 | aiven.io
>> > > > > > > > > > > > Aiven Deutschland GmbH
>> > > > > > > > > > > > Alexanderufer 3-7, 10117 Berlin
>> > > > > > > > > > > > Geschäftsführer: Oskari Saarenmaa, Hannu Valtonen,
>> > > > > > > > > > > > Anna Richardson, Kenneth Chen
>> > > > > > > > > > > > Amtsgericht Charlottenburg, HRB 209739 B
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > > 
>> > 
>> 
> 
