Hi, Viktor and Greg,

Thanks for the reply.

JR1.
1) Thanks for verifying the cost estimation. I noticed a bug in my earlier
calculation. I estimated the per-broker network transfer rate at 2MB/sec.
It should be 4MB/sec. If I correct it, the estimated savings are similar to
yours.
The cost for transferring 4MB through the network is 4 * 2 * 10^-5 =
$8 * 10^-5.
If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings are
about 87.5%.
If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings are
62.5%.
Savings are still significantly lower when using RLMM.
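
The arithmetic above can be reproduced with a minimal sketch. It assumes
the AWS prices quoted earlier in this thread ($0.02/GB network transfer,
$0.005 per 1000 S3 puts, 1GB = 1000MB); the function and constant names
are made up for illustration and are not part of any KIP.

```python
# Prices quoted earlier in this thread (AWS); treat them as assumptions.
NETWORK_COST_PER_MB = 2e-5   # $0.02/GB, with 1GB = 1000MB
S3_PUT_COST = 0.5e-5         # $0.005 per 1000 put requests


def savings_fraction(mb_transferred: float, s3_puts: int) -> float:
    """Fraction of the network replication cost saved by using S3 puts instead."""
    network_cost = mb_transferred * NETWORK_COST_PER_MB
    put_cost = s3_puts * S3_PUT_COST
    return 1 - put_cost / network_cost


# Compare 2 puts vs 6 puts for every 4MB of produced data.
for puts in (2, 6):
    print(f"{puts} puts per 4MB: {savings_fraction(4, puts):.1%} saved")
```

Passing 2 instead of 4 as the first argument gives the 75% and 25% figures
from the earlier 2MB estimate.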

"To me it seems like that Greg's previous suggestion for a 15 min rollover
may be a bit too much. With 1 hour we can achieve better cost saving and
less coordinate metadata being stored."
This solves the cost issue, but it has other implications (see point 2
below).

2) "Yes, I think this is to be expected and a lot depends on the
implementation. Ideally segments or chunks should be cached to minimize the
number of times segments pulled from remote storage."
In a classic topic, when a consumer lags, its requests are served either
from the local cache or from large objects in the object store. With the
current design in a diskless topic, lagging consumer requests might be
served from tiny 500-byte objects. This will significantly slow down the
consumer's catch-up, which is not what users expect. Ideally, we don't
want those tiny objects to last more than a few minutes, let alone an hour.

3) "I think if my calculations are correct (and we use a 60 minute window),
then metadata generation should be slower, please see the google sheet I
linked above. I think given that traffic, the current topic based RLMM
should be able to handle it."
Why is a 60 minute window used? RLMM metadata needs to be retained for the
longest retention time among all topics, so the retention window can be
weeks instead of 1 hour. As a result, RLMM might need to replay over 100GB
of data during reassignment, which is not what it is designed for.
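
A rough sketch of how the replay volume scales with the roll interval and
the retention time, using the figures from earlier in this thread (100
bytes of tier metadata per partition per roll, 4K partitions per broker,
500 brokers). These inputs are assumptions, and the helper name is
hypothetical.

```python
def replay_bytes(roll_seconds, retention_days,
                 bytes_per_entry=100, partitions_per_broker=4_000, brokers=500):
    """Tier metadata a broker has to replay if it reads the whole retained log."""
    # One metadata entry per partition per segment roll, cluster-wide.
    entries_per_second = partitions_per_broker * brokers / roll_seconds
    return bytes_per_entry * entries_per_second * retention_days * 24 * 3600


# (roll interval in minutes, metadata retention in days)
for roll_min, days in [(15, 7), (60, 7), (60, 28)]:
    gb = replay_bytes(roll_min * 60, days) / 1e9
    print(f"roll={roll_min}min retention={days}d -> {gb:.1f} GB")
```

Without rounding the rate to 200KB/sec, the 15-minute roll with 7 days of
retention comes out at about 134GB rather than 120GB. The replay volume
grows linearly with the retention time and shrinks linearly with the roll
interval.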

JR10. "Your example of 100,000 1kb/s partitions is a borderline case, where
there are some configurations which are not viable due to scale or cost,
and some that are. It would be up to the operator to tune their cluster, by
changing diskless.segment.ms,
dividing up the cluster, or switching to a more scalable RLMM
implementation."
A broker with 4MB/sec produce throughput can probably be considered high
throughput. Even with 4K partitions per broker, we could still achieve an
87.5% cost saving as listed above, with the right implementation. So,
ideally, it would be useful to support that as well.

JR11. "We had a short conversation with Greg and we came to the conclusion
that because of the explosiveness of diskless metadata, it may be worth
revisiting the merging case as it can indeed buy us some more cost saving
for the added complexity. "
If we support merging in the diskless coordinator, I wonder how useful RLMM
is. It seems simpler to manage all metadata from the object store in a
single place.

Jun

On Mon, Apr 27, 2026 at 4:17 PM Greg Harris <[email protected]> wrote:

> Hi Jun,
>
> Thank you for scrutinizing the scalability of the current
> direct-to-tiered-storage strategy, and its metadata scalability.
>
> One of our implicit assumptions with this design was that users are able
> to choose between the Diskless and Classic mechanisms, and that in any
> situation where the Diskless design was deficient, Classic topics
> could continue to be used.
> This was originally applied to low-latency use-cases, but now also applies
> to low-throughput use-cases too. When the throughput on a topic is low, the
> benefit of using Diskless is also low, because it is proportional to the
> amount of data transferred, and it is more likely that the batch overhead
> of the topics is significant.
> In other words, we've been treating cost-effective support for arbitrarily
> low throughput topics as a non-goal.
>
> Your example of 100,000 1kb/s partitions is a borderline case, where there
> are some configurations which are not viable due to scale or cost, and some
> that are. It would be up to the operator to tune their cluster, by changing
> diskless.segment.ms,
> dividing up the cluster, or switching to a more scalable RLMM
> implementation.
>
> Do you think we should have cost-effective support for arbitrarily
> low-throughput partitions in Diskless? How much total demand is there in
> partitions where batches are >1kb but the partition throughput is <1kb/s?
>
> Thanks,
> Greg
>
> On Fri, Apr 24, 2026 at 10:23 AM Viktor Somogyi-Vass <[email protected]>
> wrote:
>
>> Hi Jun,
>>
>> Regarding JR1.
>> We had a short conversation with Greg and we came to the conclusion that
>> because of the explosiveness of diskless metadata, it may be worth
>> revisiting the merging case as it can indeed buy us some more cost saving
>> for the added complexity. Also, it would support smaller topics and we
>> could somewhat manage the tiered storage consolidation costs. I think that
>> we would still need to consolidate WAL segments into tiered storage.
>> Reasons are: to limit WAL metadata, to be able to dynamically
>> enable/disable diskless and to be compatible with existing and future TS
>> improvements.
>> I'll try to refresh KIP-1165 and build it into the calculator above (if
>> it's possible at all :) ) and come back to you.
>> Regardless, I just wanted to give a short update in the meantime, looking
>> forward to your answer.
>>
>> Best,
>> Viktor
>>
>> On Fri, Apr 24, 2026 at 3:46 PM Viktor Somogyi-Vass <
>> [email protected]>
>> wrote:
>>
>> > Hi Jun,
>> >
>> > Thanks for the quick reply.
>> >
>> > JR1.
>> > 1) Thanks for putting the numbers together. While your calculation
>> > seems to be correct in the sense that 6 PUTs would worsen the cost
>> saving
>> > benefits, I think that in a byte for byte comparison there is a bigger
>> > difference. The reason is that the 4 tiered storage puts transfer much
>> more
>> > data compared to the small WAL segments, so in practice there should be
>> > fewer TS puts.
>> > I made a google sheet calculator for this which I'd like to share with
>> > you:
>> >
>> https://docs.google.com/spreadsheets/d/127GOTWfFSN27B5ezif14GPj8KtrghjBqsXG9GG6NxhI/edit?gid=749470906#gid=749470906
>> > Please copy the sheet to modify the values.
>> > About my findings: I was trying to create a similar cluster model that
>> has
>> > been discussed here previously to see how cost varies over different
>> > segment rollovers. To me it seems like that Greg's previous suggestion
>> for a
>> > 15 min rollover may be a bit too much. With 1 hour we can achieve better
>> > cost saving and less coordinate metadata being stored. I have also
>> tried to
>> > account for the producer batch metadata generated by diskless partitions
>> > but to me it seems like a lower number than Greg's original numbers.
>> >
>> > 2) "Note that local storage could be lost on reassigned partitions. In
>> > that case, lagging reads can only be served from the object store."
>> > Yes, I think this is to be expected and a lot depends on the
>> > implementation. Ideally segments or chunks should be cached to minimize
>> the
>> > number of times segments pulled from remote storage.
>> >
>> > "The 2MB/sec I quoted is for a specific broker. Depending on the broker
>> > instance type, a broker may only be able to handle low 10s of MB/sec of
>> > data. So, 2MB/sec overhead is significant."
>> > Yes, I have indeed misunderstood, however I have updated my calculator
>> > sheet with metadata calculation. Overall, the number of tiered storage
>> > segments created seems to be much lower than in your calculations given
>> the
>> > parameters of the cluster you specified earlier. Please take a look, I'd
>> > like to really understand the thinking here because this is a crucial
>> point.
>> >
>> > 3) I think if my calculations are correct (and we use a 60 minute
>> window),
>> > then metadata generation should be slower, please see the google sheet I
>> > linked above. I think given that traffic, the current topic based RLMM
>> > should be able to handle it.
>> > In the case where we would need to make the RLMM capable of handling a
>> > similar traffic as the diskless coordinator, then you're right, we
>> probably
>> > should consider how we can improve it. I think there are multiple
>> > possibilities as you mentioned, but ideally there should be a common
>> > implementation for metadata coordination that could handle these cases.
>> >
>> > JR7.
>> > Yes, your expectation is totally reasonable, we should expect the get
>> and
>> > put operations to be strongly consistent for the read-after-write
>> > scenarios. And I think that since major cloud providers give strongly
>> > consistent object storages, it should be sufficient for a wide
>> user-group.
>> > So we could shrink the scope of the KIP a bit this way and avoid adding
>> > complexity that is needed mostly on the margin.
>> > I can expect though that "list" can stay eventually consistent as the
>> KIP
>> > relies on it for only garbage collection where it is fine if a few
>> segments
>> > can be collected only in the next iteration.
>> >
>> > JR3.
>> > Since Greg hasn't replied yet, I'll try to catch up with him and
>> formulate
>> > an answer next week.
>> >
>> > Best,
>> > Viktor
>> >
>> > On Tue, Apr 21, 2026 at 8:16 PM Jun Rao via dev <[email protected]>
>> > wrote:
>> >
>> >> Hi, Viktor,
>> >>
>> >> Thanks for the reply.
>> >>
>> >> JR1.
>> >> 1)  "So while it seems to be significant that we tripled the number of
>> >> PUTs, cost-wise it doesn't seem to be significant."
>> >> Let's compare the savings achieved by replacing network replication
>> >> transfer with S3 puts in AWS.
>> >> network transfer cost: $0.02/GB = $2 * 10^-5/MB
>> >> S3 put cost: $0.005 per 1000 requests = $0.5 * 10^-5/request
>> >>
>> >> The KIP batches data up to 4MB. So, let's assume that we write 2MB S3
>> >> objects on average.
>> >>
>> >> The cost for transferring 2MB through the network is 2 * 2 * 10^-5 =
>> >> $4 * 10^-5
>> >> If it's replaced with 2 S3 puts, the cost is $1 * 10^-5. The savings
>> are
>> >> about 75%.
>> >> If it's replaced with 6 S3 puts, the cost is $3 * 10^-5. The savings
>> are
>> >> 25%. As you can see, the savings are significantly lower.
>> >>
>> >> 2) "Therefore we could expect classic local segments to be present
>> which
>> >> could be used for catching up consumers."
>> >> Note that local storage could be lost on reassigned partitions. In that
>> >> case, lagging reads can only be served from the object store.
>> >>
>> >> "Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
>> >> throughput that Greg calculated previously, so I think it should be
>> >> manageable for a cluster with that amount of throughput,"
>> >> It seems that you didn't make the correct comparison. 2GB/s that Greg
>> >> mentioned is the throughput for the whole cluster. The 2MB/sec I
>> quoted is
>> >> for a specific broker. Depending on the broker instance type, a broker
>> may
>> >> only be able to handle low 10s of MB/sec of data. So, 2MB/sec overhead
>> is
>> >> significant.
>> >>
>> >> 3) "I'd separate it from the discussion of diskless core and perhaps we
>> >> could address it in a separate KIP as it is mostly a redesign of the
>> >> RLMM."
>> >> Those problems don't exist in the existing usage of RLMM. They manifest
>> >> because diskless tries to use RLMM in a way it wasn't designed for
>> (there
>> >> is at least a 20X increase in metadata). It would be useful to consider
>> >> whether fixing those problems in RLMM or using a new approach is
>> >> better. For example, KIP-1164 already introduces a snapshotting
>> mechanism.
>> >> Adding another snapshotting mechanism to RLMM seems redundant.
>> >>
>> >> JR7. A typical object store supports 3 operations: puts, gets and
>> lists.
>> >> Which operations used by diskless can be eventually consistent? I'd
>> expect
>> >> that get should always see the result of the latest put.
>> >>
>> >> Jun
>> >>
>> >> On Mon, Apr 20, 2026 at 8:14 AM Viktor Somogyi-Vass <[email protected]
>> >
>> >> wrote:
>> >>
>> >> > Hi Jun,
>> >> >
>> >> > I'd like to add my thoughts too until Greg has time to respond.
>> >> >
>> >> > JR1. I also think there are shortcomings in the current tiered
>> storage
>> >> > design, around the RLMM.
>> >> > 1) I think this is a correct observation, however if my calculations
>> are
>> >> > correct, it actually comes down to a negligible amount of cost.
>> Taking
>> >> the
>> >> > AWS pricing sheet at
>> >> >
>> >>
>> https://aws.amazon.com/s3/pricing/?nc2=h_pr_s3&trk=aebc39a1-139c-43bb-8354-211ac811b83a&sc_channel=ps
>> >> > it seems like the difference between 6 or 2 PUTs per second is ~$52
>> for
>> >> a
>> >> > month. The calculation follows
>> >> > as: 6*60*60*24*30*0.005/1000-2*60*60*24*30*0.005/1000=$51.84. So
>> while
>> >> it
>> >> > seems to be significant that we tripled the number of PUTs,
>> cost-wise it
>> >> > doesn't seem to be significant.
>> >> > 2) Reflecting on your original problem: the tiered storage
>> consolidation
>> >> > process should be continuously running and transforming WAL segments
>> >> into
>> >> > classic logs. Therefore we could expect classic local segments to be
>> >> > present which could be used for catching up consumers. So they would
>> >> only
>> >> > switch to WAL reading when they're close to the end of the log. Since
>> >> this
>> >> > offset space should be cached, the reads from there should be fast.
>> >> > Regarding the amount of metadata: 2MB/sec is well below the 2GB/s
>> >> > throughput that Greg calculated previously, so I think it should be
>> >> > manageable for a cluster with that amount of throughput, although I
>> >> agree
>> >> > with your comment that the current topic based tiered metadata
>> manager
>> >> > isn't optimal and we could develop a better solution.
>> >> > 3) Tied to the previous point, I agree that your comments are
>> absolutely
>> >> > valid, however similarly to that, I'd separate it from the
>> discussion of
>> >> > diskless core and perhaps we could address it in a separate KIP as
>> it is
>> >> > mostly a redesign of the RLMM.
>> >> >
>> >> > JR2. Ack. We will raise a KIP in the near future.
>> >> >
>> >> > JR3. I'd leave answering this to Greg as I don't have too much
>> context
>> >> on
>> >> > this one.
>> >> >
>> >> > JR7. I think this could be similar to the tiered storage design, so
>> any
>> >> > coordinator operation should be strongly consistent (since we're
>> using
>> >> > classic topics there). Therefore the WAL segment storage layer could
>> be
>> >> > eventually consistent as we store its metadata in a strongly
>> consistent
>> >> > manner. I'm not sure though if this was the answer you're looking
>> for?
>> >> >
>> >> > Best,
>> >> > Viktor
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Mar 26, 2026 at 11:43 PM Jun Rao via dev <
>> [email protected]>
>> >> > wrote:
>> >> >
>> >> >> Hi, Greg,
>> >> >>
>> >> >> Thanks for the reply.
>> >> >>
>> >> >> JR1. Rolling log segments every 15 minutes addresses the 3 concerns
>> I
>> >> >> listed, but it introduces some new issues because it doesn't quite
>> fit
>> >> the
>> >> >> design of the current tiered storage. (a) The current tiered storage
>> >> >> design
>> >> >> stores a single partition per object. If we roll a log segment
>> every 15
>> >> >> minutes, with 4K partitions per broker, this means an additional 4
>> S3
>> >> puts
>> >> >> per second. The diskless design aims for 2 S3 puts per second. So,
>> this
>> >> >> triples the S3 put cost and reduces the savings benefits. (b) With
>> Tier
>> >> >> storage, each broker essentially needs to read the tier metadata
>> from
>> >> all
>> >> >> tier metadata partitions if the number of user partitions exceeds
>> 50.
>> >> >> Assuming that we generate 100 bytes of tier metadata per partition
>> >> every
>> >> >> 15
>> >> >> minutes. Assuming that each broker has 4K partitions and a cluster
>> of
>> >> 500
>> >> >> brokers. Each broker needs to receive tier metadata at a rate of
>> >> >> 100 * 4K * 500 / (15 * 60) = 200KB/Sec. For a broker hosting one of the
>> >> >> 50 tier metadata topic partitions, it needs to send out metadata at
>> >> >> 100 * 4K * 500 / 50 * 500 / (15 * 60) = 2MB/Sec. This increases
>> >> >> unnecessary network and
>> >> >> CPU overhead. (c) Tier storage doesn't support snapshots. A
>> restarted
>> >> >> broker needs to replay the tier metadata log from the beginning to
>> >> build
>> >> >> the tier metadata state. Suppose that the tier metadata log is kept
>> >> for 7
>> >> >> days. The total amount of tier metadata that needs to be replayed is
>> >> >> 200KB * 7 * 24 * 3600 = 120GB.
>> >> >> Does the merging optimization you mentioned address those new
>> >> concerns? If
>> >> >> so, could you describe how it works?
>> >> >>
>> >> >> JR2. It's fine to cover the default partition assignment strategy
>> for
>> >> >> diskless topics in a separate KIP. However, since this is essential
>> for
>> >> >> achieving the cost saving goal, we need a solution before releasing
>> the
>> >> >> diskless KIP.
>> >> >>
>> >> >> JR3. Sounds good. Could you document how this works?
>> >> >>
>> >> >> JR7. Could you describe which parts of the operation can be
>> eventually
>> >> >> consistent?
>> >> >>
>> >> >> Jun
>> >> >>
>> >> >> On Thu, Mar 19, 2026 at 1:35 PM Greg Harris <[email protected]>
>> >> wrote:
>> >> >>
>> >> >> > Hi Jun,
>> >> >> >
>> >> >> > Thanks for your comments!
>> >> >> >
>> >> >> > JR1:
>> >> >> > You are correct that the segment rolling configurations are
>> currently
>> >> >> > critical to balance the scalability of Diskless and Tiered
>> Storage,
>> >> as
>> >> >> > larger roll configurations benefit tiered storage, and smaller
>> roll
>> >> >> > configurations benefit Diskless.
>> >> >> >
>> >> >> > To address your points specifically:
>> >> >> > (1) A Diskless topic which is cost-competitive with an equivalent
>> >> >> Classic
>> >> >> > topic will have a metadata size <1% of the data size. A cluster
>> >> storing
>> >> >> > 360GB of metadata will have >36TB of data under management and a
>> >> >> retention
>> >> >> > of 5hr implies a throughput of >2GB/s. This will require multiple
>> >> >> Diskless
>> >> >> > coordinators, which can share the load of storing the Diskless
>> >> metadata,
>> >> >> > and serving Diskless requests.
>> >> >> > (2) Catching up consumers are intended to be served from tiered
>> >> storage
>> >> >> > and local segment caches. Brokers which are building their local
>> >> segment
>> >> >> > caches will have to read many files, but will amortize those
>> reads by
>> >> >> > receiving data for multiple partitions in a single read.
>> >> >> > (3) This is a fundamental downside of storing data from multiple
>> >> topics
>> >> >> in
>> >> >> > a single object, similar to classic segments. We can implement a
>> >> >> > configurable cluster-wide maximum roll time, which would set the
>> >> slowest
>> >> >> > cadence at which Tiered Storage segments are rolled from Diskless
>> >> >> segments.
>> >> >> > If an individual partition has more aggressive roll settings, it
>> may
>> >> be
>> >> >> > rolled earlier.
>> >> >> > This configuration would permit the cluster operator to
>> approximately
>> >> >> > bound the number of diskless WAL segments, which bounds the total
>> >> size
>> >> >> of
>> >> >> > the WAL segments, disk cache, diskless coordinator state, and
>> >> excessive
>> >> >> > retention window. For example, a diskless.segment.ms of 15 minutes
>> >> >> would
>> >> >> > reduce the metadata storage to 18GB, WAL segments to 1.8TB, and
>> >> permit
>> >> >> > short-retention data to be physically deleted as soon as ~15
>> minutes
>> >> >> after
>> >> >> > being produced.
>> >> >> > Of course, this will reduce the size of the tiered storage
>> segments
>> >> for
>> >> >> > topics that have low throughput, and where segment.ms >
>> >> >> > diskless.segment.ms, increasing overhead in the RLMM. We can perform
>> >> >> > merging/optimization of Tiered Storage segments to achieve the
>> >> >> > per-topic segment.ms.
>> >> >> > There were some reasons why we retracted the prior file-merging
>> >> >> approach,
>> >> >> > and why merging in tiered storage appears better:
>> >> >> > * Rewriting files requires mutability for existing data, which
>> adds
>> >> >> > complexity. Diskless batches or Remote Log Segments would need to
>> be
>> >> >> made
>> >> >> > mutable, and the remote log will be made mutable in KIP-1272 [1]
>> >> >> > * Because a WAL Segment can contain batches from multiple Diskless
>> >> >> > Coordinators, multiple coordinators must also be involved in the
>> >> merging
>> >> >> > step. The Tiered Storage design has exclusive ownership for remote
>> >> log
>> >> >> > segments within the RLMM.
>> >> >> > * Diskless file merging competes for resources with
>> latency-sensitive
>> >> >> > producers and hot consumers. Tiered storage file merging competes
>> for
>> >> >> > resources with lagging consumers, which are typically less latency
>> >> >> > sensitive.
>> >> >> > * Implementing merging in Tiered Storage allows this optimization
>> to
>> >> >> > benefit both classic topics and diskless topics, covering both
>> high
>> >> and
>> >> >> low
>> >> >> > throughput partitions.
>> >> >> > * Remote log segments may be optimized over much longer time
>> windows
>> >> >> > rather than performing optimization once in the first few hours of
>> >> the
>> >> >> life
>> >> >> > of a WAL segment and then freezing the arrangement of the data
>> until
>> >> it
>> >> >> is
>> >> >> > deleted.
>> >> >> > * File merging will need to rely on heuristics, which should be
>> >> >> > configurable by the user. Multi-partition heuristics are more
>> >> >> complicated
>> >> >> > to describe and reason about than single-partition heuristics.
>> >> >> > What do you think of this alternative?
>> >> >> >
>> >> >> > JR2:
>> >> >> > Yes, the current default partition assignment strategy will need
>> some
>> >> >> > improvement. This problem with Diskless WAL segments is analogous
>> to
>> >> the
>> >> >> > Classic topics’ dense inter-broker connection graph.
>> >> >> > The natural solution to this seems to be some sort of cellular
>> >> design,
>> >> >> > where the replica placements tend to locate partitions in similar
>> >> >> groups.
>> >> >> > Partitions in the same cell can generally share the same WAL
>> Segments
>> >> >> and
>> >> >> > the same Diskless Coordinator requests. This would also benefit
>> >> Classic
>> >> >> > topics, which would need fewer connections and fetch requests.
>> >> >> > Such a feature is out-of-scope of this KIP, and either we will
>> >> publish a
>> >> >> > follow-up KIP, or let operators and community tooling address
>> this.
>> >> >> >
>> >> >> > JR3:
>> >> >> > Yes we will replace the ISR/ELR election logic for diskless
>> topics,
>> >> as
>> >> >> > they no longer rely on replicas for data integrity. We will fully
>> >> model
>> >> >> the
>> >> >> > state/lifecycle of the diskless replicas in KRaft, and choose how
>> we
>> >> >> > display this to clients.
>> >> >> > For backwards compatibility, clients using older metadata requests
>> >> >> should
>> >> >> > see diskless topics, but interpret them as classic topics. We
>> could
>> >> tell
>> >> >> > older clients that the leader is in the ISR, even if it just
>> started
>> >> >> > building its cache.
>> >> >> > For clients using the latest metadata, they should see the true
>> >> state of
>> >> >> > the diskless partition: which nodes can accept
>> >> produce/fetch/sharefetch
>> >> >> > requests, which ranges of offsets are cached on-broker, etc. This
>> >> could
>> >> >> > also be used to break apart the “leader” field into more granular
>> >> >> fields,
>> >> >> > now that leadership has changed meaning.
>> >> >> >
>> >> >> > JR4:
>> >> >> > Yes, we can replace the empty fetch requests to the leader nodes
>> with
>> >> >> > cache hint fields in the requests to the Diskless Coordinator, and
>> >> rely
>> >> >> on
>> >> >> > the coordinator to distribute cache hints to all replicas. This
>> >> should
>> >> >> be
>> >> >> > low-overhead, and eliminate the inter-broker communication for
>> >> brokers
>> >> >> > which only host Diskless topics.
>> >> >> >
>> >> >> > JR5.1:
>> >> >> > You are correct and this text was ambiguous, only specifying that
>> the
>> >> >> > controller waits for the sync to be complete. This section is now
>> >> >> updated
>> >> >> > to explicitly say that local segments are built from object
>> storage.
>> >> >> >
>> >> >> > JR5.2:
>> >> >> > Extending the JR2 discussion, reassignment of diskless topics
>> would
>> >> >> > generally happen within a cell, where the marginal cost of
>> reading an
>> >> >> > additional partition is very low. When cells are re-balanced and a
>> >> >> > partition is migrated between cells, there is a brief time (until
>> the
>> >> >> next
>> >> >> > Tiered Storage segment roll) when the marginal cost is doubled.
>> This
>> >> >> should
>> >> >> > be infrequent and well-amortized by other topics which aren’t
>> being
>> >> >> > re-balanced between cells.
>> >> >> >
>> >> >> > JR6.1:
>> >> >> > We plan to move data from Diskless to Tiered Storage. Once the
>> data
>> >> is
>> >> >> in
>> >> >> > Tiered Storage, it can be compacted using the functionality
>> >> described in
>> >> >> > KIP-1272 [1]
>> >> >> >
>> >> >> > JR6.2:
>> >> >> > We will add details for this soon.
>> >> >> >
>> >> >> > JR7:
>> >> >> > We specify the requirement of eventual consistency to allow
>> Diskless
>> >> >> > Topics to be used with other object storage implementations which
>> >> aren’t
>> >> >> > the three major public clouds, such as self-managed software or
>> >> weaker
>> >> >> > consistency caches.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Greg
>> >> >> >
>> >> >> > [1]
>> >> >> >
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1272%3A+Support+compacted+topic+in+tiered+storage
>> >> >> >
>> >> >> > On Fri, Mar 6, 2026 at 4:14 PM Jun Rao via dev <
>> [email protected]
>> >> >
>> >> >> > wrote:
>> >> >> >
>> >> >> >> Hi, Ivan,
>> >> >> >>
>> >> >> >> Thanks for the KIP. A few comments below.
>> >> >> >>
>> >> >> >> JR1. I am concerned about the usage of the current tiered
>> storage to
>> >> >> >> control the number of small WAL files. Current tiered storage
>> only
>> >> >> tiers
>> >> >> >> the data when a segment rolls, which can take hours. This causes
>> >> three
>> >> >> >> problems. (1) Much more metadata needs to be stored and
>> maintained,
>> >> >> which
>> >> >> >> increases the cost. Suppose that each segment rolls every 5
>> hours,
>> >> each
>> >> >> >> partition generates 2 WAL files per second and each WAL file's
>> >> metadata
>> >> >> >> takes 100 bytes. Each partition will generate
>> >> >> >> 5 * 3.6K * 2 * 100 = 3.6MB of
>> >> >> >> metadata. In a cluster with 100K partitions, this translates to
>> >> 360GB
>> >> >> of
>> >> >> >> metadata stored on the diskless coordinators. (2) A catching-up
>> >> >> consumer's
>> >> >> >> performance degrades since it's forced to read data from many
>> small
>> >> WAL
>> >> >> >> files. (3) The data in WAL files could be retained much longer
>> than
>> >> >> >> retention time. Since the small WAL files aren't completely
>> deleted
>> >> >> until
>> >> >> >> all partitions' data in it are obsolete, the deletion of the WAL
>> >> files
>> >> >> >> could be delayed by hours or more. If the WAL file includes a
>> >> partition
>> >> >> >> with a low retention time, the retention contract could be
>> violated
>> >> >> >> significantly. The earlier design of the KIP included a separate
>> >> object
>> >> >> >> merging process that combines small WAL files much more
>> aggressively
>> >> >> than
>> >> >> >> tiered storage, which seems to be a much better choice.
>> >> >> >>
>> >> >> >> JR2. I don't think the current default partition assignment
>> strategy
>> >> >> for
>> >> >> >> classic topics works for diskless topics. Current strategy tries
>> to
>> >> >> spread
>> >> >> >> the replicas to as many brokers as possible. For example, if a
>> >> broker
>> >> >> has
>> >> >> >> 100 partitions, their replicas could be spread over 100 brokers.
>> If
>> >> the
>> >> >> >> broker generates a WAL file with 100 partitions, this WAL file
>> will
>> >> be
>> >> >> >> read
>> >> >> >> 100 times, once by each broker. S3 read cost is 1/12 of the cost
>> of
>> >> S3
>> >> >> >> put.
>> >> >> >> This assignment strategy will increase the S3 cost by about 8X,
>> >> which
>> >> >> is
>> >> >> >> prohibitive. We need to design a cost effective assignment
>> strategy
>> >> for
>> >> >> >> diskless topics.
>> >> >> >>
>> >> >> >> JR3. We need to think through the leader election logic with diskless
>> >> >> >> topics.
>> >> >> >> The KIP tries to reuse the ISR logic for classic topics, but it doesn't
>> >> >> >> seem very natural.
>> >> >> >> JR3.1 In a classic topic, the leader is always in the ISR. In a diskless
>> >> >> >> topic, the KIP says that a leader could be out of sync.
>> >> >> >> JR3.2 The existing leader election logic based on ISR/ELR mainly tries to
>> >> >> >> preserve previously acknowledged data. With diskless topics, since the
>> >> >> >> object store provides durability, this logic no longer seems needed. The
>> >> >> >> existing min.isr and unclean leader election logic also don't apply.
>> >> >> >>
>> >> >> >> JR4. "Despite that there is no inter-broker replication, replicas
>> >> will
>> >> >> >> still issue FetchRequest to leaders. Leaders will respond with
>> empty
>> >> >> (no
>> >> >> >> records) FetchResponse."
>> >> >> >> This seems unnatural. Could we avoid issuing inter broker fetch
>> >> >> requests
>> >> >> >> for diskless topics?
>> >> >> >>
>> >> >> >> JR5. "The replica reassignment will follow the same flow as in
>> >> classic
>> >> >> >> topic:".
>> >> >> >> JR5.1 Is this true? Since the inter-broker fetch response is always
>> >> >> >> empty, it doesn't seem that the current reassignment flow works for
>> >> >> >> diskless topics. Also, since the source of the data is the object store,
>> >> >> >> it seems more natural for a replica to backfill the data from the object
>> >> >> >> store, instead of from other replicas. This will also incur lower costs.
>> >> >> >> JR5.2 How do we prevent reassignment on diskless topics from
>> causing
>> >> >> the
>> >> >> >> same cost issue described in JR2?
>> >> >> >>
>> >> >> >> JR6." In other functional aspects, diskless topics are
>> >> >> indistinguishable
>> >> >> >> from classic topics. This includes durability guarantees,
>> ordering
>> >> >> >> guarantees, transactional and non-transactional producer API,
>> >> consumer
>> >> >> >> API,
>> >> >> >> consumer groups, share groups, data retention (deletion &
>> compact),"
>> >> >> >> JR6.1 Could you describe how compacted diskless topics are
>> supported?
>> >> >> >> JR6.2 Neither this KIP nor KIP-1164 describes the transactional
>> >> >> support in
>> >> >> >> detail.
>> >> >> >>
>> >> >> >> JR7. "Object Storage: A shared, durable, concurrent, and
>> eventually
>> >> >> >> consistent storage supporting arbitrary sized byte values and a
>> >> minimal
>> >> >> >> set
>> >> >> >> of atomic operations: put, delete, list, and ranged get."
>> >> >> >> It seems that the object storage in all three major public clouds is
>> >> >> >> strongly consistent.
>> >> >> >>
>> >> >> >> Jun
>> >> >> >>
>> >> >> >> On Mon, Mar 2, 2026 at 5:43 AM Ivan Yurchenko <[email protected]>
>> >> wrote:
>> >> >> >>
>> >> >> >> > Hi all,
>> >> >> >> >
>> >> >> >> > The parent KIP-1150 was voted for and accepted. Let's now
>> focus on
>> >> >> the
>> >> >> >> > technical details presented in this KIP-1163 and also in
>> KIP-1164:
>> >> >> >> Diskless
>> >> >> >> > Coordinator  [1].
>> >> >> >> >
>> >> >> >> > Best,
>> >> >> >> > Ivan
>> >> >> >> >
>> >> >> >> > [1]
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1164%3A+Diskless+Coordinator
>> >> >> >> >
>> >> >> >> > On Wed, Apr 23, 2025, at 11:41, Ivan Yurchenko wrote:
>> >> >> >> > > Hi all!
>> >> >> >> > >
>> >> >> >> > > We want to start the discussion thread for KIP-1163: Diskless
>> >> Core
>> >> >> >> [1],
>> >> >> >> > which is a sub-KIP for KIP-1150 [2].
>> >> >> >> > >
>> >> >> >> > > Let's use the main KIP-1150 discuss thread [3] for high-level
>> >> >> >> questions,
>> >> >> >> > motivation, and general direction of the feature and this
>> thread
>> >> for
>> >> >> >> > particular details of implementation.
>> >> >> >> > >
>> >> >> >> > > Best,
>> >> >> >> > > Ivan
>> >> >> >> > >
>> >> >> >> > > [1]
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1163%3A+Diskless+Core
>> >> >> >> > > [2]
>> >> >> >> >
>> >> >> >>
>> >> >>
>> >>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1150%3A+Diskless+Topics
>> >> >> >> > > [3]
>> >> >> https://lists.apache.org/thread/ljxc495nf39myp28pmf77sm2xydwjm6d
>> >> >> >> >
>> >> >> >>
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>
