Hi De Gao,

I think that the concept of chunking can be a viable idea in itself. In some
cases the granularity of partitions may be too coarse, although in most cases
this can be managed with repartitioning (adding partitions to increase
distribution) and tiered storage (offloading old and infrequently used data to
cheaper storage).
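
To make that concrete, both of those levers are already plain topic-level
operations today. A minimal sketch with the Admin client (the topic name,
broker address and concrete values are placeholders for illustration, not
anything from the KIP):

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.clients.admin.NewPartitions;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class ExistingLeversSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            try (Admin admin = Admin.create(props)) {
                // Repartitioning: raise the partition count so that new data is
                // spread over more brokers (existing data stays where it is).
                admin.createPartitions(
                    Map.of("my-topic", NewPartitions.increaseTo(24))).all().get();

                // Tiered storage (KIP-405): keep only recent data on local disks
                // and let older segments be offloaded to remote storage.
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                admin.incrementalAlterConfigs(Map.of(topic, List.of(
                    new AlterConfigOp(new ConfigEntry("remote.storage.enable", "true"),
                        AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("local.retention.ms", "86400000"),
                        AlterConfigOp.OpType.SET))))
                    .all().get();
            }
        }
    }
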
I collected my main points below, and I suggest you flesh out the details
that this KIP is missing and strengthen its motivation, because right now I
don't see much benefit. I can see why it could be advantageous from a
theoretical perspective, but I think it would add a lot of complexity to the
log layer and Kafka's internals for little benefit to most users in most
cases.

The points I collected while reading your KIP (and sorry if I missed
something that you already answered):
0. In what use cases is chunking a better choice for users? I think the KIP
lacks user stories and specific use cases that underpin its validity, even if
it may be valid from a theoretical perspective. As I mentioned above, this is
a very complex change in the log layer, so it needs very strong motivation and
needs to point at a real pain point that many users face today.
1. You don't elaborate on how a chunk boundary is defined. Segments have a
segment size and a retention time. Would you define these for chunks as well,
or how else would you limit the size of a chunk? More broadly, I'm interested
in how a chunk translates to a file and how I could configure chunks at the
broker level, topic level, etc. (see the first sketch after this list).
2. Introducing chunks in the metadata may blow up its complexity. As far as I
can see you specify some metadata changes, but I'd like more elaboration on
what they would mean for Kafka. I suspect that storing chunk metadata means
the aggregate size of the metadata grows as the data grows (see the second
sketch after this list). This may not be desirable for everyone.
3. Chunking also has an impact on recovery, replication, and compaction. This
means we have to rethink how we clean the logs: are there any changes to log
compaction or log recovery? I think your KIP needs to address this in detail.
4. The KIP currently doesn't address how your proposal would coexist with
classic partitions. In an upgrade scenario we could assume that a log format
change is introduced with this, but I'm curious what an upgrade would mean and
how it would happen.
5. You also say that a consumer will essentially reset to the most recent
data. This, however, amounts to a production outage scenario where clients can
instantly miss data on an upgrade. I don't think this is desirable at all, and
we must avoid it.
6. Are there any client-side protocol changes needed?
7. How do chunks work with tiered storage? As I see in the motivation section,
the two designs can coexist from a theoretical perspective, but I'm missing
from your KIP what interface changes or bigger implementation changes are
needed.
8. What happens to partition reassignment and the tools that already build on
it? Many companies use Cruise Control and may have other similar home-grown
systems. Is your change compatible with the partition reassignment protocol?
(See the last sketch after this list.)
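
For point 1, this is the kind of per-topic configuration I have in mind. Today
segments are bounded roughly like this, and I would expect chunks to need
analogous knobs; the "chunk.bytes" name in the comment is purely hypothetical
and not something the KIP defines:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class ChunkConfigSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            try (Admin admin = Admin.create(props)) {
                ConfigResource topic =
                    new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                admin.incrementalAlterConfigs(Map.of(topic, List.of(
                    // Existing segment bounds: roll at 1 GiB or after 7 days.
                    new AlterConfigOp(new ConfigEntry("segment.bytes", "1073741824"),
                        AlterConfigOp.OpType.SET),
                    new AlterConfigOp(new ConfigEntry("segment.ms", "604800000"),
                        AlterConfigOp.OpType.SET))))
                    .all().get();
                // Hypothetical chunk-level equivalent the KIP would need to define,
                // e.g. a "chunk.bytes" topic config -- not an existing Kafka config.
            }
        }
    }
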
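For point 2, a quick back-of-envelope sketch of why I worry about the growth;
every number here (chunk size, record size, partition size) is my own
assumption, not something stated in the KIP:

    public class ChunkMetadataEstimate {
        public static void main(String[] args) {
            long partitionBytes = 10L * 1024 * 1024 * 1024 * 1024; // 10 TiB partition (assumed)
            long chunkBytes = 1024L * 1024 * 1024;                 // 1 GiB per chunk (assumed)
            long bytesPerChunkRecord = 100L;                       // metadata per chunk (assumed)

            long chunks = partitionBytes / chunkBytes;             // 10,240 chunks
            long metadataBytes = chunks * bytesPerChunkRecord;     // ~1 MiB per partition

            // Today a partition's assignment metadata is roughly constant in size;
            // with chunks it would grow linearly with the retained data.
            System.out.printf("chunks=%d, chunk metadata ~ %d KiB%n",
                chunks, metadataBytes / 1024);
        }
    }
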
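And for point 8, today the reassignment API (which Cruise Control and similar
tools ultimately drive) can only express "move this whole partition". A minimal
sketch of the current shape, with made-up topic and broker IDs:

    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.NewPartitionReassignment;
    import org.apache.kafka.common.TopicPartition;

    import java.util.List;
    import java.util.Map;
    import java.util.Optional;
    import java.util.Properties;

    public class ReassignmentSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder address
            try (Admin admin = Admin.create(props)) {
                // Today a reassignment moves the whole partition: every byte of
                // my-topic-0 goes to brokers 1, 2 and 3. There is no way to target
                // only part of the data (e.g. a single chunk), so chunk-level
                // assignment would need a protocol and tooling extension.
                admin.alterPartitionReassignments(Map.of(
                    new TopicPartition("my-topic", 0),
                    Optional.of(new NewPartitionReassignment(List.of(1, 2, 3)))))
                    .all().get();
            }
        }
    }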

Best,
Viktor

On Tue, Feb 18, 2025 at 10:03 PM De Gao <d...@live.co.uk> wrote:

> Hi Greg:
>
> I see you are evaluating this KIP from a very practical point of view.
> With tiered storage already merged in, it is hard to accept another
> improvement that provides a similar solution, even if it is better by design
> and enables potential future growth.
> That being said, this is still a community project and the vote is already
> open, so let the community decide which way to go.
> Thank you for your efforts to review this KIP.
>
> De Gao
>
> On 18 February 2025 18:14:27 GMT, Greg Harris <greg.har...@aiven.io.INVALID>
> wrote:
> >Hi De Gao,
> >
> >Thanks for your explanation. It sounds like this feature is appropriate in
> >situations where:
> >
> >1. Low latency to change replicas is a high priority
> >2. External storage is unavailable or undesirable
> >3. Linear growth in metadata size is acceptable to clients, brokers, and
> >controllers
> >4. Data locality loss and backwards incompatibility are acceptable to
> >clients and brokers
> >
> >I would then question whether there are a sufficient number of users in
> >this situation to justify the complexity of maintaining this as an
> upstream
> >feature.
> >
> >In my experience, external storage becomes very desirable or necessary for
> >practical operation of clusters with large amounts of data.
> >I would strongly recommend that you try Tiered Storage as-is and see if it
> >resolves the operational pain that originally motivated this KIP.
> >If you still find Tiered Storage to be deficient, you can use your
> >experience to strengthen and clarify this KIP.
> >
> >Thanks,
> >Greg
> >
> >On Mon, Feb 17, 2025 at 2:25 PM De Gao <d...@live.co.uk> wrote:
> >
> >> Hi Greg:
> >>
> >> Thank you very much for the review.
> >> Let's do more compares with tiered storage.
> >> I agree that the chunk has a certain functional overlap with tiered
> >> storage, but they are designed from different perspectives. They both want
> >> to address the problem that, over time, the partition data becomes too big
> >> and hard to manage.
> >> The tiered storage design principle is to make the data somebody else's
> >> problem. Data can be provided over an interface, and as long as the
> >> interface is implemented properly the external data can be accessed
> >> seamlessly.
> >> The chunk design principle is to extend the current Kafka data plane's
> >> capability so that it can handle large partition data fully inside Kafka.
> >> No other systems are needed.
> >> The main benefit is to manage the data fully inside the Kafka data plane in
> >> a uniform way, so that not much logic needs to be added and the data can be
> >> scaled linearly, at the cost of metadata complexity (thanks to the KRaft
> >> work for making this a lot easier).
> >> Arguably we could implement a default RemoteStorageManager inside Kafka
> >> using a separate disk system. But 1) as the name indicates, it is meant to
> >> be an external / remote system; 2) if it runs inside the same JVM as Kafka,
> >> certain resources (throughput specifically) need to be coordinated with the
> >> Kafka broker, which increases complexity; and 3) would the data be managed
> >> the same way as partition data or not (e.g. have segments and indices)? If
> >> the same, why is it worth the hassle of adding this additional
> >> RemoteStorageManager layer? If not, why is the same partition data managed
> >> differently inside the same broker?
> >> To clarify, there is no intention to replace tiered storage, and the chunk
> >> is meant to work with it. Please see the "Automatic chunk deletion" section
> >> in the KIP.
> >>
> >> I hope these answer your concerns.
> >> Thanks again for your review!
> >>
> >> De Gao
> >>
> >>
> >> On 16 February 2025 23:43:30 GMT, Greg Harris
> <greg.har...@aiven.io.INVALID>
> >> wrote:
> >> >Hi De Gao,
> >> >
> >> >Thanks for the KIP!
> >> >
> >> >I'd like to re-raise the concerns that David and Justine have made,
> >> >especially the alternative of Tiered Storage and the increase in
> (client)
> >> >metadata complexity.
> >> >I don't think that the KIP contains a satisfactory explanation of why
> this
> >> >change is worth it compared with using Tiered Storage as-is, or with
> >> >marginal improvements as Kamal was suggesting.
> >> >
> >> >> The replicas serve two major function: accepting write and serve
> read.
> >> >But if we observe a replica, we can see that most (not all, I must
> say) of
> >> >the read and write happened only on the end of the replica. This part
> is
> >> >complicated and need to handle with care. The majority part of the
> replica
> >> >are immutable and just serve as a data store (most of the time).
> >> >> But when we manage the replica we manage it as a single piece. Like
> when
> >> >we want to move a replica to a new broker we need to move all the data
> in
> >> >the replica although most of the case we might just interested with
> some
> >> >data at the end.
> >> >> What I am proposing is really to provide an capability to separate
> the
> >> >concern between the data we mostly interested and also complicated to
> >> >manage, with the data we know that are stable and immutable and very
> easy
> >> >to manage.
> >> >
> >> >These statements could all be made in support of Tiered Storage, and
> >> >don't differentiate Chunks and Tiered Storage.
> >> >
> >> >> If we have this in the first place we don't need tiered storage, as
> >> >adding more brokers / disks will easily hold more data.
> >> >
> >> >I don't think this is a useful hypothetical, because Tiered Storage
> exists
> >> >as a currently supported feature that is already merged, and there are
> no
> >> >current plans to deprecate or remove it.
> >> >You should plan for this feature to coexist with Tiered Storage, and
> >> >identify the core value proposition in that situation.
> >> >
> >> >If you're interested in the benefits of Tiered Storage but don't want
> to
> >> >depend on cloud infrastructure, there are self-hosted object storages.
> Or
> >> >if you want Kafka brokers to not depend on an external service, you may
> >> >choose to implement a new RemoteStorageManager with the properties you
> >> want.
> >> >
> >> >Thanks,
> >> >Greg
> >> >
> >> >On Sat, Jan 25, 2025 at 12:55 PM De Gao <d...@live.co.uk> wrote:
> >> >
> >> >> Hi All:
> >> >>
> >> >> I have updated the KIP to be more specific on the motivation based on
> >> the
> >> >> comments.Please review as you can. Appreciated.
> >> >> If no more review to follow I will submit the KIP for vote.
> >> >> Thank you!
> >> >>
> >> >> On 3 January 2025 22:36:06 GMT, De Gao <d...@live.co.uk> wrote:
> >> >> >Thanks for the review.
> >> >> >This is an interesting idea. Indeed this will significantly reduce the
> >> >> >data that needs to be copied. But it may take the full TTL (retention)
> >> >> >time for the new replica to join the ISR. Also we need to consider how
> >> >> >to handle partitions that only purge by data size.
> >> >> >
> >> >> >On 2 January 2025 18:11:49 GMT, Kamal Chandraprakash <
> >> >> kamal.chandraprak...@gmail.com> wrote:
> >> >> >>Hi Deo,
> >> >> >>
> >> >> >>Thanks for the KIP!
> >> >> >>
> >> >> >>"However the limit of messages in a single partition replica is
> very
> >> big.
> >> >> >>This could lead to very big partitions (~TBs). Moving those
> partitions
> >> >> are
> >> >> >>very time consuming and have a big impact on system performance."
> >> >> >>
> >> >> >>One way to do faster rebalance is to have a latest-offset replica
> >> build
> >> >> >>strategy when expanding the replicas for a partition
> >> >> >>and ensure that the expanded replica does not serve as a leader
> until
> >> the
> >> >> >>data in the older nodes expires by retention time/size.
> >> >> >>Currently, Kafka supports only the earliest-offset strategy during
> >> >> >>reassignment. And, this strategy will only work for topics
> >> >> >>with cleanup policy set to "delete".
> >> >> >>
> >> >> >>--
> >> >> >>Kamal
> >> >> >>
> >> >> >>On Thu, Jan 2, 2025 at 10:23 PM David Arthur <mum...@gmail.com>
> >> wrote:
> >> >> >>
> >> >> >>> Hey De Gao, thanks for the KIP!
> >> >> >>>
> >> >> >>> As you’re probably aware, a Partition is a logical construct in
> >> Kafka.
> >> >> A
> >> >> >>> broker hosts a partition which is composed of physical log
> segments.
> >> >> Only
> >> >> >>> the active segment is being written to and the others are
> immutable.
> >> >> The
> >> >> >>> concept of a Chunk sounds quite similar to our log segments.
> >> >> >>>
> >> >> >>> From what I can tell reading the KIP, the main difference is
> that a
> >> >> Chunk
> >> >> >>> can have its own assignment and therefore be replicated across
> >> >> different
> >> >> >>> brokers.
> >> >> >>>
> >> >> >>> > Horizontal scalability: the data was distributed more evenly to
> >> >> brokers
> >> >> >>> in cluster. Also achieving a more flexible resource allocation.
> >> >> >>>
> >> >> >>> I think this is only true in cases where we have a small number
> of
> >> >> >>> partitions with a large amount of data. I have certainly seen
> cases
> >> >> where a
> >> >> >>> small number of partitions can cause trouble with balancing the
> >> >> cluster.
> >> >> >>>
> >> >> >>> The idea of shuffling around older data in order to spread out
> the
> >> >> load is
> >> >> >>> interesting. It does seem like it would increase the complexity
> of
> >> the
> >> >> >>> client a bit when it comes to consuming the old data. Usually the
> >> >> client
> >> >> >>> can just read from a single replica from the beginning of the
> log to
> >> >> the
> >> >> >>> end. With this proposal, the client would need to hop around
> between
> >> >> >>> replicas as it crossed the chunk boundaries.
> >> >> >>>
> >> >> >>> > Better load balancing: The read of partition data, especially
> >> early
> >> >> data
> >> >> >>> can be distributed to more nodes other than just leader nodes.
> >> >> >>>
> >> >> >>> As you know, this is already possible with KIP-392. I guess the
> idea
> >> >> with
> >> >> >>> the chunks is that clients would be reading older data from less
> >> busy
> >> >> >>> brokers (i.e., brokers which are not the leader, or perhaps not
> >> even a
> >> >> >>> follower of the active chunk). I’m not sure this would always
> >> result in
> >> >> >>> better load balancing. It seems a bit situational.
> >> >> >>>
> >> >> >>> > Increased fault tolerance: failure of leader node will not
> impact
> >> >> read
> >> >> >>> older data.
> >> >> >>>
> >> >> >>> I don’t think this proposal changes the fault tolerance. A
> failure
> >> of a
> >> >> >>> leader results in a failover to a follower. If a client is
> consuming
> >> >> using
> >> >> >>> KIP-392, a leader failure will not affect the consumption
> (besides
> >> >> updating
> >> >> >>> the clients metadata).
> >> >> >>>
> >> >> >>> --
> >> >> >>>
> >> >> >>> I guess I'm missing a key point here. What problem is this
> trying to
> >> >> solve?
> >> >> >>> Is it a solution for the "single partition" problem? (i.e., a
> topic
> >> >> with
> >> >> >>> one partition and a lot of data)
> >> >> >>>
> >> >> >>> Thanks!
> >> >> >>> David A
> >> >> >>>
> >> >> >>> On Tue, Dec 31, 2024 at 3:24 PM De Gao <d...@live.co.uk> wrote:
> >> >> >>>
> >> >> >>> > Thanks for the comments. I have updated the proposal to compare
> >> with
> >> >> >>> > tiered storage and fetch from replica. Please check.
> >> >> >>> >
> >> >> >>> > Thanks.
> >> >> >>> >
> >> >> >>> > On 11 December 2024 08:51:43 GMT, David Jacot
> >> >> >>> <dja...@confluent.io.INVALID>
> >> >> >>> > wrote:
> >> >> >>> > >Hi,
> >> >> >>> > >
> >> >> >>> > >Thanks for the KIP. The community is pretty busy with the
> Apache
> >> >> Kafka
> >> >> >>> 4.0
> >> >> >>> > >release so I suppose that no one really had the time to
> engage in
> >> >> >>> > reviewing
> >> >> >>> > >the KIP yet. Sorry for this!
> >> >> >>> > >
> >> >> >>> > >I just read the motivation section. I think that it is an
> >> >> interesting
> >> >> >>> > idea.
> >> >> >>> > >However, I wonder if this is still needed now that we have
> tier
> >> >> storage
> >> >> >>> in
> >> >> >>> > >place. One of the big selling points of tier storage was that
> >> >> clusters
> >> >> >>> > >don't have to replicate tiered data anymore. Could you perhaps
> >> >> extend
> >> >> >>> the
> >> >> >>> > >motivation of the KIP to include tier storage in the
> reflexion?
> >> >> >>> > >
> >> >> >>> > >Best,
> >> >> >>> > >David
> >> >> >>> > >
> >> >> >>> > >On Tue, Dec 10, 2024 at 10:46 PM De Gao <d...@live.co.uk>
> wrote:
> >> >> >>> > >
> >> >> >>> > >> Hi All:
> >> >> >>> > >>
> >> >> >>> > >> There were no discussion in the past week. Just want to
> double
> >> >> check
> >> >> >>> if
> >> >> >>> > I
> >> >> >>> > >> missed anything?
> >> >> >>> > >> What should be the expectations on KIP discussion?
> >> >> >>> > >>
> >> >> >>> > >> Thank you!
> >> >> >>> > >>
> >> >> >>> > >> De Gao
> >> >> >>> > >>
> >> >> >>> > >> On 1 December 2024 19:36:37 GMT, De Gao <d...@live.co.uk>
> >> wrote:
> >> >> >>> > >> >Hi All:
> >> >> >>> > >> >
> >> >> >>> > >> >I would like to start the discussion of KIP-1114
> Introducing
> >> >> Chunk in
> >> >> >>> > >> Partition.
> >> >> >>> > >> >
> >> >> >>> > >> >
> >> >> >>> > >>
> >> >> >>> >
> >> >> >>>
> >> >>
> >>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1114%3A+Introducing+Chunk+in+Partition
> >> >> >>> > >> >This KIP is complicated so I expect discussion will take
> >> longer
> >> >> time.
> >> >> >>> > >> >
> >> >> >>> > >> >Thank you in advance.
> >> >> >>> > >> >
> >> >> >>> > >> >De Gao
> >> >> >>> > >>
> >> >> >>> >
> >> >> >>>
> >> >> >>>
> >> >> >>> --
> >> >> >>> David Arthur
> >> >> >>>
> >> >>
> >>
>
