Hi PoAn, Thanks for your very detailed answers!
DJ05: This seems reasonable. We can look into it more deeply during the implementation too.

DJ07: Yeah, I see your point. I don't have a strong opinion on this, so I am happy to go with the majority. Could you please at least mention it in the Rejected Alternatives section? It would be good to have a trace of it.

Best,
David

On Wed, Apr 16, 2025 at 4:23 AM PoAn Yang <yangp...@gmail.com> wrote: > Hi Lucas, > > Thank you. Here is the vote thread > https://lists.apache.org/thread/j90htmphjk783p697jjfg1xt8thmy33p. > > Regards, > PoAn > > > On Apr 16, 2025, at 12:07 AM, Lucas Brutschy > > <lbruts...@confluent.io.INVALID> > wrote: > > > > Hi PoAn, > > > > from my side, we are good to move to a vote on this one. > > > > Cheers, > > Lucas > > > > On Thu, Apr 10, 2025 at 11:29 AM PoAn Yang <yangp...@gmail.com <mailto: > yangp...@gmail.com>> wrote: > >> > >> Hi David and Lucas, > >> > >> Thanks for the review and the great suggestions. > >> > >> DJ01: Added a statement that rack-aware rebalancing was removed in 4.0. > >> > >> DJ02: Changed ConsumerGroupPartitionMetadataKey and > >> ConsumerGroupPartitionMetadataValue from removed to deprecated. > >> > >> DJ03: In my sample implementation < https://github.com/apache/kafka/pull/17444>, the cache is maintained in > GroupCoordinatorManager. > >> A GroupCoordinatorShard only has one GroupCoordinatorManager, so I > think it's similar to > >> a cache per GroupCoordinatorShard instance. > >> > >> DJ04: Added the order in which topic data is combined to compute a > topic hash. > >> > >> DJ05: This is a good question. In the topic hash, Guava uses a "final > mix" to ensure an avalanche > >> effect. However, for the single hash of the subscribed topics in a group, we > cannot use a simple > >> sum, because we need to avoid the edge case where, for example, hash(a) + hash(b) and > hash(b) + hash(a) are the > >> same. To fix this, the hash function sorts topics by name and multiplies each > topic hash > >> by its position value, e.g. hash(a) + 2*hash(b) versus hash(b) + 2*hash(a). I added this > case and the regular expression case to the KIP. > >> > >> DJ06: It's good to add a tombstone for ConsumerGroupPartitionMetadata, so > log compaction > >> can remove the old data. Added this part to the compatibility section. > >> > >> DJ07: In a previous discussion, you mentioned that assignors are sticky. > If the topology of the > >> group is the same, the upgrade/downgrade rebalance doesn't change > anything. > >> > >>>>>>>>>>>> DJ08: In my opinion, changing the format will be rare so it is > >>>>>>>>>>>> acceptable if rebalances are triggered in this case on > >>>>>>>>>>>> upgrade/downgrade. It is also what will happen if a cluster is > >>>>>>>>>>>> downgraded from 4.1 (with this KIP) to 4.0. The rebalance > won't > >>>>>> change > >>>>>>>>>>>> anything if the topology of the group is the same because > >>>> assignors > >>>>>>>>>>>> are sticky. The default ones are and we recommend custom ones > to > >>>>>> also > >>>>>>>>>>>> be. > >> > >> With a `group.version` change, we need to maintain two code paths in 4.1. One > is subscription > >> metadata without rack information and the other is the group hash. Another drawback is > that changing > >> `group.version` will trigger rebalances for all groups at the same time. > IMHO, I would prefer > >> to keep the code simple. > >> > >> DJ08: Added ConsumerGroupHeartbeatRequestTest to the test plan. > >> > >> LB01: Added the removal of StreamsGroupPartitionMetadataKey / > >> StreamsGroupPartitionMetadataValue to the public interface change section. > >> > >> Thank you. 
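To illustrate the DJ05 point above: a minimal sketch of the sorted, position-weighted combination described there might look like the following. The class and method names are illustrative assumptions, not the actual code in the PR, and it assumes the per-topic hashes have already been computed.

    import java.util.Map;
    import java.util.TreeMap;

    public final class GroupHashExample {

        /**
         * Combines per-topic hashes into a single group-level hash by sorting the
         * subscribed topics by name and weighting each topic hash by its position,
         * so that swapping the hashes of two topics changes the result.
         */
        public static long combineGroupHash(Map<String, Long> topicHashes) {
            // TreeMap iterates in ascending topic-name order.
            TreeMap<String, Long> sorted = new TreeMap<>(topicHashes);
            long result = 0;
            int position = 1;
            for (Map.Entry<String, Long> entry : sorted.entrySet()) {
                result += position * entry.getValue();
                position++;
            }
            return result;
        }

        public static void main(String[] args) {
            // hash(a) + 2 * hash(b) differs from hash(b) + 2 * hash(a) when the hashes differ.
            System.out.println(combineGroupHash(Map.of("topic-a", 5L, "topic-b", 7L))); // 5 + 2*7 = 19
            System.out.println(combineGroupHash(Map.of("topic-a", 7L, "topic-b", 5L))); // 7 + 2*5 = 17
        }
    }

Sorting by topic name makes the result deterministic, and the position weight makes swapping two topics' hashes produce a different combined value, which a plain sum would not.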
> >> > >> Regards, > >> PoAn > >> > >>> On Apr 9, 2025, at 5:11 PM, Lucas Brutschy > >>> <lbruts...@confluent.io.INVALID> > wrote: > >>> > >>> Hi PoAn, > >>> > >>> LB01: please add the removal of StreamsGroupPartitionMetadataKey / > >>> StreamsGroupPartitionMetadataValue to the KIP. Since these were not > >>> yet included in 4.0, we can just remove them as long as this KIP hits > >>> 4.1. I can help with the integration into the `streamsGroupHeartbeat`, > >>> as it will be a little different, but we do not need to call this out > >>> in the KIP. > >>> > >>> Thanks for your work on this! > >>> > >>> Cheers, > >>> Lucas > >>> > >>> On Wed, Apr 9, 2025 at 9:59 AM David Jacot <dja...@confluent.io.invalid> > wrote: > >>>> > >>>> Hi PoAn, > >>>> > >>>> I finally had a bit of time to review your KIP. First, let me say > that the > >>>> KIP is well written! I have some more comments: > >>>> > >>>> DJ01: In the motivation, could you please mention that we actually > removed > >>>> triggering rebalances based on rack changes in 4.0? You explained > very well > >>>> why we removed it but it is not clear that we removed it. > >>>> > >>>> DJ02: We cannot delete ConsumerGroupPartitionMetadataKey > >>>> and ConsumerGroupPartitionMetadataValue because the coordinator must > still > >>>> be able to read the record from the log when the cluster is upgraded. > >>>> However, we can deprecate them. > >>>> > >>>> DJ03: Could you clarify how the cache will be scoped? My > understanding is > >>>> that we will maintain a cache per GroupCoordinatorShard instance. Is > this > >>>> correct? > >>>> > >>>> DJ04: Regarding the topic hash function, would it be possible to > specify > >>>> how the hash will be computed instead of providing the signature of > the > >>>> method? > >>>> > >>>> DJ05: I have a general question about how you propose to combine > hashes. Is > >>>> using a sum enough? I was looking at the Guava implementation and > they also > >>>> apply a "final mix" to ensure an avalanche effect. It would be great > to > >>>> elaborate a bit more on this in the KIP. > >>>> > >>>> DJ06: After the upgrade, we need to ensure that a tombstone is > written for > >>>> the ConsumerGroupPartitionMetadata record if it was present. I > suppose that > >>>> this is something that we could do when the next metadata hash is > computed. > >>>> Have you thought about it? > >>>> > >>>> DJ07: I wonder whether we should gate the change with > `group.version`. My > >>>> concern is that during upgrade, not all the coordinators will support > the > >>>> new way and groups may bounce between those coordinators as their > >>>> underlying __consumer_offsets partitions are moved within the cluster > >>>> (during the roll). This may trigger unnecessary rebalances. One way to > >>>> avoid this is to gate the change with `group.version` which is only > bumped > >>>> at the end of the cluster upgrade process. What do you think? > >>>> > >>>> DJ08: Regarding the test plan, we should also extend > >>>> ConsumerGroupHeartbeatRequestTest to cover the new feature. > >>>> > >>>> Thanks for your hard work on this one. Let's try to get it in 4.1. > >>>> > >>>> Best, > >>>> David > >>>> > >>>> On Mon, Mar 24, 2025 at 8:17 PM Jeff Kim > <jeff....@confluent.io.invalid> > >>>> wrote: > >>>> > >>>>> Hi PoAn, > >>>>> > >>>>> Thanks for the clarification! > >>>>> > >>>>> Best, > >>>>> Jeff > >>>>> > >>>>> On Mon, Mar 24, 2025 at 11:43 AM PoAn Yang <yangp...@gmail.com> > wrote: > >>>>> > >>>>>> Hi Jeff, > >>>>>> > >>>>>> Thanks for the review. 
> >>>>>> > >>>>>> JK01. Yes, we already maintain a `groupByTopics` data structure in the > >>>>>> coordinator. If a group leaves, > >>>>>> the coordinator checks whether the topic is still subscribed by any group. If it's not, the > >>>>>> topic hash is removed from the cache. > >>>>>> Please check the sample implementation of > >>>>>> GroupMetadataManager#unsubscribeGroupFromTopic > >>>>>> in #17444 <https://github.com/apache/kafka/pull/17444>. > >>>>>> > >>>>>> JK02: This KIP replaces subscription metadata with a group hash, so > the > >>>>>> coordinator calculates the group > >>>>>> hash where it previously calculated subscription metadata. The trigger points > are > >>>>>> an updated group member, > >>>>>> a new group member, new subscribed topic names, new regular > expressions, > >>>>>> and new metadata (a topic change). > >>>>>> > >>>>>> Best Regards, > >>>>>> PoAn > >>>>>> > >>>>>>> On Mar 22, 2025, at 3:25 AM, Jeff Kim > <jeff....@confluent.io.INVALID> > >>>>>> wrote: > >>>>>>> > >>>>>>> Hi PoAn, > >>>>>>> > >>>>>>> I was late to the discussion but I had some questions. > >>>>>>> > >>>>>>> JK01. I foresee scenarios where the cache leaves stale topic hashes > >>>>>> around > >>>>>>> when a group leaves. > >>>>>>> Should we maintain a counter for the number of groups subscribed > to the > >>>>>>> topic (hash) to only keep active topics? > >>>>>>> > >>>>>>> JK02. Can you help me understand when we would actually be > computing > >>>>>>> group-level hashes? Is it on every > >>>>>>> heartbeat? > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Jeff > >>>>>>> > >>>>>>> On Wed, Mar 12, 2025 at 11:33 PM PoAn Yang <yangp...@gmail.com> > wrote: > >>>>>>> > >>>>>>>> Hi David, > >>>>>>>> > >>>>>>>> Yes, I will continue working on this. > >>>>>>>> > >>>>>>>> DJ02: As Kirk noted earlier, since the topic hash calculation is > on the > >>>>>> server > >>>>>>>> side (not the client side), we can safely introduce Guava as a > dependency. > >>>>>>>> > >>>>>>>> If there is no other discussion, we can start voting in > >>>>>>>> https://lists.apache.org/thread/j90htmphjk783p697jjfg1xt8thmy33p. > We > >>>>>> can > >>>>>>>> discuss implementation details in the PR > >>>>>>>> https://github.com/apache/kafka/pull/17444. I will resolve merge > >>>>>>>> conflicts by March 14th. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> PoAn > >>>>>>>> > >>>>>>>>> On Mar 13, 2025, at 12:43 AM, David Jacot <david.ja...@gmail.com > > > >>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hi PoAn, > >>>>>>>>> > >>>>>>>>> As we are getting closer to releasing 4.0, I think that we can > resume > >>>>>>>>> working on this one if you are still interested. Are you? I would > >>>>> like > >>>>>>>>> to have it in 4.1. > >>>>>>>>> > >>>>>>>>> Best, > >>>>>>>>> David > >>>>>>>>> > >>>>>>>>> On Wed, Jan 15, 2025 at 1:26 AM Kirk True <k...@kirktrue.pro> > wrote: > >>>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> Hopefully a quick question... > >>>>>>>>>> > >>>>>>>>>> KT01. Will clients calculate the topic hash on the client? > Based on > >>>>>> the > >>>>>>>> current state of the KIP and PR, I would have thought "no", but I > ask > >>>>>> based > >>>>>>>> on the discussion around the possible use of Guava on the client. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Kirk > >>>>>>>>>> > >>>>>>>>>> On Mon, Jan 6, 2025, at 9:11 AM, David Jacot wrote: > >>>>>>>>>>> Hi PoAn, > >>>>>>>>>>> > >>>>>>>>>>> Thanks for the update. I haven't read the updated KIP yet. > >>>>>>>>>>> > >>>>>>>>>>> DJ02: I am not sure about using Guava as a dependency. 
I > mentioned > it > more > >>>>>>>>>>> as an inspiration/reference. I suppose that we could use it on > the > >>>>>>>> server > >>>>>>>>>>> but we should definitely not use it on the client. I am not > sure > >>>>> how > >>>>>>>> others > >>>>>>>>>>> feel about it. > >>>>>>>>>>> > >>>>>>>>>>> Best, > >>>>>>>>>>> David > >>>>>>>>>>> > >>>>>>>>>>> On Mon, Jan 6, 2025 at 5:21 AM PoAn Yang <yangp...@gmail.com> > >>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Chia-Ping / David / Lucas, > >>>>>>>>>>>> > >>>>>>>>>>>> Happy new year and thanks for the review. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ02: Thanks for the suggestion. I updated the PR to use > Guava. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ03: Yes, I updated the description to mention the ISR change, > >>>>>>>>>>>> added the altering-partition-reassignment case, and mentioned that > >>>>>>>>>>>> an unrelated topic change doesn't trigger a rebalance. > >>>>>>>>>>>> DJ03.1: Yes, I will keep using > ModernGroup#requestMetadataRefresh > >>>>>>>>>>>> to notify groups. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ06: Updated the PR to use the Guava Hashing#combineUnordered > >>>>>>>>>>>> function to combine topic hashes. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ07: Renamed it to MetadataHash. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ08: Added a sample hash function to the KIP and used the first > byte > >>>>> as a > >>>>>>>> magic > >>>>>>>>>>>> byte. This is also included in the latest PR. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ09: Added two paragraphs about upgrade and downgrade. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ10: According to Lucas's comment, I added the > >>>>> StreamsGroupMetadataValue > >>>>>>>> update > >>>>>>>>>>>> to this KIP. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> PoAn > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> On Dec 20, 2024, at 3:58 PM, Chia-Ping Tsai < chia7...@gmail.com> > >>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> because assignors are sticky. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I forgot about that spec again :( > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> David Jacot <david.ja...@gmail.com> 於 2024年12月20日 週五 > 下午3:41寫道: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Chia-Ping, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> DJ08: In my opinion, changing the format will be rare so it > is > >>>>>>>>>>>>>> acceptable if rebalances are triggered in this case on > >>>>>>>>>>>>>> upgrade/downgrade. It is also what will happen if a cluster > is > >>>>>>>>>>>>>> downgraded from 4.1 (with this KIP) to 4.0. The rebalance > won't > >>>>>>>> change > >>>>>>>>>>>>>> anything if the topology of the group is the same because > >>>>>> assignors > >>>>>>>>>>>>>> are sticky. The default ones are and we recommend custom > ones to > >>>>>>>> also > >>>>>>>>>>>>>> be. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>> David > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Fri, Dec 20, 2024 at 2:11 AM Chia-Ping Tsai < > >>>>>> chia7...@apache.org > >>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ummm, it does not work for downgrade as the old > coordinator has > >>>>>> no > >>>>>>>> idea > >>>>>>>>>>>>>> about the new format :( > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 2024/12/20 00:57:27 Chia-Ping Tsai wrote: > >>>>>>>>>>>>>>>> hi David > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ08: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> That's a good question. If the "hash" lacks version > control, > >>>>> it > >>>>>>>> could > >>>>>>>>>>>>>> trigger a series of unnecessary rebalances. 
However, adding > >>>>>>>> additional > >>>>>>>>>>>> information ("magic") to the hash does not help the upgraded > >>>>>>>> coordinator > >>>>>>>>>>>> determine the "version." This means that the upgraded > >>>>> coordinator > >>>>>>>> would > >>>>>>>>>>>> still trigger unnecessary rebalances because it has no way > to > >>>>> know > >>>>>>>> which > >>>>>>>>>>>> format to use when comparing the hash. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Perhaps we can add a new field to > ConsumerGroupMetadataValue > >>>>> to > >>>>>>>>>>>>>> indicate the version of the "hash." This would allow the > >>>>>>>> coordinator, > >>>>>>>>>>>> when > >>>>>>>>>>>>>> handling subscription metadata, to compute the old hash and > >>>>>>>> determine > >>>>>>>>>>>>>> whether an epoch bump is necessary. Additionally, the > >>>>> coordinator > >>>>>>>> can > >>>>>>>>>>>>>> generate a new record to upgrade the hash without requiring > an > >>>>>> epoch > >>>>>>>>>>>> bump. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Another issue is whether the coordinator should cache all > >>>>>>>> versions of > >>>>>>>>>>>>>> the hash. I believe this is necessary; otherwise, during an > >>>>>> upgrade, > >>>>>>>>>>>> there > >>>>>>>>>>>>>> would be extensive recomputing of old hashes. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I believe this idea should also work for downgrades, and > >>>>> that's > >>>>>>>> just > >>>>>>>>>>>>>> my two cents. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>> Chia-Ping > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 2024/12/19 14:39:41 David Jacot wrote: > >>>>>>>>>>>>>>>>> Hi PoAn and Chia-Ping, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks for your responses. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ02: Sorry, I was not clear. I was wondering whether we > >>>>> could > >>>>>>>>>>>>>> compute the > >>>>>>>>>>>>>>>>> hash without having to convert to bytes first. Guava > has a > >>>>>> nice > >>>>>>>>>>>>>> interface > >>>>>>>>>>>>>>>>> for this that allows incrementally adding primitive types > to the > >>>>>>>> hash. > >>>>>>>>>>>>>> We can > >>>>>>>>>>>>>>>>> discuss this in the PR as it is an implementation detail. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ03: Thanks. I don't think that the replicas are updated > >>>>> when > >>>>>> a > >>>>>>>>>>>>>> broker > >>>>>>>>>>>>>>>>> shuts down. What you said applies to the ISR. I suppose > that > >>>>> we > >>>>>>>> can > >>>>>>>>>>>>>> rely on > >>>>>>>>>>>>>>>>> the ISR changes to trigger updates. It is also worth > noting > >>>>>>>>>>>>>>>>> that TopicsDelta#changedTopics is updated for every > change > >>>>>> (e.g. > >>>>>>>> ISR > >>>>>>>>>>>>>>>>> change, leader change, replicas change, etc.). I suppose > that > >>>>>> it > >>>>>>>> is > >>>>>>>>>>>>>> OK but > >>>>>>>>>>>>>>>>> it seems that it will trigger refreshes which are not > >>>>>> necessary. > >>>>>>>>>>>>>> However, a > >>>>>>>>>>>>>>>>> rebalance won't be triggered because the hash won't > change. > >>>>>>>>>>>>>>>>> DJ03.1: I suppose that we will continue to rely on > >>>>>>>>>>>>>>>>> ModernGroup#requestMetadataRefresh to notify groups that > must > >>>>>>>>>>>>>> refresh their > >>>>>>>>>>>>>>>>> hashes. Is my understanding correct? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ05: Fair enough. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ06: You mention in two places that you would like to > >>>>> combine > >>>>>>>>>>>>>> hashes by > >>>>>>>>>>>>>>>>> adding them. I wonder if this is a good practice. 
> >>>>>>>> Intuitively, > >>>>>>>>>>>> I would > >>>>>>>>>>>>>>>>> have used XOR or hashed the hashes. Guava has a method > for > >>>>>>>> combining > >>>>>>>>>>>>>>>>> hashes. It may be worth looking into the algorithm used. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ07: I would rename "AllTopicHash" to "MetadataHash" in > >>>>> order > >>>>>>>> to be > >>>>>>>>>>>>>> more > >>>>>>>>>>>>>>>>> generic. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ08: Regarding the per-topic hash, I wonder whether we > >>>>> should > >>>>>>>>>>>>>> specify in > >>>>>>>>>>>>>>>>> the KIP how we will compute it. I had the following in > mind: > >>>>>>>>>>>>>>>>> hash(topicName; numPartitions; [partitionId;sorted > racks]). > >>>>> We > >>>>>>>> could > >>>>>>>>>>>>>> also > >>>>>>>>>>>>>>>>> add a magic byte as the first element as a sort of > version. I > >>>>>> am > >>>>>>>> not > >>>>>>>>>>>>>> sure > >>>>>>>>>>>>>>>>> whether it is needed though. I was thinking about this > while > >>>>>>>>>>>>>> imagining how > >>>>>>>>>>>>>>>>> we would handle changing the format in the future. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ09: It would be great if we could provide more details > >>>>> about > >>>>>>>>>>>>>> backward > >>>>>>>>>>>>>>>>> compatibility. What happens when the cluster is upgraded > or > >>>>>>>>>>>>>> downgraded? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ10: We should update KIP-1071. It may be worth pinging > them > >>>>>> in > >>>>>>>> the > >>>>>>>>>>>>>>>>> discussion thread of KIP-1071. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>> David > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Tue, Dec 17, 2024 at 9:25 AM PoAn Yang < > >>>>> yangp...@gmail.com> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Chia-Ping / David / Andrew, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks for the review and suggestions. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ01: Removed all implementation details. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ02: Does the "incrementally" mean that we only > calculate > >>>>> the > >>>>>>>>>>>>>> changed > >>>>>>>>>>>>>>>>>> parts? > >>>>>>>>>>>>>>>>>> For example, if the number of partitions changes, we only > >>>>>>>> calculate > >>>>>>>>>>>>>> the hash > >>>>>>>>>>>>>>>>>> of the number of partitions and reconstruct the topic > hash from it. > >>>>>>>>>>>>>>>>>> IMO, we only calculate the topic hash once. With the cache > >>>>>>>> mechanism, > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> value can be reused across different groups on the same > broker. > >>>>>>>>>>>>>>>>>> The CPU usage for this part is not very high. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ03: Added the update path to the KIP for both cases. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ04: Yes, it's a good idea. With the cache mechanism and a single > >>>>>>>> hash > >>>>>>>>>>>>>> per > >>>>>>>>>>>>>>>>>> group, we can balance CPU and disk usage. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ05: Currently, the topic hash is only used in the > coordinator. > >>>>>>>>>>>>>> However, the > >>>>>>>>>>>>>>>>>> metadata image is used in many different places. > >>>>>>>>>>>>>>>>>> How about we move the hash to the metadata image when we > find > >>>>> more > >>>>>>>> use > >>>>>>>>>>>>>> cases? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> AS1, AS2: Thanks for the reminder. I will simply delete > >>>>>>>>>>>>>>>>>> ShareGroupPartitionMetadataKey/Value and add a new > field to > >>>>>>>>>>>>>>>>>> ShareGroupMetadataValue. 
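To make the DJ08 proposal above concrete, a hedged sketch of the per-topic hash layout (magic byte, topic name, number of partitions, then each partition id with its sorted racks) using Guava's Murmur3 could look like this. The method name and the rack representation are assumptions for illustration only, not the KIP's final definition:

    import com.google.common.hash.Hasher;
    import com.google.common.hash.Hashing;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public final class TopicHashExample {

        // Sketch of hash(magic; topicName; numPartitions; [partitionId; sorted racks]).
        public static long topicHash(String topicName, List<List<String>> racksByPartition) {
            Hasher hasher = Hashing.murmur3_128().newHasher()
                .putByte((byte) 0)                                   // magic byte as a format version
                .putString(topicName, StandardCharsets.UTF_8)        // topic name
                .putInt(racksByPartition.size());                    // number of partitions
            for (int partitionId = 0; partitionId < racksByPartition.size(); partitionId++) {
                hasher.putInt(partitionId);                          // partition id
                for (String rack : racksByPartition.get(partitionId).stream().sorted().toList()) {
                    hasher.putString(rack, StandardCharsets.UTF_8);  // sorted racks of this partition
                }
            }
            return hasher.hash().asLong();
        }
    }

The leading magic byte acts as a version marker, so a future change to the layout can be recognized without guessing which format produced a stored hash.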
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>> PoAn > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Dec 17, 2024, at 5:50 AM, Andrew Schofield < > >>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Hi PoAn, > >>>>>>>>>>>>>>>>>>> Thanks for the KIP. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> AS1: From the point of view of share groups, the API > and > >>>>>> record > >>>>>>>>>>>>>> schema > >>>>>>>>>>>>>>>>>>> definitions are unstable in AK 4.0. In AK 4.1, we will > >>>>> start > >>>>>>>>>>>>>> supporting > >>>>>>>>>>>>>>>>>> proper > >>>>>>>>>>>>>>>>>>> versioning. As a result, I think you do not need to > >>>>> deprecate > >>>>>>>>>>>>>> the fields > >>>>>>>>>>>>>>>>>> in the > >>>>>>>>>>>>>>>>>>> ShareGroupPartitionMetadataValue. Just include the > schema > >>>>> for > >>>>>>>>>>>>>> the fields > >>>>>>>>>>>>>>>>>>> which are actually needed, and I'll update the schema > in > >>>>> the > >>>>>>>>>>>>>> code when > >>>>>>>>>>>>>>>>>>> the KIP is implemented. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> AS2: In the event that DJ04 actually removes the need > for > >>>>>>>>>>>>>>>>>>> ConsumerGroupPartitionMetadataKey/Value entirely, I > would > >>>>>>>> simply > >>>>>>>>>>>>>>>>>>> delete ShareGroupPartitionMetadataKey/Value, assuming > that > >>>>> it > >>>>>>>> is > >>>>>>>>>>>>>>>>>>> accepted in time for AK 4.1. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>> Andrew > >>>>>>>>>>>>>>>>>>> ________________________________________ > >>>>>>>>>>>>>>>>>>> From: Chia-Ping Tsai <chia7...@apache.org> > >>>>>>>>>>>>>>>>>>> Sent: 16 December 2024 16:27 > >>>>>>>>>>>>>>>>>>> To: dev@kafka.apache.org <dev@kafka.apache.org> > >>>>>>>>>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-1101: Trigger rebalance on > rack > >>>>>>>>>>>>>> topology > >>>>>>>>>>>>>>>>>> changes > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> hi David > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> DJ05 > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> One of the benefits of having a single hash per group > >>>>> (DJ04) > >>>>>> is > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> reduction in the size of stored data. Additionally, the > cost > >>>>>> of > >>>>>>>>>>>>>>>>>> re-computing can be minimized thanks to caching. So I'm > + 1 > >>>>> to > >>>>>>>>>>>>>> DJ04. > >>>>>>>>>>>>>>>>>> However, the advantage of storing the topic cache in the > >>>>>>>> metadata > >>>>>>>>>>>>>> image is > >>>>>>>>>>>>>>>>>> somewhat unclear to me. Could you please provide more > >>>>> details > >>>>>> on > >>>>>>>>>>>>>> what you > >>>>>>>>>>>>>>>>>> mean by "tight"?Furthermore, since the metadata image > is a > >>>>>>>>>>>>>> thread-safe > >>>>>>>>>>>>>>>>>> object, we need to ensure that the lazy initialization > is > >>>>> also > >>>>>>>>>>>>>> thread-safe. > >>>>>>>>>>>>>>>>>> If no other components require the cache, it would be > better > >>>>>> to > >>>>>>>>>>>>>> keep the > >>>>>>>>>>>>>>>>>> caches within the coordinator. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>> Chia-Ping > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 2024/12/16 14:01:35 David Jacot wrote: > >>>>>>>>>>>>>>>>>>>> Hi PoAn, > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Thanks for the KIP. I have some comments about it. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> DJ01: Please, remove all the code from the KIP. 
We > only > care > >>>>>>>>>>>> about > >>>>>>>>>>>>>> public > >>>>>>>>>>>>>>>>>>>> interface changes, not about implementation details. > >>>>>>>>>>>>>>>>>>>> DJ02: Regarding the hash computation, I agree that we > >>>>> should > >>>>>>>> use > >>>>>>>>>>>>>>>>>> Murmur3. > >>>>>>>>>>>>>>>>>>>> However, I don't quite like the implementation that > you > >>>>>>>> shared. > >>>>>>>>>>>>>> I > >>>>>>>>>>>>>>>>>> wonder if > >>>>>>>>>>>>>>>>>>>> we could make it work incrementally instead of > computing a > >>>>>>>> hash > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>>>> everything and combining them. > >>>>>>>>>>>>>>>>>>>> DJ03: Regarding the cache, my understanding is that > the > >>>>>> cache > >>>>>>>> is > >>>>>>>>>>>>>>>>>> populated > >>>>>>>>>>>>>>>>>>>> when a topic without a hash is seen in a HB request and > the > >>>>>>>> cache > >>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>> cleaned > >>>>>>>>>>>>>>>>>>>> up when topics are deleted based on the metadata > image. > >>>>>>>>>>>>>> However, the > >>>>>>>>>>>>>>>>>> update > >>>>>>>>>>>>>>>>>>>> path is not clear. Let's say that a partition is > added to > >>>>> a > >>>>>>>>>>>>>> topic, how > >>>>>>>>>>>>>>>>>> does > >>>>>>>>>>>>>>>>>>>> it detect it? Let's also imagine that the racks of a > >>>>>> partition > >>>>>>>>>>>>>> have > >>>>>>>>>>>>>>>>>>>> changed, how does it detect it? In the KIP, it would > be > >>>>> nice > >>>>>>>> to > >>>>>>>>>>>>>> be clear > >>>>>>>>>>>>>>>>>>>> about those. > >>>>>>>>>>>>>>>>>>>> DJ04: I wonder whether we should go with a single > hash per > >>>>>>>>>>>>>> group. Your > >>>>>>>>>>>>>>>>>>>> argument against it is that it would require re-computing > >>>>>> the > >>>>>>>>>>>>>> hash of > >>>>>>>>>>>>>>>>>> all > >>>>>>>>>>>>>>>>>>>> the topics when it needs to be computed. In my > opinion, we > >>>>>>>> could > >>>>>>>>>>>>>>>>>> leverage > >>>>>>>>>>>>>>>>>>>> the cached hash per topic to compute the hash of all > the > >>>>>>>>>>>>>> subscribed > >>>>>>>>>>>>>>>>>> ones. > >>>>>>>>>>>>>>>>>>>> We could basically combine all the hashes without > having > >>>>> to > >>>>>>>>>>>>>> compute all > >>>>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>>>> them. This approach has a few benefits. 1) We could > get > >>>>> rid > >>>>>> of > >>>>>>>>>>>>>>>>>>>> the ConsumerGroupPartitionMetadata record as we could > >>>>> store > >>>>>>>> the > >>>>>>>>>>>>>> hash > >>>>>>>>>>>>>>>>>> with > >>>>>>>>>>>>>>>>>>>> the group epoch. 2) We could get rid of the Map that > we > >>>>> keep > >>>>>>>> in > >>>>>>>>>>>>>> each > >>>>>>>>>>>>>>>>>> group > >>>>>>>>>>>>>>>>>>>> to store the hashes corresponding to the subscribed > >>>>> topics. > >>>>>>>>>>>>>>>>>>>> DJ05: Regarding the cache again, I wonder if we should > >>>>>>>> actually > >>>>>>>>>>>>>> store > >>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>>> hash in the metadata image instead of maintaining it > >>>>>> somewhere > >>>>>>>>>>>>>> else. We > >>>>>>>>>>>>>>>>>>>> could still lazily compute it. The benefit is that the > >>>>> value > >>>>>>>>>>>>>> would be > >>>>>>>>>>>>>>>>>> tight > >>>>>>>>>>>>>>>>>>>> to the topic. I have not really looked into it. Would > it > >>>>> be > >>>>>> an > >>>>>>>>>>>>>> option? > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> I'll be away for two weeks starting from Saturday. I > >>>>> kindly > >>>>>>>> ask > >>>>>>>>>>>>>> you to > >>>>>>>>>>>>>>>>>> wait > >>>>>>>>>>>>>>>>>>>> on me if we cannot conclude this week. 
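As a rough illustration of the cache lifecycle discussed in DJ03/DJ04 above (and JK01 earlier in the thread), a hypothetical per-GroupCoordinatorShard cache might be shaped like the following. The class and method names are invented for this sketch and are not the PR's GroupCoordinatorManager code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Topic hashes are computed lazily on the heartbeat path, shared across the groups
    // owned by the shard, and dropped once no group subscribes to the topic or the topic
    // changes/disappears in the metadata image.
    public final class TopicHashCache {
        private final Map<String, Long> hashByTopic = new ConcurrentHashMap<>();
        private final Map<String, Integer> subscriberCount = new ConcurrentHashMap<>();

        // Heartbeat path: compute the topic hash only if it is not cached yet.
        public long getOrCompute(String topic, Function<String, Long> hashFn) {
            return hashByTopic.computeIfAbsent(topic, hashFn);
        }

        // A group starts subscribing to the topic.
        public void subscribe(String topic) {
            subscriberCount.merge(topic, 1, Integer::sum);
        }

        // A group stops subscribing; drop the hash once the last subscriber is gone (JK01).
        public void unsubscribe(String topic) {
            Integer remaining = subscriberCount.computeIfPresent(topic, (t, c) -> c > 1 ? c - 1 : null);
            if (remaining == null) {
                hashByTopic.remove(topic);
            }
        }

        // Metadata image reports a topic change or deletion: invalidate so the next
        // heartbeat recomputes the hash.
        public void invalidate(String topic) {
            hashByTopic.remove(topic);
        }
    }
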
> >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>>> David > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2024 at 1:43 PM Frank Yang < > >>>>>> yangp...@gmail.com > >>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Hi Chia-Ping, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Thanks for the review and suggestions. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q0: Added how a rack change happens and how it affects topic > >>>>> partitions. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q1: Added why we need a balance algorithm to the Motivation > >>>>>>>> section. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q2: After checking again, we don't need to update the > cache > >>>>>> when > >>>>>>>>>>>>>> we replay > >>>>>>>>>>>>>>>>>>>>> records. We only need to renew it in the consumer > heartbeat. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q3: Added a new section, "Topic Hash Function". > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Thanks. > >>>>>>>>>>>>>>>>>>>>> PoAn > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On Nov 1, 2024, at 4:39 PM, Chia-Ping Tsai < > >>>>>>>>>>>>>> chia7...@gmail.com> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> hi PoAn > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Thanks for this KIP! > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q0: Could you add more details about `A topic > partition > >>>>>> has > >>>>>>>>>>>>>> rack > >>>>>>>>>>>>>>>>>> change`? > >>>>>>>>>>>>>>>>>>>>>> IIRC, the "rack change" includes both follower and > >>>>> leader, > >>>>>>>>>>>>>> right? > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q1: Could you please add the 'concerns' we > discussed to > >>>>>> the > >>>>>>>>>>>>>> Motivation > >>>>>>>>>>>>>>>>>>>>>> section? This should include topics like > 'computations' > >>>>>> and > >>>>>>>>>>>>>> 'space > >>>>>>>>>>>>>>>>>>>>> usage'. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q2: `The group coordinator can leverage it to add a > new > >>>>>>>> topic > >>>>>>>>>>>>>>>>>> hash.` This > >>>>>>>>>>>>>>>>>>>>>> description seems a bit off to me. Why do we need to > >>>>>> update > >>>>>>>>>>>>>> the cache > >>>>>>>>>>>>>>>>>> at > >>>>>>>>>>>>>>>>>>>>>> this phase? The cache is intended to prevent > duplicate > >>>>>>>>>>>>>> computations > >>>>>>>>>>>>>>>>>>>>> caused > >>>>>>>>>>>>>>>>>>>>>> by heartbeat requests that occur between two > metadata > >>>>>> change > >>>>>>>>>>>>>> events. > >>>>>>>>>>>>>>>>>>>>>> Therefore, we could even remove the changed topics > from > >>>>>>>>>>>>>> caches on a > >>>>>>>>>>>>>>>>>>>>>> metadata change, as the first heartbeat request > would > >>>>>> update > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> caches > >>>>>>>>>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>>>>>>>> all changed topics. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q3: Could you please include a section about the > choice > >>>>> of > >>>>>>>>>>>>>> hash > >>>>>>>>>>>>>>>>>>>>>> implementation? The hash implementation must be > >>>>> consistent > >>>>>>>>>>>>>> across > >>>>>>>>>>>>>>>>>>>>> different > >>>>>>>>>>>>>>>>>>>>>> JDKs, so we use Murmur3 to generate the hash value. 
> >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>>>>> Chia-Ping > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Frank Yang <yangp...@gmail.com> 於 2024年11月1日 週五 > >>>>> 下午3:57寫道: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> I would like to start a discussion thread on > KIP-1101. > >>>>>>>>>>>>>> Trigger > >>>>>>>>>>>>>>>>>> rebalance > >>>>>>>>>>>>>>>>>>>>>>> on rack topology changes. In this KIP, we aim to > use > >>>>> less > >>>>>>>>>>>>>> memory / > >>>>>>>>>>>>>>>>>> disk > >>>>>>>>>>>>>>>>>>>>>>> resources to detect rack changes in the new > >>>>> coordinator. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1101%3A+Trigger+rebalance+on+rack+topology+changes > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Please take a look and feel free to share any > thoughts. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Thanks. > >>>>>>>>>>>>>>>>>>>>>>> PoAn > >