Hi David,

Thanks for the review. I appreciate it.

DJ05: Sounds good. Let's discuss the details further in the pull request.
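
For reference, a rough sketch of the combination I have in mind (Java; the
class, method, and variable names are only illustrative, not the final
implementation). The subscribed topics are sorted by name and each cached
per-topic hash is weighted by its 1-based position before summing, as
described in the quoted DJ05 answer below:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public final class GroupHashSketch {
    // Combines cached per-topic hashes into a single group-level hash.
    // Sorting by topic name makes the result independent of subscription
    // order, and the position weight avoids the plain-sum edge cases.
    static long combineGroupHash(Map<String, Long> topicHashes) {
        List<String> sortedTopics = new ArrayList<>(topicHashes.keySet());
        sortedTopics.sort(String::compareTo);
        long combined = 0L;
        for (int i = 0; i < sortedTopics.size(); i++) {
            combined += (i + 1L) * topicHashes.get(sortedTopics.get(i));
        }
        return combined;
    }
}

Whether we keep this weighted sum or switch to something like Guava's
Hashing#combineOrdered / combineUnordered is exactly the kind of detail we
can settle in the PR.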

DJ07: Added a “Bump Group Version” section to “Rejected Alternatives”.

Thanks,
PoAn

> On Apr 16, 2025, at 4:34 PM, David Jacot <dja...@confluent.io.INVALID> wrote:
> 
> Hi PoAn,
> 
> Thanks for your very detailed answers!
> 
> DJ05: This seems reasonable. We can look into it more deeply during the
> implementation too.
> 
> DJ07: Yeah, I see your point. I don't have a strict opinion on this so I am
> happy to go with the majority. Could you please at least mention in the
> rejected alternative section? It would be good to have a trace of it.
> 
> Best,
> David
> 
> On Wed, Apr 16, 2025 at 4:23 AM PoAn Yang <yangp...@gmail.com> wrote:
> 
>> Hi Lucas,
>> 
>> Thank you. Here is the vote thread:
>> https://lists.apache.org/thread/j90htmphjk783p697jjfg1xt8thmy33p.
>> 
>> Regards,
>> PoAn
>> 
>>> On Apr 16, 2025, at 12:07 AM, Lucas Brutschy <lbruts...@confluent.io.INVALID> wrote:
>>> 
>>> Hi PoAn,
>>> 
>>> from my side, we are good to move to a vote on this one.
>>> 
>>> Cheers,
>>> Lucas
>>> 
>>> On Thu, Apr 10, 2025 at 11:29 AM PoAn Yang <yangp...@gmail.com> wrote:
>>>> 
>>>> Hi David and Lucas,
>>>> 
>>>> Thanks for the review and great suggestion.
>>>> 
>>>> DJ01: Add a statement that triggering rebalances based on rack changes was removed in 4.0.
>>>> 
>>>> DJ02: Change ConsumerGroupPartitionMetadataKey and
>>>> ConsumerGroupPartitionMetadataValue from removed to deprecated.
>>>> 
>>>> DJ03: In my sample implementation <
>> https://github.com/apache/kafka/pull/17444>, the cache is maintained in
>> GroupCoordinatorManager.
>>>> A GroupCoordinatorShard only has one GroupCoordinatorManager, so I
>> think it’s similar to
>>>> a cache per GroupCoordinatorShard instance.
>>>> 
>>>> DJ04: Add the order in which topic data is combined to compute the
>>>> topic hash.
>>>> 
>>>> DJ05: This is a good question. Within the per-topic hash, Guava already
>>>> applies a “final mix” to ensure an avalanche effect. However, for the
>>>> single hash of all subscribed topics in a group, we cannot use a simple
>>>> sum, because we need to avoid edge cases: for example, hash(a) + hash(b)
>>>> and hash(b) + hash(a) are the same. To avoid this, the hash function
>>>> sorts topics by name and multiplies each topic hash by its position,
>>>> i.e. hash(a) + 2 * hash(b) versus hash(b) + 2 * hash(a). I added this
>>>> case and the regular expression case to the KIP.
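
For illustration, the per-topic hash I have in mind looks roughly like the
sketch below, assuming Guava Murmur3 with a leading magic byte; the names
here are only examples and the KIP has the authoritative field order:

import com.google.common.hash.Hasher;
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class TopicHashSketch {
    // Per-topic hash: magic byte, topic name, partition count, then for each
    // partition its id followed by the (already sorted) replica racks.
    // Murmur3 applies a final mix internally, which gives the avalanche
    // effect mentioned above.
    static long topicHash(String topicName, int numPartitions, Map<Integer, List<String>> sortedRacksByPartition) {
        Hasher hasher = Hashing.murmur3_128().newHasher()
            .putByte((byte) 0) // magic byte, acts as a format version
            .putString(topicName, StandardCharsets.UTF_8)
            .putInt(numPartitions);
        for (Map.Entry<Integer, List<String>> entry : new TreeMap<>(sortedRacksByPartition).entrySet()) {
            hasher.putInt(entry.getKey()); // partition id
            for (String rack : entry.getValue()) {
                hasher.putString(rack, StandardCharsets.UTF_8);
            }
        }
        return hasher.hash().asLong();
    }
}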
>>>> 
>>>> DJ06: It’s good to add a tombstone for ConsumerGroupPartitionMetadata, so
>>>> log compaction can remove the old data. Added this part to the
>>>> compatibility section.
>>>> 
>>>> DJ07: In a previous discussion, you mentioned that assignors are sticky.
>>>> If the topology of the group is the same, the upgrade/downgrade rebalance
>>>> doesn’t change anything.
>>>> 
>>>>>>>>>>>>>> DJ08: In my opinion, changing the format will be rare so it is
>>>>>>>>>>>>>> acceptable if rebalances are triggered in this case on
>>>>>>>>>>>>>> upgrade/downgrade. It is also what will happen if a cluster is
>>>>>>>>>>>>>> downgraded from 4.1 (with this KIP) to 4.0. The rebalance won't
>>>>>>>>>>>>>> change anything if the topology of the group is the same because
>>>>>>>>>>>>>> assignors are sticky. The default ones are and we recommend
>>>>>>>>>>>>>> custom ones to also be.
>>>> 
>>>> With a `group.version` gate, we would need to maintain two code paths in
>>>> 4.1: one for subscription metadata without rack information and another
>>>> for the group hash. Another drawback is that bumping `group.version`
>>>> would trigger a rebalance for all groups at the same time. IMHO, I would
>>>> prefer to keep the code simple.
>>>> 
>>>> DJ08: Add ConsumerGroupHeartbeatRequestTest to the test plan.
>>>> 
>>>> LB01: Add the removal of StreamsGroupPartitionMetadataKey /
>>>> StreamsGroupPartitionMetadataValue to the public interface changes section.
>>>> 
>>>> Thank you.
>>>> 
>>>> Regards,
>>>> PoAn
>>>> 
>>>>> On Apr 9, 2025, at 5:11 PM, Lucas Brutschy <lbruts...@confluent.io.INVALID> wrote:
>>>>> 
>>>>> Hi PoAn,
>>>>> 
>>>>> LB01: please add the removal of StreamsGroupPartitionMetadataKey /
>>>>> StreamsGroupPartitionMetadataValue to the KIP. Since these were not
>>>>> yet included in 4.0, we can just remove them as long as this KIP hits
>>>>> 4.1. I can help with the integration into the `streamsGroupHeartbeat`,
>>>>> as it will be a little different, but we do not need to call this out
>>>>> in the KIP.
>>>>> 
>>>>> Thanks for your work on this!
>>>>> 
>>>>> Cheers,
>>>>> Lucas
>>>>> 
>>>>> On Wed, Apr 9, 2025 at 9:59 AM David Jacot <dja...@confluent.io.invalid> wrote:
>>>>>> 
>>>>>> Hi PoAn,
>>>>>> 
>>>>>> I finally had a bit of time to review your KIP. First, let me say
>> that the
>>>>>> KIP is well written! I have some more comments:
>>>>>> 
>>>>>> DJ01: In the motivation, could you please mention that we actually
>> removed
>>>>>> triggering rebalances based on rack changes in 4.0? You explained
>> very well
>>>>>> why we removed it but it is not clear that we removed it.
>>>>>> 
>>>>>> DJ02: We cannot delete ConsumerGroupPartitionMetadataKey
>>>>>> and ConsumerGroupPartitionMetadataValue because the coordinator must
>> still
>>>>>> be able to read the record from the log when the cluster is upgraded.
>>>>>> However, we can deprecate them.
>>>>>> 
>>>>>> DJ03: Could you clarify how the cache will be scoped? My
>> understanding is
>>>>>> that we will maintain a cache per GroupCoordinatorShard instance. Is
>> this
>>>>>> correct?
>>>>>> 
>>>>>> DJ04: Regarding the topic hash function, would it be possible to
>> specify
>>>>>> how the hash will be computed instead of providing the signature of
>> the
>>>>>> method?
>>>>>> 
>>>>>> DJ05: I have a general question about how you propose to combine
>> hashes. Is
>>>>>> using a sum enough? I was looking at the Guava implementation and
>> they also
>>>>>> apply a "final mix" to ensure an avalanche effect. It would be great
>> to
>>>>>> elaborate a bit more on this in the KIP.
>>>>>> 
>>>>>> DJ06: After the upgrade, we need to ensure that a tombstone is
>> written for
>>>>>> the ConsumerGroupPartitionMetadata record if it was present. I
>> suppose that
>>>>>> this is something that we could do when the next metadata hash is
>> computed.
>>>>>> Have you thought about it?
>>>>>> 
>>>>>> DJ07: I wonder whether we should gate the change with
>> `group.version`. My
>>>>>> concern is that during upgrade, not all the coordinators will support
>> the
>>>>>> new way and groups may bounce between those coordinators as their
>>>>>> underlying __consumer_offsets partitions are moved within the cluster
>>>>>> (during the roll). This may trigger unnecessary rebalances. One way to
>>>>>> avoid this is to gate the change with `group.version` which is only
>> bumped
>>>>>> at the end of the cluster upgrade process. What do you think?
>>>>>> 
>>>>>> DJ08: Regarding the test plan, we should also extend
>>>>>> ConsumerGroupHeartbeatRequestTest to cover the new feature.
>>>>>> 
>>>>>> Thanks for your hard work on this one. Let's try to get it in 4.1.
>>>>>> 
>>>>>> Best,
>>>>>> David
>>>>>> 
>>>>>> On Mon, Mar 24, 2025 at 8:17 PM Jeff Kim <jeff....@confluent.io.invalid> wrote:
>>>>>> 
>>>>>>> Hi PoAn,
>>>>>>> 
>>>>>>> Thanks for the clarification!
>>>>>>> 
>>>>>>> Best,
>>>>>>> Jeff
>>>>>>> 
>>>>>>> On Mon, Mar 24, 2025 at 11:43 AM PoAn Yang <yangp...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi Jeff,
>>>>>>>> 
>>>>>>>> Thanks for the review.
>>>>>>>> 
>>>>>>>> JK01. Yes, we already maintain a `groupByTopics` data structure in the
>>>>>>>> coordinator. When a group leaves, it checks whether each topic is still
>>>>>>>> subscribed by any group. If it’s not, the topic hash is removed from
>>>>>>>> the cache.
>>>>>>>> Please check the sample implementation about
>>>>>>>> GroupMetadataManager#unsubscribeGroupFromTopic
>>>>>>>> in #17444 <https://github.com/apache/kafka/pull/17444>.
>>>>>>>> 
>>>>>>>> JK02: This KIP replaces subscription metadata with a group hash, so the
>>>>>>>> coordinator calculates the group hash wherever we used to calculate
>>>>>>>> subscription metadata. The trigger points are: a group member is
>>>>>>>> updated, a new group member joins, new subscribed topic names, new
>>>>>>>> regular expressions, and new metadata (topic changes).
>>>>>>>> 
>>>>>>>> Best Regards,
>>>>>>>> PoAn
>>>>>>>> 
>>>>>>>>> On Mar 22, 2025, at 3:25 AM, Jeff Kim <jeff....@confluent.io.INVALID> wrote:
>>>>>>>>> 
>>>>>>>>> Hi PoAn,
>>>>>>>>> 
>>>>>>>>> I was late to the discussion but I had some questions.
>>>>>>>>> 
>>>>>>>>> JK01. I foresee scenarios where the cache leaves stale topic hashes
>>>>>>>> around
>>>>>>>>> when a group leaves.
>>>>>>>>> Should we maintain a counter for the number of groups subscribed
>> to the
>>>>>>>>> topic (hash) to only keep active topics?
>>>>>>>>> 
>>>>>>>>> JK02. Can you help me understand when we would actually be
>> computing
>>>>>>>>> group-level hashes, is it on every
>>>>>>>>> heartbeat?
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Jeff
>>>>>>>>> 
>>>>>>>>> On Wed, Mar 12, 2025 at 11:33 PM PoAn Yang <yangp...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi David,
>>>>>>>>>> 
>>>>>>>>>> Yes, I will continue working on this.
>>>>>>>>>> 
>>>>>>>>>> DJ02: As Kirk noted earlier, since the topic hash calculation is
>> on
>>>>>>>> server
>>>>>>>>>> side (not client-side), we can safely introduce Guava as a
>> dependency.
>>>>>>>>>> 
>>>>>>>>>> If there is no other discussion, we can start voting in
>>>>>>>>>> https://lists.apache.org/thread/j90htmphjk783p697jjfg1xt8thmy33p.
>> We
>>>>>>>> can
>>>>>>>>>> discuss implementation details in PR
>>>>>>>>>> https://github.com/apache/kafka/pull/17444. I will resolve merge
>>>>>>>>>> conflicts by March 14th.
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> PoAn
>>>>>>>>>> 
>>>>>>>>>>> On Mar 13, 2025, at 12:43 AM, David Jacot <david.ja...@gmail.com
>>> 
>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> Hi PoAn,
>>>>>>>>>>> 
>>>>>>>>>>> As we are getting closer to releasing 4.0, I think that we can
>> resume
>>>>>>>>>>> working on this one if you are still interested. Are you? I would
>>>>>>> like
>>>>>>>>>>> to have it in 4.1.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> David
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jan 15, 2025 at 1:26 AM Kirk True <k...@kirktrue.pro>
>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>> 
>>>>>>>>>>>> Hopefully a quick question...
>>>>>>>>>>>> 
>>>>>>>>>>>> KT01. Will clients calculate the topic hash on the client?
>> Based on
>>>>>>>> the
>>>>>>>>>> current state of the KIP and PR, I would have thought "no", but I
>> ask
>>>>>>>> based
>>>>>>>>>> on the discussion around the possible use of Guava on client.
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Kirk
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Jan 6, 2025, at 9:11 AM, David Jacot wrote:
>>>>>>>>>>>>> Hi PoAn,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks for the update. I haven't read the updated KIP yet.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> DJ02: I am not sure about using Guava as a dependency. I
>> mentioned
>>>>>>> it
>>>>>>>>>> more
>>>>>>>>>>>>> as an inspiration/reference. I suppose that we could use it on
>> the
>>>>>>>>>> server
>>>>>>>>>>>>> but we should definitely not use it on the client. I am not
>> sure
>>>>>>> how
>>>>>>>>>> others
>>>>>>>>>>>>> feel about it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> David
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Mon, Jan 6, 2025 at 5:21 AM PoAn Yang <yangp...@gmail.com>
>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Chia-Ping / David / Lucas,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Happy new year and thanks for the review.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> DJ02: Thanks for the suggestion. I updated the PR to use
>> Guava.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> DJ03: Yes, I updated the description to mention ISR change,
>>>>>>>>>>>>>> add altering partition reassignment case, and mention that
>>>>>>>>>>>>>> non-related topic change doesn’t trigger a rebalance.
>>>>>>>>>>>>>> DJ03.1: Yes, I will keep using
>> ModernGroup#requestMetadataRefresh
>>>>>>>>>>>>>> to notify group.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> DJ06: Updated the PR to use Guava Hashing#combineUnordered
>>>>>>>>>>>>>> function to combine topic hash.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> DJ07: Renamed it to MetadataHash.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> DJ08: Added a sample hash function to the KIP and use first
>> byte
>>>>>>> as
>>>>>>>>>> magic
>>>>>>>>>>>>>> byte. This is also included in latest PR.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> DJ09: Added two paragraphs about upgraded and downgraded.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> DJ10: According to Lucas’s comment, I add
>>>>>>> StreamsGroupMetadataValue
>>>>>>>>>> update
>>>>>>>>>>>>>> to this KIP.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> PoAn
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Dec 20, 2024, at 3:58 PM, Chia-Ping Tsai <
>> chia7...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> because assignors are sticky.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I forgot about that spec again :(
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> David Jacot <david.ja...@gmail.com> 於 2024年12月20日 週五
>> 下午3:41寫道:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Chia-Ping,
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> DJ08: In my opinion, changing the format will be rare so it
>> is
>>>>>>>>>>>>>>>> acceptable if rebalances are triggered in this case on
>>>>>>>>>>>>>>>> upgrade/downgrade. It is also what will happen if a cluster
>> is
>>>>>>>>>>>>>>>> downgraded from 4.1 (with this KIP) to 4.0. The rebalance
>> won't
>>>>>>>>>> change
>>>>>>>>>>>>>>>> anything if the topology of the group is the same because
>>>>>>>> assignors
>>>>>>>>>>>>>>>> are sticky. The default ones are and we recommend custom
>> ones to
>>>>>>>>>> also
>>>>>>>>>>>>>>>> be.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Dec 20, 2024 at 2:11 AM Chia-Ping Tsai <
>>>>>>>> chia7...@apache.org
>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> ummm, it does not work for downgrade as the old
>> coordinator has
>>>>>>>> no
>>>>>>>>>> idea
>>>>>>>>>>>>>>>> about new format :(
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 2024/12/20 00:57:27 Chia-Ping Tsai wrote:
>>>>>>>>>>>>>>>>>> hi David
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ08:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> That's a good question. If the "hash" lacks version
>> control,
>>>>>>> it
>>>>>>>>>> could
>>>>>>>>>>>>>>>> trigger a series of unnecessary rebalances. However, adding
>>>>>>>>>> additional
>>>>>>>>>>>>>>>> information ("magic") to the hash does not help the upgraded
>>>>>>>>>> coordinator
>>>>>>>>>>>>>>>> determine the "version." This means that the upgraded
>>>>>>> coordinator
>>>>>>>>>> would
>>>>>>>>>>>>>>>> still trigger unnecessary rebalances because it has no way
>> to
>>>>>>> know
>>>>>>>>>> which
>>>>>>>>>>>>>>>> format to use when comparing the hash.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Perhaps we can add a new field to
>> ConsumerGroupMetadataValue
>>>>>>> to
>>>>>>>>>>>>>>>> indicate the version of the "hash." This would allow the
>>>>>>>>>> coordinator,
>>>>>>>>>>>>>> when
>>>>>>>>>>>>>>>> handling subscription metadata, to compute the old hash and
>>>>>>>>>> determine
>>>>>>>>>>>>>>>> whether an epoch bump is necessary. Additionally, the
>>>>>>> coordinator
>>>>>>>>>> can
>>>>>>>>>>>>>>>> generate a new record to upgrade the hash without requiring
>> an
>>>>>>>> epoch
>>>>>>>>>>>>>> bump.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Another issue is whether the coordinator should cache all
>>>>>>>>>> versions of
>>>>>>>>>>>>>>>> the hash. I believe this is necessary; otherwise, during an
>>>>>>>> upgrade,
>>>>>>>>>>>>>> there
>>>>>>>>>>>>>>>> would be extensive recomputing of old hashes.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> I believe this idea should also work for downgrades, and
>>>>>>> that's
>>>>>>>>>> just
>>>>>>>>>>>>>>>> my two cents.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>> Chia-Ping
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On 2024/12/19 14:39:41 David Jacot wrote:
>>>>>>>>>>>>>>>>>>> Hi PoAn and Chia-Ping,
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Thanks for your responses.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ02: Sorry, I was not clear. I was wondering whether we
>>>>>>> could
>>>>>>>>>>>>>>>> compute the
>>>>>>>>>>>>>>>>>>> hash without having to convert to bytes before. Guava
>> has a
>>>>>>>> nice
>>>>>>>>>>>>>>>> interface
>>>>>>>>>>>>>>>>>>> for this allowing to incrementally add primitive types
>> to the
>>>>>>>>>> hash.
>>>>>>>>>>>>>>>> We can
>>>>>>>>>>>>>>>>>>> discuss this in the PR as it is an implementation detail.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ03: Thanks. I don't think that the replicas are updated
>>>>>>> when
>>>>>>>> a
>>>>>>>>>>>>>>>> broker
>>>>>>>>>>>>>>>>>>> shuts down. What you said applies to the ISR. I suppose
>> that
>>>>>>> we
>>>>>>>>>> can
>>>>>>>>>>>>>>>> rely on
>>>>>>>>>>>>>>>>>>> the ISR changes to trigger updates. It is also worth
>> noting
>>>>>>>>>>>>>>>>>>> that TopicsDelta#changedTopics is updated for every
>> change
>>>>>>>> (e.g.
>>>>>>>>>> ISR
>>>>>>>>>>>>>>>>>>> change, leader change, replicas change, etc.). I suppose
>> that
>>>>>>>> it
>>>>>>>>>> is
>>>>>>>>>>>>>>>> OK but
>>>>>>>>>>>>>>>>>>> it seems that it will trigger refreshes which are not
>>>>>>>> necessary.
>>>>>>>>>>>>>>>> However, a
>>>>>>>>>>>>>>>>>>> rebalance won't be triggered because the hash won't
>> change.
>>>>>>>>>>>>>>>>>>> DJ03.1: I suppose that we will continue to rely on
>>>>>>>>>>>>>>>>>>> ModernGroup#requestMetadataRefresh to notify groups that
>> must
>>>>>>>>>>>>>>>> refresh their
>>>>>>>>>>>>>>>>>>> hashes. Is my understanding correct?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ05: Fair enough.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ06: You mention in two places that you would like to
>>>>>>> combine
>>>>>>>>>>>>>>>> hashes by
>>>>>>>>>>>>>>>>>>> additioning them. I wonder if this is a good practice.
>>>>>>>>>> Intuitively,
>>>>>>>>>>>>>>>> I would
>>>>>>>>>>>>>>>>>>> have used XOR or hashed the hashed. Guava has a method
>> for
>>>>>>>>>> combining
>>>>>>>>>>>>>>>>>>> hashes. It may be worth looking into the algorithm used.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ07: I would rename "AllTopicHash" to "MetadataHash" in
>>>>>>> order
>>>>>>>>>> to be
>>>>>>>>>>>>>>>> more
>>>>>>>>>>>>>>>>>>> generic.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ08: Regarding the per topic hash, I wonder whether we
>>>>>>> should
>>>>>>>>>>>>>>>> precise in
>>>>>>>>>>>>>>>>>>> the KIP how we will compute it. I had the following in
>> mind:
>>>>>>>>>>>>>>>>>>> hash(topicName; numPartitions; [partitionId;sorted
>> racks]).
>>>>>>> We
>>>>>>>>>> could
>>>>>>>>>>>>>>>> also
>>>>>>>>>>>>>>>>>>> add a magic byte at the first element as a sort of
>> version. I
>>>>>>>> am
>>>>>>>>>> not
>>>>>>>>>>>>>>>> sure
>>>>>>>>>>>>>>>>>>> whether it is needed though. I was thinking about this
>> while
>>>>>>>>>>>>>>>> imagining how
>>>>>>>>>>>>>>>>>>> we would handle changing the format in the future.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ09: It would be great if we could provide more details
>>>>>>> about
>>>>>>>>>>>>>>>> backward
>>>>>>>>>>>>>>>>>>> compatibility. What happens when the cluster is upgraded
>> or
>>>>>>>>>>>>>>>> downgraded?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> DJ10: We should update KIP-1071. It may be worth pigging
>> them
>>>>>>>> in
>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>> discussion thread of KIP-1071.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Tue, Dec 17, 2024 at 9:25 AM PoAn Yang <
>>>>>>> yangp...@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi Chia-Ping / David / Andrew,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks for the review and suggestions.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> DJ01: Removed all implementation details.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> DJ02: Does the “incrementally” mean that we only
>> calculate
>>>>>>> the
>>>>>>>>>>>>>>>> difference
>>>>>>>>>>>>>>>>>>>> parts?
>>>>>>>>>>>>>>>>>>>> For example, if the number of partition change, we only
>>>>>>>>>> calculate
>>>>>>>>>>>>>>>> the hash
>>>>>>>>>>>>>>>>>>>> of number of partition and reconstruct it to the topic
>> hash.
>>>>>>>>>>>>>>>>>>>> IMO, we only calculate topic hash one time. With cache
>>>>>>>>>> mechanism,
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> value can be reused in different groups on a same
>> broker.
>>>>>>>>>>>>>>>>>>>> The CPU usage for this part is not very high.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> DJ03: Added the update path to KIP for both cases.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> DJ04: Yes, it’s a good idea. With cache mechanism and
>> single
>>>>>>>>>> hash
>>>>>>>>>>>>>>>> per
>>>>>>>>>>>>>>>>>>>> group, we can balance cpu and disk usage.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> DJ05: Currently, the topic hash is only used in
>> coordinator.
>>>>>>>>>>>>>>>> However, the
>>>>>>>>>>>>>>>>>>>> metadata image is used in many different places.
>>>>>>>>>>>>>>>>>>>> How about we move the hash to metadata image when we
>> find
>>>>>>> more
>>>>>>>>>> use
>>>>>>>>>>>>>>>> cases?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> AS1, AS2: Thanks for the reminder. I will simply delete
>>>>>>>>>>>>>>>>>>>> ShareGroupPartitionMetadataKey/Value and add a new
>> field to
>>>>>>>>>>>>>>>>>>>> ShareGroupMetadataValue.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>> PoAn
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On Dec 17, 2024, at 5:50 AM, Andrew Schofield <
>>>>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote:
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Hi PoAn,
>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> AS1: From the point of view of share groups, the API
>> and
>>>>>>>> record
>>>>>>>>>>>>>>>> schema
>>>>>>>>>>>>>>>>>>>>> definitions are unstable in AK 4.0. In AK 4.1, we will
>>>>>>> start
>>>>>>>>>>>>>>>> supporting
>>>>>>>>>>>>>>>>>>>> proper
>>>>>>>>>>>>>>>>>>>>> versioning. As a result, I think you do not need to
>>>>>>> deprecate
>>>>>>>>>>>>>>>> the fields
>>>>>>>>>>>>>>>>>>>> in the
>>>>>>>>>>>>>>>>>>>>> ShareGroupPartitionMetadataValue. Just include the
>> schema
>>>>>>> for
>>>>>>>>>>>>>>>> the fields
>>>>>>>>>>>>>>>>>>>>> which are actually needed, and I'll update the schema
>> in
>>>>>>> the
>>>>>>>>>>>>>>>> code when
>>>>>>>>>>>>>>>>>>>>> the KIP is implemented.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> AS2: In the event that DJ04 actually removes the need
>> for
>>>>>>>>>>>>>>>>>>>>> ConsumerGroupPartitionMetadataKey/Value entirely, I
>> would
>>>>>>>>>> simply
>>>>>>>>>>>>>>>>>>>>> delete ShareGroupPartitionMetadataKey/Value, assuming
>> that
>>>>>>> it
>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>>> accepted in time for AK 4.1.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Andrew
>>>>>>>>>>>>>>>>>>>>> ________________________________________
>>>>>>>>>>>>>>>>>>>>> From: Chia-Ping Tsai <chia7...@apache.org>
>>>>>>>>>>>>>>>>>>>>> Sent: 16 December 2024 16:27
>>>>>>>>>>>>>>>>>>>>> To: dev@kafka.apache.org <dev@kafka.apache.org>
>>>>>>>>>>>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-1101: Trigger rebalance on
>> rack
>>>>>>>>>>>>>>>> topology
>>>>>>>>>>>>>>>>>>>> changes
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> hi David
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> DJ05
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> One of the benefits of having a single hash per group
>>>>>>> (DJ04)
>>>>>>>> is
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> reduction in the size of stored data. Additionally, the
>> cost
>>>>>>>> of
>>>>>>>>>>>>>>>>>>>> re-computing can be minimized thanks to caching. So I'm
>> + 1
>>>>>>> to
>>>>>>>>>>>>>>>> DJ04.
>>>>>>>>>>>>>>>>>>>> However, the advantage of storing the topic cache in the
>>>>>>>>>> metadata
>>>>>>>>>>>>>>>> image is
>>>>>>>>>>>>>>>>>>>> somewhat unclear to me. Could you please provide more
>>>>>>> details
>>>>>>>> on
>>>>>>>>>>>>>>>> what you
>>>>>>>>>>>>>>>>>>>> mean by "tight"?Furthermore, since the metadata image
>> is a
>>>>>>>>>>>>>>>> thread-safe
>>>>>>>>>>>>>>>>>>>> object, we need to ensure that the lazy initialization
>> is
>>>>>>> also
>>>>>>>>>>>>>>>> thread-safe.
>>>>>>>>>>>>>>>>>>>> If no other components require the cache, it would be
>> better
>>>>>>>> to
>>>>>>>>>>>>>>>> keep the
>>>>>>>>>>>>>>>>>>>> caches within the coordinator.
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>> Chia-Ping
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> On 2024/12/16 14:01:35 David Jacot wrote:
>>>>>>>>>>>>>>>>>>>>>> Hi PoAn,
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Thanks for the KIP. I have some comments about it.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> DJ01: Please, remove all the code from the KIP. We
>> only
>>>>>>> care
>>>>>>>>>>>>>>>> about
>>>>>>>>>>>>>>>>>>>> public
>>>>>>>>>>>>>>>>>>>>>> interface changes, not about implementation details.
>>>>>>>>>>>>>>>>>>>>>> DJ02: Regarding the hash computation, I agree that we
>>>>>>> should
>>>>>>>>>> use
>>>>>>>>>>>>>>>>>>>> Murmur3.
>>>>>>>>>>>>>>>>>>>>>> However, I don't quite like the implementation that
>> you
>>>>>>>>>> shared.
>>>>>>>>>>>>>>>> I
>>>>>>>>>>>>>>>>>>>> wonder if
>>>>>>>>>>>>>>>>>>>>>> we could make it work incrementally instead of
>> computing a
>>>>>>>>>> hash
>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> everything and combining them.
>>>>>>>>>>>>>>>>>>>>>> DJ03: Regarding the cache, my understanding is that
>> the
>>>>>>>> cache
>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> populated
>>>>>>>>>>>>>>>>>>>>>> when a topic without hash is seen in a HB request and
>> the
>>>>>>>>>> cache
>>>>>>>>>>>>>>>> is
>>>>>>>>>>>>>>>>>>>> cleaned
>>>>>>>>>>>>>>>>>>>>>> up when topics are deleted based on the metadata
>> image.
>>>>>>>>>>>>>>>> However, the
>>>>>>>>>>>>>>>>>>>> update
>>>>>>>>>>>>>>>>>>>>>> path is not clear. Let's say that a partition is
>> added to
>>>>>>> a
>>>>>>>>>>>>>>>> topic, how
>>>>>>>>>>>>>>>>>>>> does
>>>>>>>>>>>>>>>>>>>>>> it detect it? Let's also imagine that the racks of a
>>>>>>>> partition
>>>>>>>>>>>>>>>> have
>>>>>>>>>>>>>>>>>>>>>> changed, how does it detect it? In the KIP, it would
>> be
>>>>>>> nice
>>>>>>>>>> to
>>>>>>>>>>>>>>>> be clear
>>>>>>>>>>>>>>>>>>>>>> about those.
>>>>>>>>>>>>>>>>>>>>>> DJ04: I wonder whether we should go with a single
>> hash per
>>>>>>>>>>>>>>>> group. Your
>>>>>>>>>>>>>>>>>>>>>> argument against it is that it would require to
>> re-compute
>>>>>>>> the
>>>>>>>>>>>>>>>> hash of
>>>>>>>>>>>>>>>>>>>> all
>>>>>>>>>>>>>>>>>>>>>> the topics when it needs to be computed. In my
>> opinion, we
>>>>>>>>>> could
>>>>>>>>>>>>>>>>>>>> leverage
>>>>>>>>>>>>>>>>>>>>>> the cached hash per topic to compute the hash of all
>> the
>>>>>>>>>>>>>>>> subscribed
>>>>>>>>>>>>>>>>>>>> ones.
>>>>>>>>>>>>>>>>>>>>>> We could basically combine all the hashes without
>> having
>>>>>>> to
>>>>>>>>>>>>>>>> compute all
>>>>>>>>>>>>>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> them. This approach has a few benefits. 1) We could
>> get
>>>>>>> rid
>>>>>>>> of
>>>>>>>>>>>>>>>>>>>>>> the ConsumerGroupPartitionMetadata record as we could
>>>>>>> store
>>>>>>>>>> the
>>>>>>>>>>>>>>>> hash
>>>>>>>>>>>>>>>>>>>> with
>>>>>>>>>>>>>>>>>>>>>> the group epoch. 2) We could get rid of the Map that
>> we
>>>>>>> keep
>>>>>>>>>> in
>>>>>>>>>>>>>>>> each
>>>>>>>>>>>>>>>>>>>> group
>>>>>>>>>>>>>>>>>>>>>> to store the hashed corresponding to the subscribed
>>>>>>> topics.
>>>>>>>>>>>>>>>>>>>>>> DJ05: Regarding the cache again, I wonder if we should
>>>>>>>>>> actually
>>>>>>>>>>>>>>>> store
>>>>>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>>>> hash in the metadata image instead of maintaining it
>>>>>>>> somewhere
>>>>>>>>>>>>>>>> else. We
>>>>>>>>>>>>>>>>>>>>>> could still lazily compute it. The benefit is that the
>>>>>>> value
>>>>>>>>>>>>>>>> would be
>>>>>>>>>>>>>>>>>>>> tight
>>>>>>>>>>>>>>>>>>>>>> to the topic. I have not really looked into it. Would
>> it
>>>>>>> be
>>>>>>>> an
>>>>>>>>>>>>>>>> option?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> I'll be away for two weeks starting from Saturday. I
>>>>>>> kindly
>>>>>>>>>> ask
>>>>>>>>>>>>>>>> you to
>>>>>>>>>>>>>>>>>>>> wait
>>>>>>>>>>>>>>>>>>>>>> on me if we cannot conclude this week.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>> David
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2024 at 1:43 PM Frank Yang <
>>>>>>>> yangp...@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Hi Chia-Ping,
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks for the review and suggestions.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Q0: Add how rack change and how it affects topic
>>>>>>> partition.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Q1: Add why we need a balance algorithm to Motivation
>>>>>>>>>> section.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Q2: After checking again, we don’t need to update
>> cache
>>>>>>>> when
>>>>>>>>>>>>>>>> we replay
>>>>>>>>>>>>>>>>>>>>>>> records. We only need to renew it in consumer
>> heartbeat.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Q3: Add a new section “Topic Hash Function”.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>> PoAn
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Nov 1, 2024, at 4:39 PM, Chia-Ping Tsai <
>>>>>>>>>>>>>>>> chia7...@gmail.com>
>>>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> hi PoAn
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Thanks for for this KIP!
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Q0: Could you add more details about `A topic
>> partition
>>>>>>>> has
>>>>>>>>>>>>>>>> rack
>>>>>>>>>>>>>>>>>>>> change`?
>>>>>>>>>>>>>>>>>>>>>>>> IIRC, the "rack change" includes both follower and
>>>>>>> leader,
>>>>>>>>>>>>>>>> right?
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Q1: Could you please add the 'concerns' we
>> discussed to
>>>>>>>> the
>>>>>>>>>>>>>>>> Motivation
>>>>>>>>>>>>>>>>>>>>>>>> section? This should include topics like
>> 'computations'
>>>>>>>> and
>>>>>>>>>>>>>>>> 'space
>>>>>>>>>>>>>>>>>>>>>>> usage'.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Q2: `The group coordinator can leverage it to add a
>> new
>>>>>>>>>> topic
>>>>>>>>>>>>>>>>>>>> hash.`This
>>>>>>>>>>>>>>>>>>>>>>>> description seems a bit off to me. Why do we need to
>>>>>>>> update
>>>>>>>>>>>>>>>> the cache
>>>>>>>>>>>>>>>>>>>> at
>>>>>>>>>>>>>>>>>>>>>>>> this phase? The cache is intended to prevent
>> duplicate
>>>>>>>>>>>>>>>> computations
>>>>>>>>>>>>>>>>>>>>>>> caused
>>>>>>>>>>>>>>>>>>>>>>>> by heartbeat requests that occur between two
>> metadata
>>>>>>>> change
>>>>>>>>>>>>>>>> events.
>>>>>>>>>>>>>>>>>>>>>>>> Therefore, we could even remove the changed topics
>> from
>>>>>>>>>>>>>>>> caches on a
>>>>>>>>>>>>>>>>>>>>>>>> metadata change, as the first heartbeat request
>> would
>>>>>>>> update
>>>>>>>>>>>>>>>> the
>>>>>>>>>>>>>>>>>>>> caches
>>>>>>>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>>>>>>>>> all changed topics.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Q3: Could you please include a section about the
>> choice
>>>>>>> of
>>>>>>>>>>>>>>>> hash
>>>>>>>>>>>>>>>>>>>>>>>> implementation? The hash implementation must be
>>>>>>> consistent
>>>>>>>>>>>>>>>> across
>>>>>>>>>>>>>>>>>>>>>>> different
>>>>>>>>>>>>>>>>>>>>>>>> JDKs, so we use Murmur3 to generate the hash value.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>>>>>>>>>>> Chia-Ping
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Frank Yang <yangp...@gmail.com> 於 2024年11月1日 週五
>>>>>>> 下午3:57寫道:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> I would like to start a discussion thread on
>> KIP-1101.
>>>>>>>>>>>>>>>> Trigger
>>>>>>>>>>>>>>>>>>>> rebalance
>>>>>>>>>>>>>>>>>>>>>>>>> on rack topology changes. In this KIP, we aim to
>> use
>>>>>>> less
>>>>>>>>>>>>>>>> memory /
>>>>>>>>>>>>>>>>>>>> disk
>>>>>>>>>>>>>>>>>>>>>>>>> resources to detect rack changes in the new
>>>>>>> coordinator.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1101%3A+Trigger+rebalance+on+rack+topology+changes
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Please take a look and feel free to share any
>> thoughts.
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>>>>>>>>>>>>> PoAn
