Hi PoAn, Thanks for your very detailed answers!
DJ05: This seems reasonable. We can look into it more deeply during the implementation too.

DJ07: Yeah, I see your point. I don't have a strong opinion on this, so I am happy to go with the majority. Could you please at least mention it in the Rejected Alternatives section? It would be good to have a trace of it.

Best,
David

On Wed, Apr 16, 2025 at 4:23 AM PoAn Yang <yangp...@gmail.com> wrote: > Hi Lucas, > > Thank you. Here is the vote thread > https://lists.apache.org/thread/j90htmphjk783p697jjfg1xt8thmy33p. > > Regards, > PoAn > > > On Apr 16, 2025, at 12:07 AM, Lucas Brutschy > > <lbruts...@confluent.io.INVALID> > wrote: > > > > Hi PoAn, > > > > from my side, we are good to move to a vote on this one. > > > > Cheers, > > Lucas > > > > On Thu, Apr 10, 2025 at 11:29 AM PoAn Yang <yangp...@gmail.com <mailto: > yangp...@gmail.com>> wrote: > >> > >> Hi David and Lucas, > >> > >> Thanks for the review and the great suggestions. > >> > >> DJ01: Added a statement that rack-aware rebalancing was removed in 4.0. > >> > >> DJ02: Changed ConsumerGroupPartitionMetadataKey and > >> ConsumerGroupPartitionMetadataValue from removed to deprecated. > >> > >> DJ03: In my sample implementation < https://github.com/apache/kafka/pull/17444>, the cache is maintained in > GroupCoordinatorManager. > >> A GroupCoordinatorShard only has one GroupCoordinatorManager, so I > think it's similar to > >> a cache per GroupCoordinatorShard instance. > >> > >> DJ04: Added the order in which topic data is combined to compute a > topic hash. > >> > >> DJ05: This is a good question. In the topic hash, Guava uses a "final > mix" to ensure an avalanche > >> effect. However, for the single hash of the subscribed topics in a group, we > cannot use a simple > >> sum, because we need to avoid the edge case where, for example, hash(a) + hash(b) and > hash(b) + hash(a) are the > >> same. To fix this, the hash function sorts topics by name and multiplies each > topic hash > >> by its position value, e.g. hash(a) + 2*hash(b) versus hash(b) + 2*hash(a). I added this > case and the regular expression case to the KIP. > >> > >> DJ06: It's good to add a tombstone for ConsumerGroupPartitionMetadata, so > log compaction > >> can remove the old data. Added this part to the compatibility section. > >> > >> DJ07: In a previous discussion, you mentioned that assignors are sticky. > If the topology of the > >> group is the same, the upgrade/downgrade rebalance doesn't change > anything. > >> > >>>>>>>>>>>> DJ08: In my opinion, changing the format will be rare so it is > >>>>>>>>>>>> acceptable if rebalances are triggered in this case on > >>>>>>>>>>>> upgrade/downgrade. It is also what will happen if a cluster is > >>>>>>>>>>>> downgraded from 4.1 (with this KIP) to 4.0. The rebalance > won't > >>>>>> change > >>>>>>>>>>>> anything if the topology of the group is the same because > >>>> assignors > >>>>>>>>>>>> are sticky. The default ones are and we recommend custom ones > to > >>>>>> also > >>>>>>>>>>>> be. > >> > >> With a `group.version` change, we need to maintain two code paths in 4.1. One > is subscription > >> metadata without rack information and the other is the group hash. Another drawback is > that changing > >> `group.version` will trigger rebalances for all groups at the same time. > IMHO, I would prefer > >> to keep the code simple. > >> > >> DJ08: Added ConsumerGroupHeartbeatRequestTest to the test plan. > >> > >> LB01: Added the removal of StreamsGroupPartitionMetadataKey / > >> StreamsGroupPartitionMetadataValue to the public interface change section. > >> > >> Thank you. 
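To illustrate the DJ05 point above: a minimal sketch of the sorted, position-weighted combination described there might look like the following. The class and method names are illustrative assumptions, not the actual code in the PR, and it assumes the per-topic hashes have already been computed.

    import java.util.Map;
    import java.util.TreeMap;

    public final class GroupHashExample {

        /**
         * Combines per-topic hashes into a single group-level hash by sorting the
         * subscribed topics by name and weighting each topic hash by its position,
         * so that swapping the hashes of two topics changes the result.
         */
        public static long combineGroupHash(Map<String, Long> topicHashes) {
            // TreeMap iterates in ascending topic-name order.
            TreeMap<String, Long> sorted = new TreeMap<>(topicHashes);
            long result = 0;
            int position = 1;
            for (Map.Entry<String, Long> entry : sorted.entrySet()) {
                result += position * entry.getValue();
                position++;
            }
            return result;
        }

        public static void main(String[] args) {
            // hash(a) + 2 * hash(b) differs from hash(b) + 2 * hash(a) when the hashes differ.
            System.out.println(combineGroupHash(Map.of("topic-a", 5L, "topic-b", 7L))); // 5 + 2*7 = 19
            System.out.println(combineGroupHash(Map.of("topic-a", 7L, "topic-b", 5L))); // 7 + 2*5 = 17
        }
    }

Sorting by topic name makes the result deterministic, and the position weight makes swapping two topics' hashes produce a different combined value, which a plain sum would not.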
> >> > >> Regards, > >> PoAn > >> > >>> On Apr 9, 2025, at 5:11 PM, Lucas Brutschy > >>> <lbruts...@confluent.io.INVALID> > wrote: > >>> > >>> Hi PoAn, > >>> > >>> LB01: please add the removal of StreamsGroupPartitionMetadataKey / > >>> StreamsGroupPartitionMetadataValue to the KIP. Since these were not > >>> yet included in 4.0, we can just remove them as long as this KIP hits > >>> 4.1. I can help with the integration into the `streamsGroupHeartbeat`, > >>> as it will be a little different, but we do not need to call this out > >>> in the KIP. > >>> > >>> Thanks for your work on this! > >>> > >>> Cheers, > >>> Lucas > >>> > >>> On Wed, Apr 9, 2025 at 9:59 AM David Jacot <dja...@confluent.io.invalid> > wrote: > >>>> > >>>> Hi PoAn, > >>>> > >>>> I finally had a bit of time to review your KIP. First, let me say > that the > >>>> KIP is well written! I have some more comments: > >>>> > >>>> DJ01: In the motivation, could you please mention that we actually > removed > >>>> triggering rebalances based on rack changes in 4.0? You explained > very well > >>>> why we removed it but it is not clear that we removed it. > >>>> > >>>> DJ02: We cannot delete ConsumerGroupPartitionMetadataKey > >>>> and ConsumerGroupPartitionMetadataValue because the coordinator must > still > >>>> be able to read the record from the log when the cluster is upgraded. > >>>> However, we can deprecate them. > >>>> > >>>> DJ03: Could you clarify how the cache will be scoped? My > understanding is > >>>> that we will maintain a cache per GroupCoordinatorShard instance. Is > this > >>>> correct? > >>>> > >>>> DJ04: Regarding the topic hash function, would it be possible to > specify > >>>> how the hash will be computed instead of providing the signature of > the > >>>> method? > >>>> > >>>> DJ05: I have a general question about how you propose to combine > hashes. Is > >>>> using a sum enough? I was looking at the Guava implementation and > they also > >>>> apply a "final mix" to ensure an avalanche effect. It would be great > to > >>>> elaborate a bit more on this in the KIP. > >>>> > >>>> DJ06: After the upgrade, we need to ensure that a tombstone is > written for > >>>> the ConsumerGroupPartitionMetadata record if it was present. I > suppose that > >>>> this is something that we could do when the next metadata hash is > computed. > >>>> Have you thought about it? > >>>> > >>>> DJ07: I wonder whether we should gate the change with > `group.version`. My > >>>> concern is that during upgrade, not all the coordinators will support > the > >>>> new way and groups may bounce between those coordinators as their > >>>> underlying __consumer_offsets partitions are moved within the cluster > >>>> (during the roll). This may trigger unnecessary rebalances. One way to > >>>> avoid this is to gate the change with `group.version` which is only > bumped > >>>> at the end of the cluster upgrade process. What do you think? > >>>> > >>>> DJ08: Regarding the test plan, we should also extend > >>>> ConsumerGroupHeartbeatRequestTest to cover the new feature. > >>>> > >>>> Thanks for your hard work on this one. Let's try to get it in 4.1. > >>>> > >>>> Best, > >>>> David > >>>> > >>>> On Mon, Mar 24, 2025 at 8:17 PM Jeff Kim > <jeff....@confluent.io.invalid> > >>>> wrote: > >>>> > >>>>> Hi PoAn, > >>>>> > >>>>> Thanks for the clarification! > >>>>> > >>>>> Best, > >>>>> Jeff > >>>>> > >>>>> On Mon, Mar 24, 2025 at 11:43 AM PoAn Yang <yangp...@gmail.com> > wrote: > >>>>> > >>>>>> Hi Jeff, > >>>>>> > >>>>>> Thanks for the review. 
> >>>>>> > >>>>>> JK01. Yes, we already maintain a `groupByTopics` data structure in the > >>>>>> coordinator. If a group leaves, > >>>>>> the coordinator checks whether the topic is still subscribed by any group. If it's not, the > >>>>>> topic hash is removed from the cache. > >>>>>> Please check the sample implementation of > >>>>>> GroupMetadataManager#unsubscribeGroupFromTopic > >>>>>> in #17444 <https://github.com/apache/kafka/pull/17444>. > >>>>>> > >>>>>> JK02: This KIP replaces subscription metadata with a group hash, so > the > >>>>>> coordinator calculates the group > >>>>>> hash where it previously calculated subscription metadata. The trigger points > are > >>>>>> an updated group member, > >>>>>> a new group member, new subscribed topic names, new regular > expressions, > >>>>>> and new metadata (a topic change). > >>>>>> > >>>>>> Best Regards, > >>>>>> PoAn > >>>>>> > >>>>>>> On Mar 22, 2025, at 3:25 AM, Jeff Kim > <jeff....@confluent.io.INVALID> > >>>>>> wrote: > >>>>>>> > >>>>>>> Hi PoAn, > >>>>>>> > >>>>>>> I was late to the discussion but I had some questions. > >>>>>>> > >>>>>>> JK01. I foresee scenarios where the cache leaves stale topic hashes > >>>>>> around > >>>>>>> when a group leaves. > >>>>>>> Should we maintain a counter for the number of groups subscribed > to the > >>>>>>> topic (hash) to only keep active topics? > >>>>>>> > >>>>>>> JK02. Can you help me understand when we would actually be > computing > >>>>>>> group-level hashes? Is it on every > >>>>>>> heartbeat? > >>>>>>> > >>>>>>> Thanks, > >>>>>>> Jeff > >>>>>>> > >>>>>>> On Wed, Mar 12, 2025 at 11:33 PM PoAn Yang <yangp...@gmail.com> > wrote: > >>>>>>> > >>>>>>>> Hi David, > >>>>>>>> > >>>>>>>> Yes, I will continue working on this. > >>>>>>>> > >>>>>>>> DJ02: As Kirk noted earlier, since the topic hash calculation is > on the > >>>>>> server > >>>>>>>> side (not the client side), we can safely introduce Guava as a > dependency. > >>>>>>>> > >>>>>>>> If there is no other discussion, we can start voting in > >>>>>>>> https://lists.apache.org/thread/j90htmphjk783p697jjfg1xt8thmy33p. > We > >>>>>> can > >>>>>>>> discuss implementation details in the PR > >>>>>>>> https://github.com/apache/kafka/pull/17444. I will resolve merge > >>>>>>>> conflicts by March 14th. > >>>>>>>> > >>>>>>>> Thanks, > >>>>>>>> PoAn > >>>>>>>> > >>>>>>>>> On Mar 13, 2025, at 12:43 AM, David Jacot <david.ja...@gmail.com > > > >>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hi PoAn, > >>>>>>>>> > >>>>>>>>> As we are getting closer to releasing 4.0, I think that we can > resume > >>>>>>>>> working on this one if you are still interested. Are you? I would > >>>>> like > >>>>>>>>> to have it in 4.1. > >>>>>>>>> > >>>>>>>>> Best, > >>>>>>>>> David > >>>>>>>>> > >>>>>>>>> On Wed, Jan 15, 2025 at 1:26 AM Kirk True <k...@kirktrue.pro> > wrote: > >>>>>>>>>> > >>>>>>>>>> Hi all, > >>>>>>>>>> > >>>>>>>>>> Hopefully a quick question... > >>>>>>>>>> > >>>>>>>>>> KT01. Will clients calculate the topic hash on the client? > Based on > >>>>>> the > >>>>>>>> current state of the KIP and PR, I would have thought "no", but I > ask > >>>>>> based > >>>>>>>> on the discussion around the possible use of Guava on the client. > >>>>>>>>>> > >>>>>>>>>> Thanks, > >>>>>>>>>> Kirk > >>>>>>>>>> > >>>>>>>>>> On Mon, Jan 6, 2025, at 9:11 AM, David Jacot wrote: > >>>>>>>>>>> Hi PoAn, > >>>>>>>>>>> > >>>>>>>>>>> Thanks for the update. I haven't read the updated KIP yet. > >>>>>>>>>>> > >>>>>>>>>>> DJ02: I am not sure about using Guava as a dependency. 
I > mentioned > it > more > >>>>>>>>>>> as an inspiration/reference. I suppose that we could use it on > the > >>>>>>>> server > >>>>>>>>>>> but we should definitely not use it on the client. I am not > sure > >>>>> how > >>>>>>>> others > >>>>>>>>>>> feel about it. > >>>>>>>>>>> > >>>>>>>>>>> Best, > >>>>>>>>>>> David > >>>>>>>>>>> > >>>>>>>>>>> On Mon, Jan 6, 2025 at 5:21 AM PoAn Yang <yangp...@gmail.com> > >>>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> Hi Chia-Ping / David / Lucas, > >>>>>>>>>>>> > >>>>>>>>>>>> Happy new year and thanks for the review. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ02: Thanks for the suggestion. I updated the PR to use > Guava. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ03: Yes, I updated the description to mention the ISR change, > >>>>>>>>>>>> added the altering-partition-reassignment case, and mentioned that > >>>>>>>>>>>> an unrelated topic change doesn't trigger a rebalance. > >>>>>>>>>>>> DJ03.1: Yes, I will keep using > ModernGroup#requestMetadataRefresh > >>>>>>>>>>>> to notify groups. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ06: Updated the PR to use the Guava Hashing#combineUnordered > >>>>>>>>>>>> function to combine topic hashes. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ07: Renamed it to MetadataHash. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ08: Added a sample hash function to the KIP and used the first > byte > >>>>> as a > >>>>>>>> magic > >>>>>>>>>>>> byte. This is also included in the latest PR. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ09: Added two paragraphs about upgrade and downgrade. > >>>>>>>>>>>> > >>>>>>>>>>>> DJ10: According to Lucas's comment, I added the > >>>>> StreamsGroupMetadataValue > >>>>>>>> update > >>>>>>>>>>>> to this KIP. > >>>>>>>>>>>> > >>>>>>>>>>>> Thanks, > >>>>>>>>>>>> PoAn > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> On Dec 20, 2024, at 3:58 PM, Chia-Ping Tsai < chia7...@gmail.com> > >>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> because assignors are sticky. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I forgot about that spec again :( > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> David Jacot <david.ja...@gmail.com> 於 2024年12月20日 週五 > 下午3:41寫道: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Hi Chia-Ping, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> DJ08: In my opinion, changing the format will be rare so it > is > >>>>>>>>>>>>>> acceptable if rebalances are triggered in this case on > >>>>>>>>>>>>>> upgrade/downgrade. It is also what will happen if a cluster > is > >>>>>>>>>>>>>> downgraded from 4.1 (with this KIP) to 4.0. The rebalance > won't > >>>>>>>> change > >>>>>>>>>>>>>> anything if the topology of the group is the same because > >>>>>> assignors > >>>>>>>>>>>>>> are sticky. The default ones are and we recommend custom > ones to > >>>>>>>> also > >>>>>>>>>>>>>> be. > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>> David > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Fri, Dec 20, 2024 at 2:11 AM Chia-Ping Tsai < > >>>>>> chia7...@apache.org > >>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> ummm, it does not work for downgrade as the old > coordinator has > >>>>>> no > >>>>>>>> idea > >>>>>>>>>>>>>> about the new format :( > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> On 2024/12/20 00:57:27 Chia-Ping Tsai wrote: > >>>>>>>>>>>>>>>> hi David > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ08: > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> That's a good question. If the "hash" lacks version > control, > >>>>> it > >>>>>>>> could > >>>>>>>>>>>>>> trigger a series of unnecessary rebalances. 
However, adding > >>>>>>>> additional > >>>>>>>>>>>> information ("magic") to the hash does not help the upgraded > >>>>>>>> coordinator > >>>>>>>>>>>> determine the "version." This means that the upgraded > >>>>> coordinator > >>>>>>>> would > >>>>>>>>>>>> still trigger unnecessary rebalances because it has no way > to > >>>>> know > >>>>>>>> which > >>>>>>>>>>>> format to use when comparing the hash. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Perhaps we can add a new field to > ConsumerGroupMetadataValue > >>>>> to > >>>>>>>>>>>>>> indicate the version of the "hash." This would allow the > >>>>>>>> coordinator, > >>>>>>>>>>>> when > >>>>>>>>>>>>>> handling subscription metadata, to compute the old hash and > >>>>>>>> determine > >>>>>>>>>>>>>> whether an epoch bump is necessary. Additionally, the > >>>>> coordinator > >>>>>>>> can > >>>>>>>>>>>>>> generate a new record to upgrade the hash without requiring > an > >>>>>> epoch > >>>>>>>>>>>> bump. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Another issue is whether the coordinator should cache all > >>>>>>>> versions of > >>>>>>>>>>>>>> the hash. I believe this is necessary; otherwise, during an > >>>>>> upgrade, > >>>>>>>>>>>> there > >>>>>>>>>>>>>> would be extensive recomputing of old hashes. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I believe this idea should also work for downgrades, and > >>>>> that's > >>>>>>>> just > >>>>>>>>>>>>>> my two cents. > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>> Chia-Ping > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>> On 2024/12/19 14:39:41 David Jacot wrote: > >>>>>>>>>>>>>>>>> Hi PoAn and Chia-Ping, > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Thanks for your responses. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ02: Sorry, I was not clear. I was wondering whether we > >>>>> could > >>>>>>>>>>>>>> compute the > >>>>>>>>>>>>>>>>> hash without having to convert to bytes first. Guava > has a > >>>>>> nice > >>>>>>>>>>>>>> interface > >>>>>>>>>>>>>>>>> for this that allows incrementally adding primitive types > to the > >>>>>>>> hash. > >>>>>>>>>>>>>> We can > >>>>>>>>>>>>>>>>> discuss this in the PR as it is an implementation detail. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ03: Thanks. I don't think that the replicas are updated > >>>>> when > >>>>>> a > >>>>>>>>>>>>>> broker > >>>>>>>>>>>>>>>>> shuts down. What you said applies to the ISR. I suppose > that > >>>>> we > >>>>>>>> can > >>>>>>>>>>>>>> rely on > >>>>>>>>>>>>>>>>> the ISR changes to trigger updates. It is also worth > noting > >>>>>>>>>>>>>>>>> that TopicsDelta#changedTopics is updated for every > change > >>>>>> (e.g. > >>>>>>>> ISR > >>>>>>>>>>>>>>>>> change, leader change, replicas change, etc.). I suppose > that > >>>>>> it > >>>>>>>> is > >>>>>>>>>>>>>> OK but > >>>>>>>>>>>>>>>>> it seems that it will trigger refreshes which are not > >>>>>> necessary. > >>>>>>>>>>>>>> However, a > >>>>>>>>>>>>>>>>> rebalance won't be triggered because the hash won't > change. > >>>>>>>>>>>>>>>>> DJ03.1: I suppose that we will continue to rely on > >>>>>>>>>>>>>>>>> ModernGroup#requestMetadataRefresh to notify groups that > must > >>>>>>>>>>>>>> refresh their > >>>>>>>>>>>>>>>>> hashes. Is my understanding correct? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ05: Fair enough. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ06: You mention in two places that you would like to > >>>>> combine > >>>>>>>>>>>>>> hashes by > >>>>>>>>>>>>>>>>> adding them. I wonder if this is a good practice. 
> >>>>>>>> Intuitively, > >>>>>>>>>>>> I would > >>>>>>>>>>>>>>>>> have used XOR or hashed the hashes. Guava has a method > for > >>>>>>>> combining > >>>>>>>>>>>>>>>>> hashes. It may be worth looking into the algorithm used. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ07: I would rename "AllTopicHash" to "MetadataHash" in > >>>>> order > >>>>>>>> to be > >>>>>>>>>>>>>> more > >>>>>>>>>>>>>>>>> generic. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ08: Regarding the per-topic hash, I wonder whether we > >>>>> should > >>>>>>>>>>>>>> specify in > >>>>>>>>>>>>>>>>> the KIP how we will compute it. I had the following in > mind: > >>>>>>>>>>>>>>>>> hash(topicName; numPartitions; [partitionId;sorted > racks]). > >>>>> We > >>>>>>>> could > >>>>>>>>>>>>>> also > >>>>>>>>>>>>>>>>> add a magic byte as the first element as a sort of > version. I > >>>>>> am > >>>>>>>> not > >>>>>>>>>>>>>> sure > >>>>>>>>>>>>>>>>> whether it is needed though. I was thinking about this > while > >>>>>>>>>>>>>> imagining how > >>>>>>>>>>>>>>>>> we would handle changing the format in the future. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ09: It would be great if we could provide more details > >>>>> about > >>>>>>>>>>>>>> backward > >>>>>>>>>>>>>>>>> compatibility. What happens when the cluster is upgraded > or > >>>>>>>>>>>>>> downgraded? > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> DJ10: We should update KIP-1071. It may be worth pinging > them > >>>>>> in > >>>>>>>> the > >>>>>>>>>>>>>>>>> discussion thread of KIP-1071. > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>> David > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>> On Tue, Dec 17, 2024 at 9:25 AM PoAn Yang < > >>>>> yangp...@gmail.com> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Hi Chia-Ping / David / Andrew, > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Thanks for the review and suggestions. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ01: Removed all implementation details. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ02: Does the "incrementally" mean that we only > calculate > >>>>> the > >>>>>>>>>>>>>> changed > >>>>>>>>>>>>>>>>>> parts? > >>>>>>>>>>>>>>>>>> For example, if the number of partitions changes, we only > >>>>>>>> calculate > >>>>>>>>>>>>>> the hash > >>>>>>>>>>>>>>>>>> of the number of partitions and reconstruct the topic > hash from it. > >>>>>>>>>>>>>>>>>> IMO, we only calculate the topic hash once. With the cache > >>>>>>>> mechanism, > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> value can be reused across different groups on the same > broker. > >>>>>>>>>>>>>>>>>> The CPU usage for this part is not very high. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ03: Added the update path to the KIP for both cases. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ04: Yes, it's a good idea. With the cache mechanism and a single > >>>>>>>> hash > >>>>>>>>>>>>>> per > >>>>>>>>>>>>>>>>>> group, we can balance CPU and disk usage. > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> DJ05: Currently, the topic hash is only used in the > coordinator. > >>>>>>>>>>>>>> However, the > >>>>>>>>>>>>>>>>>> metadata image is used in many different places. > >>>>>>>>>>>>>>>>>> How about we move the hash to the metadata image when we > find > >>>>> more > >>>>>>>> use > >>>>>>>>>>>>>> cases? > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> AS1, AS2: Thanks for the reminder. I will simply delete > >>>>>>>>>>>>>>>>>> ShareGroupPartitionMetadataKey/Value and add a new > field to > >>>>>>>>>>>>>>>>>> ShareGroupMetadataValue. 
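To make the DJ08 proposal above concrete, a hedged sketch of the per-topic hash layout (magic byte, topic name, number of partitions, then each partition id with its sorted racks) using Guava's Murmur3 could look like this. The method name and the rack representation are assumptions for illustration only, not the KIP's final definition:

    import com.google.common.hash.Hasher;
    import com.google.common.hash.Hashing;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public final class TopicHashExample {

        // Sketch of hash(magic; topicName; numPartitions; [partitionId; sorted racks]).
        public static long topicHash(String topicName, List<List<String>> racksByPartition) {
            Hasher hasher = Hashing.murmur3_128().newHasher()
                .putByte((byte) 0)                                   // magic byte as a format version
                .putString(topicName, StandardCharsets.UTF_8)        // topic name
                .putInt(racksByPartition.size());                    // number of partitions
            for (int partitionId = 0; partitionId < racksByPartition.size(); partitionId++) {
                hasher.putInt(partitionId);                          // partition id
                for (String rack : racksByPartition.get(partitionId).stream().sorted().toList()) {
                    hasher.putString(rack, StandardCharsets.UTF_8);  // sorted racks of this partition
                }
            }
            return hasher.hash().asLong();
        }
    }

The leading magic byte acts as a version marker, so a future change to the layout can be recognized without guessing which format produced a stored hash.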
> >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>> PoAn > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On Dec 17, 2024, at 5:50 AM, Andrew Schofield < > >>>>>>>>>>>>>>>>>> andrew_schofield_j...@outlook.com> wrote: > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Hi PoAn, > >>>>>>>>>>>>>>>>>>> Thanks for the KIP. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> AS1: From the point of view of share groups, the API > and > >>>>>> record > >>>>>>>>>>>>>> schema > >>>>>>>>>>>>>>>>>>> definitions are unstable in AK 4.0. In AK 4.1, we will > >>>>> start > >>>>>>>>>>>>>> supporting > >>>>>>>>>>>>>>>>>> proper > >>>>>>>>>>>>>>>>>>> versioning. As a result, I think you do not need to > >>>>> deprecate > >>>>>>>>>>>>>> the fields > >>>>>>>>>>>>>>>>>> in the > >>>>>>>>>>>>>>>>>>> ShareGroupPartitionMetadataValue. Just include the > schema > >>>>> for > >>>>>>>>>>>>>> the fields > >>>>>>>>>>>>>>>>>>> which are actually needed, and I'll update the schema > in > >>>>> the > >>>>>>>>>>>>>> code when > >>>>>>>>>>>>>>>>>>> the KIP is implemented. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> AS2: In the event that DJ04 actually removes the need > for > >>>>>>>>>>>>>>>>>>> ConsumerGroupPartitionMetadataKey/Value entirely, I > would > >>>>>>>> simply > >>>>>>>>>>>>>>>>>>> delete ShareGroupPartitionMetadataKey/Value, assuming > that > >>>>> it > >>>>>>>> is > >>>>>>>>>>>>>>>>>>> accepted in time for AK 4.1. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Thanks, > >>>>>>>>>>>>>>>>>>> Andrew > >>>>>>>>>>>>>>>>>>> ________________________________________ > >>>>>>>>>>>>>>>>>>> From: Chia-Ping Tsai <chia7...@apache.org> > >>>>>>>>>>>>>>>>>>> Sent: 16 December 2024 16:27 > >>>>>>>>>>>>>>>>>>> To: dev@kafka.apache.org <dev@kafka.apache.org> > >>>>>>>>>>>>>>>>>>> Subject: Re: [DISCUSS] KIP-1101: Trigger rebalance on > rack > >>>>>>>>>>>>>> topology > >>>>>>>>>>>>>>>>>> changes > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> hi David > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> DJ05 > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> One of the benefits of having a single hash per group > >>>>> (DJ04) > >>>>>> is > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> reduction in the size of stored data. Additionally, the > cost > >>>>>> of > >>>>>>>>>>>>>>>>>> re-computing can be minimized thanks to caching. So I'm > + 1 > >>>>> to > >>>>>>>>>>>>>> DJ04. > >>>>>>>>>>>>>>>>>> However, the advantage of storing the topic cache in the > >>>>>>>> metadata > >>>>>>>>>>>>>> image is > >>>>>>>>>>>>>>>>>> somewhat unclear to me. Could you please provide more > >>>>> details > >>>>>> on > >>>>>>>>>>>>>> what you > >>>>>>>>>>>>>>>>>> mean by "tight"?Furthermore, since the metadata image > is a > >>>>>>>>>>>>>> thread-safe > >>>>>>>>>>>>>>>>>> object, we need to ensure that the lazy initialization > is > >>>>> also > >>>>>>>>>>>>>> thread-safe. > >>>>>>>>>>>>>>>>>> If no other components require the cache, it would be > better > >>>>>> to > >>>>>>>>>>>>>> keep the > >>>>>>>>>>>>>>>>>> caches within the coordinator. > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>> Chia-Ping > >>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>> On 2024/12/16 14:01:35 David Jacot wrote: > >>>>>>>>>>>>>>>>>>>> Hi PoAn, > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Thanks for the KIP. I have some comments about it. > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> DJ01: Please, remove all the code from the KIP. 
We > only > care > >>>>>>>>>>>> about > >>>>>>>>>>>>>> public > >>>>>>>>>>>>>>>>>>>> interface changes, not about implementation details. > >>>>>>>>>>>>>>>>>>>> DJ02: Regarding the hash computation, I agree that we > >>>>> should > >>>>>>>> use > >>>>>>>>>>>>>>>>>> Murmur3. > >>>>>>>>>>>>>>>>>>>> However, I don't quite like the implementation that > you > >>>>>>>> shared. > >>>>>>>>>>>>>> I > >>>>>>>>>>>>>>>>>> wonder if > >>>>>>>>>>>>>>>>>>>> we could make it work incrementally instead of > computing a > >>>>>>>> hash > >>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>>>> everything and combining them. > >>>>>>>>>>>>>>>>>>>> DJ03: Regarding the cache, my understanding is that > the > >>>>>> cache > >>>>>>>> is > >>>>>>>>>>>>>>>>>> populated > >>>>>>>>>>>>>>>>>>>> when a topic without a hash is seen in a HB request and > the > >>>>>>>> cache > >>>>>>>>>>>>>> is > >>>>>>>>>>>>>>>>>> cleaned > >>>>>>>>>>>>>>>>>>>> up when topics are deleted based on the metadata > image. > >>>>>>>>>>>>>> However, the > >>>>>>>>>>>>>>>>>> update > >>>>>>>>>>>>>>>>>>>> path is not clear. Let's say that a partition is > added to > >>>>> a > >>>>>>>>>>>>>> topic, how > >>>>>>>>>>>>>>>>>> does > >>>>>>>>>>>>>>>>>>>> it detect it? Let's also imagine that the racks of a > >>>>>> partition > >>>>>>>>>>>>>> have > >>>>>>>>>>>>>>>>>>>> changed, how does it detect it? In the KIP, it would > be > >>>>> nice > >>>>>>>> to > >>>>>>>>>>>>>> be clear > >>>>>>>>>>>>>>>>>>>> about those. > >>>>>>>>>>>>>>>>>>>> DJ04: I wonder whether we should go with a single > hash per > >>>>>>>>>>>>>> group. Your > >>>>>>>>>>>>>>>>>>>> argument against it is that it would require re-computing > >>>>>> the > >>>>>>>>>>>>>> hash of > >>>>>>>>>>>>>>>>>> all > >>>>>>>>>>>>>>>>>>>> the topics when it needs to be computed. In my > opinion, we > >>>>>>>> could > >>>>>>>>>>>>>>>>>> leverage > >>>>>>>>>>>>>>>>>>>> the cached hash per topic to compute the hash of all > the > >>>>>>>>>>>>>> subscribed > >>>>>>>>>>>>>>>>>> ones. > >>>>>>>>>>>>>>>>>>>> We could basically combine all the hashes without > having > >>>>> to > >>>>>>>>>>>>>> compute all > >>>>>>>>>>>>>>>>>> of > >>>>>>>>>>>>>>>>>>>> them. This approach has a few benefits. 1) We could > get > >>>>> rid > >>>>>> of > >>>>>>>>>>>>>>>>>>>> the ConsumerGroupPartitionMetadata record as we could > >>>>> store > >>>>>>>> the > >>>>>>>>>>>>>> hash > >>>>>>>>>>>>>>>>>> with > >>>>>>>>>>>>>>>>>>>> the group epoch. 2) We could get rid of the Map that > we > >>>>> keep > >>>>>>>> in > >>>>>>>>>>>>>> each > >>>>>>>>>>>>>>>>>> group > >>>>>>>>>>>>>>>>>>>> to store the hashes corresponding to the subscribed > >>>>> topics. > >>>>>>>>>>>>>>>>>>>> DJ05: Regarding the cache again, I wonder if we should > >>>>>>>> actually > >>>>>>>>>>>>>> store > >>>>>>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>>>> hash in the metadata image instead of maintaining it > >>>>>> somewhere > >>>>>>>>>>>>>> else. We > >>>>>>>>>>>>>>>>>>>> could still lazily compute it. The benefit is that the > >>>>> value > >>>>>>>>>>>>>> would be > >>>>>>>>>>>>>>>>>> tight > >>>>>>>>>>>>>>>>>>>> to the topic. I have not really looked into it. Would > it > >>>>> be > >>>>>> an > >>>>>>>>>>>>>> option? > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> I'll be away for two weeks starting from Saturday. I > >>>>> kindly > >>>>>>>> ask > >>>>>>>>>>>>>> you to > >>>>>>>>>>>>>>>>>> wait > >>>>>>>>>>>>>>>>>>>> on me if we cannot conclude this week. 
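As a rough illustration of the cache lifecycle discussed in DJ03/DJ04 above (and JK01 earlier in the thread), a hypothetical per-GroupCoordinatorShard cache might be shaped like the following. The class and method names are invented for this sketch and are not the PR's GroupCoordinatorManager code:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    // Topic hashes are computed lazily on the heartbeat path, shared across the groups
    // owned by the shard, and dropped once no group subscribes to the topic or the topic
    // changes/disappears in the metadata image.
    public final class TopicHashCache {
        private final Map<String, Long> hashByTopic = new ConcurrentHashMap<>();
        private final Map<String, Integer> subscriberCount = new ConcurrentHashMap<>();

        // Heartbeat path: compute the topic hash only if it is not cached yet.
        public long getOrCompute(String topic, Function<String, Long> hashFn) {
            return hashByTopic.computeIfAbsent(topic, hashFn);
        }

        // A group starts subscribing to the topic.
        public void subscribe(String topic) {
            subscriberCount.merge(topic, 1, Integer::sum);
        }

        // A group stops subscribing; drop the hash once the last subscriber is gone (JK01).
        public void unsubscribe(String topic) {
            Integer remaining = subscriberCount.computeIfPresent(topic, (t, c) -> c > 1 ? c - 1 : null);
            if (remaining == null) {
                hashByTopic.remove(topic);
            }
        }

        // Metadata image reports a topic change or deletion: invalidate so the next
        // heartbeat recomputes the hash.
        public void invalidate(String topic) {
            hashByTopic.remove(topic);
        }
    }
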
> >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>>> David > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>> On Tue, Nov 5, 2024 at 1:43 PM Frank Yang < > >>>>>> yangp...@gmail.com > >>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Hi Chia-Ping, > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Thanks for the review and suggestions. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q0: Added how a rack change happens and how it affects topic > >>>>> partitions. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q1: Added why we need a balance algorithm to the Motivation > >>>>>>>> section. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q2: After checking again, we don't need to update the > cache > >>>>>> when > >>>>>>>>>>>>>> we replay > >>>>>>>>>>>>>>>>>>>>> records. We only need to renew it in the consumer > heartbeat. > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Q3: Added a new section, "Topic Hash Function". > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> Thanks. > >>>>>>>>>>>>>>>>>>>>> PoAn > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> On Nov 1, 2024, at 4:39 PM, Chia-Ping Tsai < > >>>>>>>>>>>>>> chia7...@gmail.com> > >>>>>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> hi PoAn > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Thanks for this KIP! > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q0: Could you add more details about `A topic > partition > >>>>>> has > >>>>>>>>>>>>>> rack > >>>>>>>>>>>>>>>>>> change`? > >>>>>>>>>>>>>>>>>>>>>> IIRC, the "rack change" includes both follower and > >>>>> leader, > >>>>>>>>>>>>>> right? > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q1: Could you please add the 'concerns' we > discussed to > >>>>>> the > >>>>>>>>>>>>>> Motivation > >>>>>>>>>>>>>>>>>>>>>> section? This should include topics like > 'computations' > >>>>>> and > >>>>>>>>>>>>>> 'space > >>>>>>>>>>>>>>>>>>>>> usage'. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q2: `The group coordinator can leverage it to add a > new > >>>>>>>> topic > >>>>>>>>>>>>>>>>>> hash.` This > >>>>>>>>>>>>>>>>>>>>>> description seems a bit off to me. Why do we need to > >>>>>> update > >>>>>>>>>>>>>> the cache > >>>>>>>>>>>>>>>>>> at > >>>>>>>>>>>>>>>>>>>>>> this phase? The cache is intended to prevent > duplicate > >>>>>>>>>>>>>> computations > >>>>>>>>>>>>>>>>>>>>> caused > >>>>>>>>>>>>>>>>>>>>>> by heartbeat requests that occur between two > metadata > >>>>>> change > >>>>>>>>>>>>>> events. > >>>>>>>>>>>>>>>>>>>>>> Therefore, we could even remove the changed topics > from > >>>>>>>>>>>>>> caches on a > >>>>>>>>>>>>>>>>>>>>>> metadata change, as the first heartbeat request > would > >>>>>> update > >>>>>>>>>>>>>> the > >>>>>>>>>>>>>>>>>> caches > >>>>>>>>>>>>>>>>>>>>> for > >>>>>>>>>>>>>>>>>>>>>> all changed topics. > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Q3: Could you please include a section about the > choice > >>>>> of > >>>>>>>>>>>>>> hash > >>>>>>>>>>>>>>>>>>>>>> implementation? The hash implementation must be > >>>>> consistent > >>>>>>>>>>>>>> across > >>>>>>>>>>>>>>>>>>>>> different > >>>>>>>>>>>>>>>>>>>>>> JDKs, so we use Murmur3 to generate the hash value. 
> >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>>>>>>>>>>> Chia-Ping > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>> Frank Yang <yangp...@gmail.com> 於 2024年11月1日 週五 > >>>>> 下午3:57寫道: > >>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Hi all, > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> I would like to start a discussion thread on > KIP-1101. > >>>>>>>>>>>>>> Trigger > >>>>>>>>>>>>>>>>>> rebalance > >>>>>>>>>>>>>>>>>>>>>>> on rack topology changes. In this KIP, we aim to > use > >>>>> less > >>>>>>>>>>>>>> memory / > >>>>>>>>>>>>>>>>>> disk > >>>>>>>>>>>>>>>>>>>>>>> resources to detect rack changes in the new > >>>>> coordinator. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>> > >>>>>> > >>>>> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-1101%3A+Trigger+rebalance+on+rack+topology+changes > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Please take a look and feel free to share any > thoughts. > >>>>>>>>>>>>>>>>>>>>>>> > >>>>>>>>>>>>>>>>>>>>>>> Thanks. > >>>>>>>>>>>>>>>>>>>>>>> PoAn > >