[ https://issues.apache.org/jira/browse/KAFKA-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555390#comment-17555390 ]
Shawn Wang edited comment on KAFKA-13419 at 6/17/22 6:12 AM:
-------------------------------------------------------------

Hi [~showuon]

After I applied this fix, together with my previous change that makes this fix work (https://github.com/apache/kafka/pull/12140), what we are seeing is that consumers sometimes revoke almost all partitions even with cooperative rebalance enabled.

Details:
* We have more than 1000 consumers, using cooperative rebalance.
* Just as in the example in this JIRA: in cooperative rebalance, some consumers do a very quick re-join after getting the SyncGroupResponse. If some consumers have not sent their SyncGroupRequest yet, those consumers will do a revoke-all and re-join.
* After applying this change, the problem of rebalancing for many rounds is solved.
* But it results in many partitions being revoked whenever there is a very fast re-joining consumer, which makes cooperative rebalance almost the same as eager rebalance.

So instead of "*resetStateAndRejoin* when *RebalanceInProgressException* errors happen in *sync group*", can we just treat the ownedPartition from the previous generation as legal if the same partition is not claimed by any other member?

What do you think?
Thanks a lot!

> sync group failed with rebalanceInProgress error might cause out-of-date
> ownedPartition in Cooperative protocol
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13419
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13419
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 3.0.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>             Fix For: 3.1.0
>
>
> In KAFKA-13406, we found that a user got stuck while rebalancing with the
> cooperative sticky assignor. The reason is that the "ownedPartition" was
> out-of-date, so it failed the cooperative assignment validation.
> Investigating deeper, I found the root cause is that we didn't reset the
> generation and state after sync group failed. In KAFKA-12983, we fixed the
> issue that onJoinPrepare was not called in the resetStateAndRejoin method,
> which caused the ownedPartition not to be cleared. But there's another case
> in which the ownedPartition will be out-of-date. Here's the example:
> # Consumer A joined and synced group successfully with generation 1
> # A new rebalance started with generation 2. Consumer A joined successfully,
> but somehow consumer A didn't send out sync group immediately
> # The other consumers completed sync group successfully in generation 2,
> except consumer A.
> # After consumer A sent out sync group, a new rebalance started with
> generation 3.
> So consumer A got a REBALANCE_IN_PROGRESS error in the sync group response
> # When receiving REBALANCE_IN_PROGRESS, we re-join the group with
> generation 3, but with the assignment (ownedPartition) from generation 1.
> # So now an out-of-date ownedPartition has been sent, with unexpected
> results
>
> We might want to do *resetStateAndRejoin* when *RebalanceInProgressException*
> errors happen in *sync group*. When we get a sync group error, it means
> that join group passed, and the other consumers (and the leader) might have
> already completed this round of rebalance. The assignment distribution this
> consumer has is already out-of-date.
>

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
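
The alternative raised in the comment above (keep trusting an ownedPartition from a previous generation as long as no other member claims the same partition) can be illustrated with a small, self-contained sketch. This is not Kafka's assignor code: the `Subscription` type, the `accepted_owned_partitions` function, and the partition names are hypothetical. The rule for conflicting claims (only a unique highest-generation claimant keeps the partition) is an assumption added for illustration and is not stated in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """Hypothetical model of what one member reports when it joins the group."""
    generation: int
    owned: frozenset  # partitions the member claims to own, e.g. "t-0"


def accepted_owned_partitions(subscriptions):
    """Return {member_id: set of owned partitions that would be trusted}.

    A claim is trusted when no other member claims the same partition,
    regardless of the claimant's generation. When several members claim one
    partition, only a unique highest-generation claimant keeps it (assumed rule).
    """
    # Index every claimed partition by the members that claim it.
    claimants = {}
    for member, sub in subscriptions.items():
        for partition in sub.owned:
            claimants.setdefault(partition, []).append(member)

    accepted = {member: set() for member in subscriptions}
    for partition, members in claimants.items():
        if len(members) == 1:
            # Uncontested claim: keep it even if the generation is stale.
            accepted[members[0]].add(partition)
            continue
        newest = max(subscriptions[m].generation for m in members)
        winners = [m for m in members if subscriptions[m].generation == newest]
        if len(winners) == 1:
            accepted[winners[0]].add(partition)
        # Otherwise nobody keeps the partition and it must be reassigned.
    return accepted


# Consumer A still reports its generation 1 ownership; B synced in generation 2.
subs = {
    "A": Subscription(generation=1, owned=frozenset({"t-0", "t-1"})),
    "B": Subscription(generation=2, owned=frozenset({"t-1", "t-2"})),
}
result = accepted_owned_partitions(subs)
```

In this example consumer A keeps `t-0`, which nobody else claims, and loses only the contested `t-1` to B, rather than revoking everything it owns.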