[jira] [Comment Edited] (KAFKA-13419) sync group failed with rebalanceInProgress error might cause out-of-date ownedPartition in Cooperative protocol

Shawn Wang (Jira) Thu, 16 Jun 2022 23:13:06 -0700


    [ 
https://issues.apache.org/jira/browse/KAFKA-13419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17555390#comment-17555390
 ]


Shawn Wang edited comment on KAFKA-13419 at 6/17/22 6:12 AM:
-------------------------------------------------------------

Hi [~showuon] 

After I applied this fix, together with my previous change that makes this fix 
work (https://github.com/apache/kafka/pull/12140), what we are seeing is that 
consumers sometimes revoke almost all partitions even with the cooperative 
protocol enabled.

Details:
 * We have more than 1000 consumers using cooperative rebalancing. 
 * Just as in the example in this JIRA: in a cooperative rebalance, some 
consumers re-join very quickly after receiving the SyncGroupResponse. Any 
consumer that has not yet sent its SyncGroupRequest will then revoke all of 
its partitions and re-join.
 * Applying this change solves the problem of rebalancing for many rounds.
 * But it results in many partitions being revoked whenever there is a very 
fast re-joining consumer, which makes cooperative rebalancing almost the same 
as eager rebalancing.

So instead of "*resetStateAndRejoin* when *RebalanceInProgressException* 
errors happen in *sync group*", can we just treat the ownedPartition from the 
previous generation as valid if no other member claims the same partition? 
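
As a rough illustration of that idea, here is a minimal sketch. It is not the 
actual assignor or coordinator code: the class name OwnedPartitionCheck and the 
method uncontested are hypothetical, and partitions are represented as plain 
strings. Given the partitions each member claims to own when it re-joins, it 
keeps only the claims that no other member also makes.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Kafka code base.
final class OwnedPartitionCheck {

    /**
     * Returns the partitions claimed by memberId that no other member claims.
     * claimedByMember maps each member id to the partitions it reported as owned.
     */
    static Set<String> uncontested(String memberId,
                                   Map<String, Set<String>> claimedByMember) {
        // Start from a mutable copy of this member's own claims.
        Set<String> result =
            new HashSet<>(claimedByMember.getOrDefault(memberId, Set.of()));
        // Drop every partition that some other member also claims.
        for (Map.Entry<String, Set<String>> entry : claimedByMember.entrySet()) {
            if (!entry.getKey().equals(memberId)) {
                result.removeAll(entry.getValue());
            }
        }
        return result;
    }
}
```

Under this sketch, a claim left over from an older generation is kept as long 
as nobody else claims the same partition, so the member would not have to 
revoke it.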

 

What do you think?

Thanks a lot!



> sync group failed with rebalanceInProgress error might cause out-of-date 
> ownedPartition in Cooperative protocol
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-13419
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13419
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients
>    Affects Versions: 3.0.0
>            Reporter: Luke Chen
>            Assignee: Luke Chen
>            Priority: Major
>             Fix For: 3.1.0
>
>
> In KAFKA-13406, we found that a user got stuck while rebalancing with the 
> cooperative sticky assignor. The reason is that the "ownedPartition" is 
> out-of-date, so it failed the cooperative assignment validation.
> Investigating deeper, I found the root cause is that we didn't reset the 
> generation and state after sync group failed. In KAFKA-12983, we fixed the 
> issue that onJoinPrepare was not called in the resetStateAndRejoin method, 
> which caused the ownedPartition not to be cleared. But there is another case 
> in which the ownedPartition will be out-of-date. Here's the example:
>  # Consumer A joined and synced the group successfully with generation 1.
>  # A new rebalance started with generation 2. Consumer A joined successfully, 
> but for some reason did not send out sync group immediately.
>  # The other consumers completed sync group successfully in generation 2; 
> only consumer A did not.
>  # After consumer A sent out sync group, a new rebalance started with 
> generation 3, so consumer A got a REBALANCE_IN_PROGRESS error in the sync 
> group response.
>  # On receiving REBALANCE_IN_PROGRESS, consumer A re-joined the group with 
> generation 3, but with the assignment (ownedPartition) from generation 1.
>  # So now an out-of-date ownedPartition has been sent, with unexpected 
> results.
>  
> We might want to do *resetStateAndRejoin* when *RebalanceInProgressException* 
> errors happen in *sync group*. When we get a sync group error, it means 
> join group passed, and the other consumers (and the leader) might have 
> already completed this round of rebalance, so the assignment this consumer 
> holds is already out-of-date.
>  
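
The scenario in the quoted description can be sketched as a toy model. This is 
only an illustration under stated assumptions: ToyMember and its methods are 
hypothetical names, not the real coordinator classes, and the value -1 is an 
assumed stand-in for "no generation". It shows that a member which re-joins 
after a failed sync group without resetting its state still reports the 
partitions it owned in generation 1, while a reset clears them.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model for illustration only; not the Kafka consumer coordinator.
final class ToyMember {
    int generation = -1;                                  // assumed "no generation" marker
    final Set<String> ownedPartitions = new HashSet<>();

    /** A successful sync group: adopt the generation and its assignment. */
    void onSyncSuccess(int newGeneration, Set<String> assignment) {
        generation = newGeneration;
        ownedPartitions.clear();
        ownedPartitions.addAll(assignment);
    }

    /**
     * Sync group failed with REBALANCE_IN_PROGRESS. Returns the partitions the
     * member would report as owned when it re-joins, optionally resetting first.
     */
    Set<String> ownedPartitionsForRejoin(boolean resetState) {
        if (resetState) {
            generation = -1;
            ownedPartitions.clear();
        }
        return new HashSet<>(ownedPartitions);
    }
}
```

Without the reset, the re-join carries the generation 1 assignment into 
generation 3; with the reset, it carries nothing, which is the trade-off 
discussed in the comment above.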



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
