[ 
https://issues.apache.org/jira/browse/KAFKA-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17434389#comment-17434389
 ] 

Luke Chen edited comment on KAFKA-12984 at 10/26/21, 2:11 PM:
--------------------------------------------------------------

[~Andy_Dufresne], thanks for the log. It gave me some clues.

 

[~ableegoldman], here's my observation from the log.

At first, the rebalance (generation 24) completed successfully. Here's the 
final assignment:
{code:java}
- qa-qa-cf-executor-transform-e96cc4a8-e2a4-447a-b033-eb5b244a329a=[qa-qa-cf-events-8, qa-qa-cf-events-9, qa-qa-cf-events-24, qa-qa-cf-events-35], 
- qa-qa-cf-executor-transform-6f9acbca-885e-435c-97f5-6c86f476bd8e=[qa-qa-cf-events-1, qa-qa-cf-events-3, qa-qa-cf-events-5, qa-qa-cf-events-7], 
- qa-qa-cf-executor-transform-84d0f50e-0c97-4bf7-b4cc-311c6d3fbdf9=[qa-qa-cf-events-10, qa-qa-cf-events-11, qa-qa-cf-events-25], 
- qa-qa-cf-executor-transform-199c5bfb-85c8-4ecc-9604-30eb7988b9fe=[qa-qa-cf-events-6, qa-qa-cf-events-30, qa-qa-cf-events-32], 
- qa-qa-cf-executor-transform-c7decca7-e9d1-4d0b-b78e-3dfa5d71ea29=[qa-qa-cf-events-26, qa-qa-cf-events-28, qa-qa-cf-events-29], 
- qa-qa-cf-executor-transform-ac907fc0-33c9-4dfc-8f4d-f0e10b39e8ab=[qa-qa-cf-events-31, qa-qa-cf-events-36, qa-qa-cf-events-37], 
- qa-qa-cf-executor-transform-4c3d7d78-c1de-4d2d-bd18-580a442bc719=[qa-qa-cf-events-34, qa-qa-cf-events-38, qa-qa-cf-events-39], 
- qa-qa-cf-executor-transform-0e14afde-0788-44ef-8b5e-af7182f6a762=[qa-qa-cf-events-14, qa-qa-cf-events-15, qa-qa-cf-events-33], 
- qa-qa-cf-executor-transform-41f4f5c4-da68-40f1-911d-d006ede5c464=[qa-qa-cf-events-16, qa-qa-cf-events-17, qa-qa-cf-events-18, qa-qa-cf-events-19], 
- qa-qa-cf-executor-transform-b1777322-80ee-4f28-a224-bf97ee2b2a50=[qa-qa-cf-events-20, qa-qa-cf-events-21, qa-qa-cf-events-22, qa-qa-cf-events-23], 
- qa-qa-cf-executor-transform-5a95d1d8-0a47-4d32-bb6b-03531bb92765=[qa-qa-cf-events-0, qa-qa-cf-events-2, qa-qa-cf-events-4], 
- qa-qa-cf-executor-transform-2b6f692a-b557-4cc3-b461-f55fa3e25000=[qa-qa-cf-events-12, qa-qa-cf-events-13, qa-qa-cf-events-27]{code}
 

What I want to highlight here is the assignment for consumer 
qa-qa-cf-executor-transform-b1777322-80ee-4f28-a224-bf97ee2b2a50=[*qa-qa-cf-events-20,
 qa-qa-cf-events-21, qa-qa-cf-events-22, qa-qa-cf-events-23*]. This is the one 
that causes the issue.

The other consumer involved is 
qa-qa-cf-executor-transform-199c5bfb-85c8-4ecc-9604-30eb7988b9fe=[*qa-qa-cf-events-6,
 qa-qa-cf-events-30, qa-qa-cf-events-32*]. It does not seem to have received its 
final assignment in generation 24, so in the next round of rebalance its 
ownedPartitions would be empty (my guess).

My guess at what happens next: the subscription's "ownedPartitions" field still 
lists these 4 partitions for the consumer ending in "bf97ee2b2a50" above. But 
for some reason we do not find them in the subscription's data field (userData) 
after deserialization (maybe because its generation is not the highest). So, in 
the following assignment, we end up assigning the 4 partitions across 2 
consumers (one of which is the consumer itself). The error message from 
validateCooperativeAssignment supports this suspicion.
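
To make the guess above concrete, here is a minimal sketch of how a generation 
check could drop a member's claimed partitions. It is not Kafka's assignor 
code: the Claim class, the trustedOwnership method, and the generation numbers 
are all hypothetical and only illustrate "the generation is not the highest".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: names and generation values are hypothetical.
class StaleGenerationSketch {

    /** What a member claims: its owned partitions and the generation they came from. */
    static final class Claim {
        final int generation;
        final List<Integer> partitions;

        Claim(int generation, Integer... partitions) {
            this.generation = generation;
            this.partitions = Arrays.asList(partitions);
        }
    }

    /** Trusts a member's owned partitions only if its claim is from the highest generation seen. */
    static Map<String, List<Integer>> trustedOwnership(Map<String, Claim> claims) {
        int maxGeneration = claims.values().stream()
                .mapToInt(c -> c.generation)
                .max()
                .orElse(-1);
        Map<String, List<Integer>> trusted = new TreeMap<>();
        for (Map.Entry<String, Claim> entry : claims.entrySet()) {
            Claim claim = entry.getValue();
            trusted.put(entry.getKey(), claim.generation == maxGeneration
                    ? claim.partitions
                    : Collections.<Integer>emptyList());
        }
        return trusted;
    }

    /** Two members; the first one reports an older (assumed) generation. */
    static Map<String, List<Integer>> example() {
        Map<String, Claim> claims = new HashMap<>();
        claims.put("bf97ee2b2a50", new Claim(23, 20, 21, 22, 23)); // assumed stale generation
        claims.put("eb5b244a329a", new Claim(24, 8, 9, 24, 35));
        return trustedOwnership(claims);
    }

    public static void main(String[] args) {
        System.out.println(example()); // {bf97ee2b2a50=[], eb5b244a329a=[8, 9, 24, 35]}
    }
}
```

In this sketch the 4 partitions of "bf97ee2b2a50" look unowned to the assignor 
even though the member still lists them in ownedPartitions, which is the 
mismatch suspected above.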

 

Next, the generation 25 rebalance starts, with 3 tries. All 3 final 
assignments are logged, and I have highlighted the differences below:
*- 1st try*, with error:
 _With the COOPERATIVE protocol, owned partitions cannot be reassigned to other 
members; however the assignor has reassigned partitions [*qa-qa-cf-events-23, 
qa-qa-cf-events-21*] which are still owned by some members_

...

qa-qa-cf-executor-transform-199c5bfb-85c8-4ecc-9604-30eb7988b9fe=[qa-qa-cf-events-6,
 *qa-qa-cf-events-21*, *qa-qa-cf-events-23*]

...

qa-qa-cf-executor-transform-b1777322-80ee-4f28-a224-bf97ee2b2a50=[*qa-qa-cf-events-20,
 qa-qa-cf-events-22*, qa-qa-cf-events-30]

 
*- 2nd try*, with the same error as the 1st try, because the assignment is the 
same.

...

qa-qa-cf-executor-transform-199c5bfb-85c8-4ecc-9604-30eb7988b9fe=[qa-qa-cf-events-6,
 *qa-qa-cf-events-21, qa-qa-cf-events-23*]

...

qa-qa-cf-executor-transform-b1777322-80ee-4f28-a224-bf97ee2b2a50=[*qa-qa-cf-events-20,
 qa-qa-cf-events-22*, qa-qa-cf-events-30]

 
*- 3rd try*, with error:
 _With the COOPERATIVE protocol, owned partitions cannot be reassigned to other 
members; however the assignor has reassigned partitions [*qa-qa-cf-events-20, 
qa-qa-cf-events-23, qa-qa-cf-events-22*] which are still owned by some members_

...

qa-qa-cf-executor-transform-199c5bfb-85c8-4ecc-9604-30eb7988b9fe=[qa-qa-cf-events-6,
 *qa-qa-cf-events-20, qa-qa-cf-events-23*]

...

qa-qa-cf-executor-transform-d914fca8-dfe7-420c-98f7-ce7c44727fcd=[qa-qa-cf-events-19,
 *qa-qa-cf-events-22*, qa-qa-cf-events-32]  <-- new consumer

...

qa-qa-cf-executor-transform-b1777322-80ee-4f28-a224-bf97ee2b2a50=[qa-qa-cf-events-7,
 *qa-qa-cf-events-21*, qa-qa-cf-events-30]
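
The check behind the error above can be sketched roughly as follows: collect 
the partitions newly added to any member and the partitions revoked from any 
member, and flag those that appear in both sets. This is a simplified 
illustration, not the actual validateCooperativeAssignment code; the class and 
method names are hypothetical, only the two consumers shown for the 1st try are 
included, and the owned set of the consumer ending in "30eb7988b9fe" is assumed 
to be empty, following the guess earlier in this comment.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch only: class and method names are hypothetical, not Kafka's API.
class CooperativeValidationSketch {

    /** Partitions handed to a new member while another member still reports owning them. */
    static SortedSet<String> reassignedWhileOwned(Map<String, Set<String>> owned,
                                                  Map<String, Set<String>> assigned) {
        Set<String> totalAdded = new HashSet<>();
        Set<String> totalRevoked = new HashSet<>();
        for (Map.Entry<String, Set<String>> entry : assigned.entrySet()) {
            Set<String> before = owned.getOrDefault(entry.getKey(), Collections.<String>emptySet());
            Set<String> added = new HashSet<>(entry.getValue());
            added.removeAll(before);              // newly given to this member
            Set<String> revoked = new HashSet<>(before);
            revoked.removeAll(entry.getValue());  // taken away from this member
            totalAdded.addAll(added);
            totalRevoked.addAll(revoked);
        }
        totalAdded.retainAll(totalRevoked);       // added somewhere and revoked somewhere
        return new TreeSet<>(totalAdded);
    }

    private static Set<String> events(int... partitions) {
        Set<String> names = new HashSet<>();
        for (int p : partitions) {
            names.add("qa-qa-cf-events-" + p);
        }
        return names;
    }

    /** The two consumers logged for the 1st try of generation 25. */
    static SortedSet<String> firstTry() {
        Map<String, Set<String>> owned = new HashMap<>();
        owned.put("bf97ee2b2a50", events(20, 21, 22, 23));
        owned.put("30eb7988b9fe", events()); // assumption: empty, per the guess above
        Map<String, Set<String>> assigned = new HashMap<>();
        assigned.put("30eb7988b9fe", events(6, 21, 23));
        assigned.put("bf97ee2b2a50", events(20, 22, 30));
        return reassignedWhileOwned(owned, assigned);
    }

    public static void main(String[] args) {
        System.out.println(firstTry()); // [qa-qa-cf-events-21, qa-qa-cf-events-23]
    }
}
```

With the owned set of "30eb7988b9fe" assumed empty, the sketch flags exactly 
qa-qa-cf-events-21 and qa-qa-cf-events-23, the pair named in the 1st-try error. 
If that consumer had still reported partitions 6, 30 and 32 as owned, this 
sketch would flag qa-qa-cf-events-30 as well.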

 

 

====

So it looks like the out-of-date "ownedPartitions" in the subscription may not 
only cause the double-assignment issue, but also make the cooperative 
assignment validation fail. 

My suggestion:
 # In validateCooperativeAssignment, we should deserialize the subscription 
userData instead of using ownedPartitions directly.
 # If the assignor is our built-in assignor (i.e. CooperativeStickyAssignor), 
we skip the validation.

What do you think?


> Cooperative sticky assignor can get stuck with invalid SubscriptionState 
> input metadata
> ---------------------------------------------------------------------------------------
>
>                 Key: KAFKA-12984
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12984
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Blocker
>             Fix For: 2.8.1, 3.0.0
>
>         Attachments: image-2021-10-25-11-53-40-221.png, 
> log-events-viewer-result-kafka.numbers, logs-insights-results-kafka.csv, 
> logs-insights-results-kafka.numbers
>
>
> Some users have reported seeing their consumer group become stuck in the 
> CompletingRebalance phase when using the cooperative-sticky assignor. Based 
> on the request metadata we were able to deduce that multiple consumers were 
> reporting the same partition(s) in their "ownedPartitions" field of the 
> consumer protocol. Since this is an invalid state, the input causes the 
> cooperative-sticky assignor to detect that something is wrong and throw an 
> IllegalStateException. If the consumer application is set up to simply retry, 
> this will cause the group to appear to hang in the rebalance state.
> The "ownedPartitions" field is encoded based on the ConsumerCoordinator's 
> SubscriptionState, which was assumed to always be up to date. However there 
> may be cases where the consumer has dropped out of the group but fails to 
> clear the SubscriptionState, allowing it to report some partitions as owned 
> that have since been reassigned to another member.
> We should (a) fix the sticky assignment algorithm to resolve cases of 
> improper input conditions by invalidating the "ownedPartitions" in cases of 
> double ownership, and (b) shore up the ConsumerCoordinator logic to better 
> handle rejoining the group and keeping its internal state consistent. See 
> KAFKA-12983 for more details on (b)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
