[jira] [Updated] (KAFKA-19181) Investigate system test failures

Chirag Wadhwa (Jira) Thu, 24 Apr 2025 01:04:48 -0700


     [ 
https://issues.apache.org/jira/browse/KAFKA-19181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chirag Wadhwa updated KAFKA-19181:
----------------------------------
    Description: 
The nightly system test runs have reported failures in the following 2 tests:

1) test_share_multiple_partitions 

2) test_broker_failure

Investigation analysis - 

1) The first test runs 3 brokers and 3 share consumers (all part of the same 
group). A million messages are produced to a topic with 3 partitions. Once 
the messages are produced and consumed, the test asserts that every consumer 
has consumed some messages from every share partition. But with the new 
SimpleAssignor algorithm in place, some consumers are not assigned some 
partitions, so they don't consume from those share partitions, resulting in 
assertion failures.

Fix - change the test to first find the consumers' assignment using the 
kafka-share-groups.sh --describe command, and only include assertions for 
the share partitions assigned to each consumer.
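
A minimal sketch of the assertion change proposed in the fix, in Python. 
It assumes the assignment has already been obtained from 
kafka-share-groups.sh --describe and parsed into a dict; the parsing step, 
the function name find_unconsumed_assignments and the sample data are all 
hypothetical and not taken from the actual system test.

```python
from typing import Dict, List, Set, Tuple


def find_unconsumed_assignments(
    assignment: Dict[str, Set[int]],
    consumed: Dict[str, Dict[int, int]],
) -> List[Tuple[str, int]]:
    """Return (member, partition) pairs that were assigned but saw no messages.

    Partitions that a member was never assigned are ignored, which is the
    point of the proposed fix.
    """
    missing = []
    for member in sorted(assignment):
        for partition in sorted(assignment[member]):
            if consumed.get(member, {}).get(partition, 0) == 0:
                missing.append((member, partition))
    return missing


# Hypothetical data: each consumer is assigned only 2 of the 3 partitions.
assignment = {
    "consumer-1": {0, 1},
    "consumer-2": {1, 2},
    "consumer-3": {0, 2},
}
# Hypothetical per-member, per-partition message counts.
consumed = {
    "consumer-1": {0: 1200, 1: 900},
    "consumer-2": {1: 800, 2: 1500},
    "consumer-3": {0: 700, 2: 400},
}

# Passes even though no consumer read from all 3 partitions.
assert find_unconsumed_assignments(assignment, consumed) == []
```

With this shape, the test still fails if a consumer received nothing from 
a partition it was actually assigned.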

 

2) The bug occurs when the coordinator is not active while an initialize 
request is in progress. The consumer sends heartbeats to the broker, but none 
of them succeed. Initially a few of them fail with a 
{{COORDINATOR_NOT_AVAILABLE}} error. This is expected and should be fine 
because it is a transient error. But during this time, the broker keeps 
updating the memberEpoch for the member, while the response sent back to the 
member has memberEpoch 0. As I understand it, the client depends on the 
broker's response to update its memberEpoch, and thus the subsequent requests 
are also sent with memberEpoch 0. This happens for a couple of requests (the 
broker keeps increasing the memberEpoch but sends back a 
{{COORDINATOR_NOT_AVAILABLE}} error with memberEpoch 0). Finally, when the 
coordinator is active, we get a {{FENCED_MEMBER_EPOCH}} error, as expected. 
From then on, the member keeps sending heartbeats with the wrong memberEpoch 
and the broker keeps sending back {{FENCED_MEMBER_EPOCH}} errors.
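
The sequence described in 2) can be illustrated with a toy model in Python. 
This is not Kafka code: the classes ToyBroker and ToyClient and the function 
run are hypothetical, and the model encodes only the behaviour reported 
above (the broker-side epoch advances while error responses carry 
memberEpoch 0, and the client copies whatever epoch the response carries).

```python
class ToyBroker:
    """Hypothetical stand-in for the broker side of the heartbeat exchange."""

    def __init__(self):
        self.coordinator_active = False
        self.member_epoch = 0

    def heartbeat(self, request_epoch):
        if not self.coordinator_active:
            # As reported: the stored epoch is bumped, but the error
            # response carries memberEpoch 0.
            self.member_epoch += 1
            return ("COORDINATOR_NOT_AVAILABLE", 0)
        if request_epoch != self.member_epoch:
            # Assumption: the fenced response does not carry a usable epoch.
            return ("FENCED_MEMBER_EPOCH", 0)
        return ("NONE", self.member_epoch)


class ToyClient:
    """Hypothetical client that trusts the epoch in every response."""

    def __init__(self):
        self.member_epoch = 0

    def send_heartbeat(self, broker):
        error, epoch = broker.heartbeat(self.member_epoch)
        self.member_epoch = epoch
        return error


def run(unavailable_rounds, active_rounds):
    """Heartbeat while the coordinator is down, then after it comes up."""
    broker, client = ToyBroker(), ToyClient()
    errors = [client.send_heartbeat(broker) for _ in range(unavailable_rounds)]
    broker.coordinator_active = True
    errors += [client.send_heartbeat(broker) for _ in range(active_rounds)]
    return errors, client.member_epoch, broker.member_epoch


errors, client_epoch, broker_epoch = run(unavailable_rounds=2, active_rounds=3)
print(errors, client_epoch, broker_epoch)
```

In this model the client epoch stays at 0 while the broker epoch ends at 2, 
so every heartbeat after the coordinator becomes active is fenced.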

  was:
The nightly system test runs have reported failures in the following 2 tests:

1) test_share_multiple_partitions 

2) test_broker_failure

Investigation analysis - 

1) The first test runs 3 brokers and 3 share consumers (all part of the same 
group). A million messages are produced to a topic with 3 partitions. Once 
the messages are produced and consumed, the test asserts that every consumer 
has consumed some messages from every share partition. But with the new 
SimpleAssignor algorithm in place, some consumers are not assigned some 
partitions, so they don't consume from those share partitions, resulting in 
assertion failures.

Fix - change the test to first find the consumers' assignment using the 
kafka-share-groups.sh --describe command, and only include assertions for 
the share partitions assigned to each consumer.

 

2) In progress


> Investigate system test failures
> --------------------------------
>
>                 Key: KAFKA-19181
>                 URL: https://issues.apache.org/jira/browse/KAFKA-19181
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Chirag Wadhwa
>            Assignee: Chirag Wadhwa
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
