[ https://issues.apache.org/jira/browse/KAFKA-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881173#comment-17881173 ]
Sagar Rao commented on KAFKA-17493: ----------------------------------- [~ChrisEgerton], sorry, my bad. Yes, I do see that the ListOffsets call keeps returning empty offsets until the timeout fires. I grepped the Group Coordinator logs for the flaky and non-flaky cases, and what I notice is that in the flaky case, the sink task's consumer group never reached 2 members. These are the lines from the flaky test:
{code:java}
[2024-09-06 21:59:57,843] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Dynamic member with unknown member id joins group connect-testGetSinkConnectorOffsets in Empty state. Created a new member id connector-consumer-testGetSinkConnectorOffsets-0-a47aa5b3-d9d8-4aa6-ab30-7f79c971b6ee and requesting the member to rejoin with this id. (org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 21:59:57,844] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Dynamic member with unknown member id joins group connect-testGetSinkConnectorOffsets in Empty state. Created a new member id connector-consumer-testGetSinkConnectorOffsets-1-4c065deb-6771-427d-902b-be788543b7bd and requesting the member to rejoin with this id. (org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 21:59:57,844] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Preparing to rebalance group connect-testGetSinkConnectorOffsets in state PreparingRebalance with old generation 0 (reason: Adding new member connector-consumer-testGetSinkConnectorOffsets-0-a47aa5b3-d9d8-4aa6-ab30-7f79c971b6ee with group instance id null; client reason: need to re-join with the given member-id: connector-consumer-testGetSinkConnectorOffsets-0-a47aa5b3-d9d8-4aa6-ab30-7f79c971b6ee). (org.apache.kafka.coordinator.group.GroupMetadataManager:4673)
[2024-09-06 21:59:57,844] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Stabilized group connect-testGetSinkConnectorOffsets generation 1 with 1 members. (org.apache.kafka.coordinator.group.GroupMetadataManager:4383){code}
Even though 2 members tried to join, the group never stabilized with 2 members. Contrast this with the passing case:
{code:java}
[2024-09-06 22:22:47,577] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Dynamic member with unknown member id joins group connect-testGetSinkConnectorOffsets in Empty state. Created a new member id connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd and requesting the member to rejoin with this id. (org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 22:22:47,579] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Preparing to rebalance group connect-testGetSinkConnectorOffsets in state PreparingRebalance with old generation 0 (reason: Adding new member connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd with group instance id null; client reason: need to re-join with the given member-id: connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd). (org.apache.kafka.coordinator.group.GroupMetadataManager:4673)
[2024-09-06 22:22:47,580] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Stabilized group connect-testGetSinkConnectorOffsets generation 1 with 1 members. (org.apache.kafka.coordinator.group.GroupMetadataManager:4383)
[2024-09-06 22:22:47,580] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Dynamic member with unknown member id joins group connect-testGetSinkConnectorOffsets in CompletingRebalance state. Created a new member id connector-consumer-testGetSinkConnectorOffsets-0-b5112649-3008-432e-a1eb-a10593d049b3 and requesting the member to rejoin with this id. (org.apache.kafka.coordinator.group.GroupMetadataManager:4111)
[2024-09-06 22:22:47,581] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Assignment received from leader connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd for group connect-testGetSinkConnectorOffsets for generation 1. The group has 1 members, 0 of which are static. (org.apache.kafka.coordinator.group.GroupMetadataManager:5142)
[2024-09-06 22:22:47,582] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Preparing to rebalance group connect-testGetSinkConnectorOffsets in state PreparingRebalance with old generation 1 (reason: Adding new member connector-consumer-testGetSinkConnectorOffsets-0-b5112649-3008-432e-a1eb-a10593d049b3 with group instance id null; client reason: need to re-join with the given member-id: connector-consumer-testGetSinkConnectorOffsets-0-b5112649-3008-432e-a1eb-a10593d049b3). (org.apache.kafka.coordinator.group.GroupMetadataManager:4673)
[2024-09-06 22:22:47,583] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Stabilized group connect-testGetSinkConnectorOffsets generation 2 with 2 members. (org.apache.kafka.coordinator.group.GroupMetadataManager:4383)
[2024-09-06 22:22:47,584] INFO [GroupCoordinator id=0 topic=__consumer_offsets partition=45] Assignment received from leader connector-consumer-testGetSinkConnectorOffsets-1-a6cc10ec-9258-4293-8b9d-d240fe89e4fd for group connect-testGetSinkConnectorOffsets for generation 2. The group has 2 members, 0 of which are static. (org.apache.kafka.coordinator.group.GroupMetadataManager:5142) {code}
So this seems in line with what Chris mentioned above. I am attaching the grepped Group Coordinator logs for reference.
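Instead of eyeballing the grepped logs for the "Stabilized group ... with N members" lines, the distinguishing signal (the group stabilizing at 1 member in the flaky run vs. 2 in the passing run) can be checked mechanically. A minimal sketch in Java; the class and method names here are hypothetical helpers of mine, not part of Kafka or the test suite:

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans grepped GroupMetadataManager log lines and reports the largest
// member count the given group ever stabilized with. A flaky run of
// testGetSinkConnectorOffsets would report 1; a passing run, 2.
public class StabilizedGroupScanner {

    // Matches e.g. "Stabilized group connect-testGetSinkConnectorOffsets generation 2 with 2 members."
    private static final Pattern STABILIZED = Pattern.compile(
            "Stabilized group (\\S+) generation (\\d+) with (\\d+) members");

    public static int maxStabilizedMembers(List<String> logLines, String group) {
        int max = 0;
        for (String line : logLines) {
            Matcher m = STABILIZED.matcher(line);
            if (m.find() && m.group(1).equals(group)) {
                max = Math.max(max, Integer.parseInt(m.group(3)));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String group = "connect-testGetSinkConnectorOffsets";
        List<String> flaky = List.of(
                "Stabilized group " + group + " generation 1 with 1 members.");
        List<String> passing = List.of(
                "Stabilized group " + group + " generation 1 with 1 members.",
                "Stabilized group " + group + " generation 2 with 2 members.");
        System.out.println(maxStabilizedMembers(flaky, group));   // prints 1
        System.out.println(maxStabilizedMembers(passing, group)); // prints 2
    }
}
```

A check like this could be folded into the test's wait condition so that the ListOffsets call is only attempted once the group has actually stabilized with the expected member count.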
One difference between the 2 cases (see [^flaky-tests-gc.txt]) is that, as I mentioned in the note above, the flaky test reuses an existing Connect/Kafka cluster where we first need to delete the existing topic etc., while in the passing test everything starts fresh.

> Sink connector-related OffsetsApiIntegrationTest suite test cases failing
> more frequently with new consumer/group coordinator
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17493
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17493
>             Project: Kafka
>          Issue Type: Test
>          Components: connect, consumer, group-coordinator
>            Reporter: Chris Egerton
>            Priority: Major
>         Attachments: flaky-tests-gc.txt, passing-tests-gc.txt
>
>
> We recently updated trunk to use the new KIP-848 consumer/group coordinator
> by default, which appears to have led to an uptick in flakiness for the
> OffsetsApiIntegrationTest suite for Connect (specifically, the test cases
> that use sink connectors, which makes sense since they're the type of
> connector that uses a consumer group under the hood).
> Gradle Enterprise shows that in the week before that update was made, the
> test suite had a flakiness rate of about 4%
> (https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1724558400000&search.startTimeMin=1723953600000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.*&tests.sortField=FLAKY),
> and in the week and a half since, the flakiness rate has jumped to 17%
> (https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1725681599999&search.startTimeMin=1724731200000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.*&tests.sortField=FLAKY).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)