[ https://issues.apache.org/jira/browse/KAFKA-17493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881001#comment-17881001 ]
Sagar Rao edited comment on KAFKA-17493 at 9/11/24 3:47 PM:
------------------------------------------------------------

[~dajac], [~ChrisEgerton] I took a look at the logs for [testGetSinkConnectorOffsets|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1725681599999&search.startTimeMin=1724731200000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.OffsetsApiIntegrationTest&tests.sortField=FLAKY&tests.test=testGetSinkConnectorOffsets()]. I noticed a couple of differences which may contribute to the flakiness (not totally sure at this point):

1) In the passing runs, the test spins up a new Connect cluster. When that happens, I see [verifyClusterReadiness|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/test/java/org/apache/kafka/connect/util/clusters/EmbeddedKafkaCluster.java#L181] getting triggered, which checks whether the Kafka cluster is ready and whether an Admin client is able to perform admin operations. In the failing runs, I don't see that check; instead we reuse an existing Connect cluster as per [this|#L129].

2) In the failed test, the connector comes up properly up to this point, but it appears to me that it gets stuck when trying to read the offsets using the Admin client [here|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L1234-L1252]. I see the same line in the stack trace as well:

```
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
at org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:397)
at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:445)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:394)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:378)
at org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:999)
at org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.getAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:226)
at org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testGetSinkConnectorOffsets(OffsetsApiIntegrationTest.java:173)
at java.lang.reflect.Method.invoke(Method.java:569)
at java.util.ArrayList.forEach(ArrayList.java:1511)
at java.util.ArrayList.forEach(ArrayList.java:1511)
```

We are trying to use the AdminClient to read the sink connector offsets [here|#L1234-L1252]. There's not much indication in the logs as to why this is happening.
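
For anyone reading the stack trace above: the assertion is raised by a poll-until-timeout helper, so a call that never returns the expected offsets surfaces as a timed-out condition rather than as an error from the Admin client itself. A minimal sketch of that pattern follows. It is not Kafka's actual TestUtils code; `ConditionPoller`, its method signature, and the stand-in conditions are hypothetical names chosen for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper (not Kafka's TestUtils): polls a condition until it
// holds or until a deadline passes.
final class ConditionPoller {
    private ConditionPoller() { }

    // Returns true if the condition became true before the timeout,
    // false if the deadline passed or the thread was interrupted.
    static boolean waitForCondition(BooleanSupplier condition, long timeoutMs, long pollIntervalMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0) {
                return false;
            }
            try {
                // Never sleep past the deadline.
                Thread.sleep(Math.min(pollIntervalMs, remainingMs));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger polls = new AtomicInteger();
        // Stand-in for "the expected offsets were returned": true on the third poll.
        boolean ready = waitForCondition(() -> polls.incrementAndGet() >= 3, 1_000, 10);
        // Stand-in for the failing run: the condition never becomes true.
        boolean stuck = waitForCondition(() -> false, 100, 10);
        System.out.println("ready=" + ready + ", stuck=" + stuck);
    }
}
```

In this sketch the second call is the analogue of the failing run: the helper only reports that the deadline passed, which matches the lack of any more specific indication in the logs.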

was (Author: sagarrao):
[~dajac], [~ChrisEgerton] I took a look at the logs for [testGetSinkConnectorOffsets|https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1725681599999&search.startTimeMin=1724731200000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.OffsetsApiIntegrationTest&tests.sortField=FLAKY&tests.test=testGetSinkConnectorOffsets()]. I noticed a couple of differences which may contribute to the flakiness (not totally sure at this point):

1) In the passing runs, the test spins up a new Connect cluster. When that happens, I see [verifyClusterReadiness|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/test/java/org/apache/kafka/connect/util/clusters/EmbeddedKafkaCluster.java#L181] getting triggered, which checks whether the Kafka cluster is ready and whether an Admin client is able to perform admin operations. In the failing runs, I don't see that check; instead we reuse an existing Connect cluster as per [this|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/test/java/org/apache/kafka/connect/integration/OffsetsApiIntegrationTest.java#L129].

2) In the failed test, the connector comes up properly up to this point, but it appears to me that it gets stuck when trying to read the offsets using the Admin client [here|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/test/java/org/apache/kafka/connect/integration/OffsetsApiIntegrationTest.java#L226].
I see the same line in the stacktrace as well:

```
at org.junit.jupiter.api.AssertionFailureBuilder.build(AssertionFailureBuilder.java:151)
at org.junit.jupiter.api.AssertionFailureBuilder.buildAndThrow(AssertionFailureBuilder.java:132)
at org.junit.jupiter.api.AssertTrue.failNotTrue(AssertTrue.java:63)
at org.junit.jupiter.api.AssertTrue.assertTrue(AssertTrue.java:36)
at org.junit.jupiter.api.Assertions.assertTrue(Assertions.java:214)
at org.apache.kafka.test.TestUtils.lambda$waitForCondition$3(TestUtils.java:397)
at org.apache.kafka.test.TestUtils.retryOnExceptionWithTimeout(TestUtils.java:445)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:394)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:378)
at org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.verifyExpectedSinkConnectorOffsets(OffsetsApiIntegrationTest.java:999)
at org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.getAndVerifySinkConnectorOffsets(OffsetsApiIntegrationTest.java:226)
at org.apache.kafka.connect.integration.OffsetsApiIntegrationTest.testGetSinkConnectorOffsets(OffsetsApiIntegrationTest.java:173)
at java.lang.reflect.Method.invoke(Method.java:569)
at java.util.ArrayList.forEach(ArrayList.java:1511)
at java.util.ArrayList.forEach(ArrayList.java:1511)
```

We are trying to use the AdminClient to read the sink connector offsets [here|https://github.com/apache/kafka/blob/trunk/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/Worker.java#L1234-L1252]. There's not much indication in the logs as to why this is happening.

> Sink connector-related OffsetsApiIntegrationTest suite test cases failing more frequently with new consumer/group coordinator
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-17493
> URL: https://issues.apache.org/jira/browse/KAFKA-17493
> Project: Kafka
> Issue Type: Test
> Components: connect, consumer, group-coordinator
> Reporter: Chris Egerton
> Priority: Major
>
> We recently updated trunk to use the new KIP-848 consumer/group coordinator
> by default, which appears to have led to an uptick in flakiness for the
> OffsetsApiIntegrationTest suite for Connect (specifically, the test cases
> that use sink connectors, which makes sense since they're the type of
> connector that uses a consumer group under the hood).
> Gradle Enterprise shows that in the week before that update was made, the
> test suite had a flakiness rate of about 4%
> (https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1724558400000&search.startTimeMin=1723953600000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.*&tests.sortField=FLAKY),
> and in the week and a half since, the flakiness rate has jumped to 17%
> (https://ge.apache.org/scans/tests?search.rootProjectNames=kafka&search.startTimeMax=1725681599999&search.startTimeMin=1724731200000&search.tags=trunk&search.timeZoneId=America%2FNew_York&tests.container=org.apache.kafka.connect.integration.*&tests.sortField=FLAKY).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)