C0urante commented on PR #12366: URL: https://github.com/apache/kafka/pull/12366#issuecomment-1295343296
It turns out that the `testReplication` flakiness persisted in Jenkins, and was not solved by increasing timeouts. Instead, the root of the problem was a change in the Connect framework's behavior when exactly-once support is enabled. Without exactly-once support, `SourceTask::commitRecord` is invoked [as soon as the record is ack'd](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/WorkerSourceTask.java#L148) by the Kafka cluster, which usually causes those calls to be spread out over time. With exactly-once support, `SourceTask::commitRecord` is invoked [for every record in a transaction](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/ExactlyOnceWorkerSourceTask.java#L312) once that transaction is committed, which causes a rapid series of calls to take place one after the other. MirrorMaker 2 triggers a (potential) offset sync after every call to `commitRecord`, but it has [logic to prevent too many outstanding offset syncs from accruing](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java#L208-L211). The exact limit on the number of outstanding offset requests [is ten](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorSourceTask.java#L53), which is less than the total number of topic partitions being replicated during the integration test. As a result, the test became flaky, since sometimes MM2 would drop an offset sync for partition 0 of the `test-topic-1` and then fail when [checking for offset syncs for that topic partition](https://github.com/apache/kafka/blob/c1bb307a361b7b6e17261cc84ea2b108bacca84d/connect/mirror/src/test/java/org/apache/kafka/conn ect/mirror/integration/MirrorConnectorsIntegrationBaseTest.java#L319-L320). Since the behavior change in the Connect framework may have an impact on MirrorMaker 2 outside of testing environments, I've tweaked the offset sync limit to apply on a per-topic-partition basis. This way, if a flurry of calls to `commitRecord` takes place when a transaction is committed, every topic partition should still get a chance for an offset sync, but there is still an upper bound on the number of outstanding offset syncs (although that bound is now proportional to the number of topic partitions being replicated, instead of a constant). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: jira-unsubscr...@kafka.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org