[ https://issues.apache.org/jira/browse/KAFKA-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sanskar Jhajharia updated KAFKA-19840:
--------------------------------------
Description:
The test has been flaky in recent runs:
[https://develocity.apache.org/scans/tests?search.names=CI%20workflow%2CGit%20repository&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=Asia%2FCalcutta&search.values=CI%2Chttps:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.clients.consumer.ShareConsumerTest&tests.sortField=FLAKY&tests.test=testShareGroupMaxSizeConfigExceeded()%5B1%5D]
After tracing the failures, the root cause was narrowed down to commit
[https://github.com/apache/kafka/commit/87657fdfc721055835f5b1f22151c461e85eab4a].
This change introduced a new try/catch around
{{handleCompletedAcknowledgements()}} in {{ShareConsumerImpl.java}}. The
side effect is that all exceptions coming from the commit callback, including
{{GroupMaxSizeReachedException}}, are now swallowed and only logged, so the
test never receives the exception on that path.
- If the ack callback fires while {{poll()}} is executing → the exception is
caught and swallowed → the test times out
- If the callback fires outside that path → the exception escapes → the test
passes
So the same test passes or fails depending on when the ack callback is
scheduled.
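The two outcomes above can be reproduced in isolation. The following is only a minimal sketch of the pattern, not the actual {{ShareConsumerImpl}} code: {{SketchConsumer}}, {{otherPath()}}, the local {{GroupMaxSizeReachedException}} stand-in and the log message are all hypothetical names introduced for illustration, and the real code paths may differ.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class SwallowedCallbackSketch {

    /** Hypothetical stand-in for the real GroupMaxSizeReachedException. */
    static class GroupMaxSizeReachedException extends RuntimeException {
        GroupMaxSizeReachedException(String message) {
            super(message);
        }
    }

    /** Hypothetical, heavily simplified consumer; NOT the real ShareConsumerImpl. */
    static class SketchConsumer {
        private final Queue<Runnable> completedAcks = new ArrayDeque<>();

        void enqueueCompletedAck(Runnable callback) {
            completedAcks.add(callback);
        }

        /** Runs every pending ack callback; a callback may throw. */
        private void handleCompletedAcknowledgements() {
            Runnable callback;
            while ((callback = completedAcks.poll()) != null) {
                callback.run();
            }
        }

        /** Path with the try/catch: callback exceptions are only logged. */
        void poll() {
            try {
                handleCompletedAcknowledgements();
            } catch (RuntimeException e) {
                // Illustrative log line; the real message is not reproduced here.
                System.err.println("Swallowed callback exception: " + e);
            }
        }

        /** Hypothetical path without the try/catch: the exception propagates. */
        void otherPath() {
            handleCompletedAcknowledgements();
        }
    }

    /** Returns true if the caller (e.g. a test) observes the exception. */
    static boolean callerSeesException(boolean viaPoll) {
        SketchConsumer consumer = new SketchConsumer();
        consumer.enqueueCompletedAck(() -> {
            throw new GroupMaxSizeReachedException("group is full");
        });
        try {
            if (viaPoll) {
                consumer.poll();
            } else {
                consumer.otherPath();
            }
            return false;
        } catch (GroupMaxSizeReachedException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("via poll(): " + callerSeesException(true));
        System.out.println("via other path: " + callerSeesException(false));
    }
}
```

In this sketch the caller sees the exception only when the callback runs outside the guarded path. A test waiting for that exception would therefore time out whenever the callback happens to run inside {{poll()}}.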
was:
The test has been observed as flaky in the recent runs:
[https://develocity.apache.org/scans/tests?search.names=CI%20workflow%2CGit%20repository&search.relativeStartTime=P28D&search.rootProjectNames=kafka&search.tags=github%2Ctrunk&search.tasks=test&search.timeZoneId=Asia%2FCalcutta&search.values=CI%2Chttps:%2F%2Fgithub.com%2Fapache%2Fkafka&tests.container=org.apache.kafka.clients.consumer.ShareConsumerTest&tests.sortField=FLAKY&tests.test=testShareGroupMaxSizeConfigExceeded()%5B1%5D]
After tracing the failures, the root cause was narrowed down to commit
[https://github.com/apache/kafka/commit/87657fdfc721055835f5b1f22151c461e85eab4a].
This change introduced a new try/catch around
{{handleCompletedAcknowledgements()}} in {{{}ShareConsumerImpl.java{}}}. The
side effect is that all exceptions coming from the commit callback — including
GroupMaxSizeReachedException — are now swallowed and only logged, preventing
the test from ever receiving the exception. I synced with Shiv and we think that
the resulting flakiness is timing dependent:
* If the ack callback fires while {{poll()}} is executing → exception is caught & swallowed → test times out
* If the callback fires outside that path → exception escapes → test passes
So the same test randomly passes/fails depending on scheduling of the ack
callback.
> Flaky test ShareConsumerTest.testShareGroupMaxSizeConfigExceeded
> ----------------------------------------------------------------
>
> Key: KAFKA-19840
> URL: https://issues.apache.org/jira/browse/KAFKA-19840
> Project: Kafka
> Issue Type: Sub-task
> Reporter: Sanskar Jhajharia
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.20.10#820010)