Asker created KAFKA-18007:
-----------------------------

Summary: MirrorCheckpointConnector fails with “Timeout while loading consumer groups” after upgrading to Kafka 3.9.0
Key: KAFKA-18007
URL: https://issues.apache.org/jira/browse/KAFKA-18007
Project: Kafka
Issue Type: Bug
Components: mirrormaker
Affects Versions: 3.9.0
Environment:
- Kafka Version: Upgraded sequentially from 3.6.0 to 3.9.0
- Clusters: Three clusters named A, B, and C
- Clusters A and B mirror topics to cluster C using MirrorMaker 2
- Number of Consumer Groups: Approximately 200
- Number of Topics: Approximately 2000
- Operating System: Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-135-generic x86_64)
Reporter: Asker

After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started experiencing repeated errors with the MirrorCheckpointConnector in MirrorMaker 2. The connector fails with a RetriableException stating “Timeout while loading consumer groups.” This issue persists despite several attempts to resolve it.

Error Message:
{code:bash}
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.lang.Thread.run(Thread.java:840)
{code}

Steps to Reproduce:
1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster C.
3. Start MirrorMaker 2.
4. Observe the logs for the MirrorCheckpointConnector.

What We Tried:

{*}Checked ACLs and Authentication{*}:
- Ensured that the mirror_maker user has the necessary permissions and can authenticate successfully.
- Verified that we could list consumer groups using kafka-consumer-groups.sh with the mirror_maker user.

{*}Increased Timeouts{*}:
- Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
- Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.
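
For illustration, the timeout settings listed above might look roughly like this in the MirrorMaker 2 properties file. This is only a sketch under stated assumptions: 300000 for admin.timeout.ms is the value given in this report, the values shown for admin.request.timeout.ms and admin.retry.backoff.ms are hypothetical placeholders (the report does not state them), and placing the keys at the top level of the file rather than under a cluster or flow prefix is also an assumption.
{code:none}
# Sketch only; key placement at the top level of the file is an assumption.

# Value stated in this report (5 minutes).
admin.timeout.ms = 300000

# Hypothetical placeholder values; the values actually tried are not stated.
admin.request.timeout.ms = 120000
admin.retry.backoff.ms = 1000
{code}
Raising these values did not change the outcome in our case, as the same RetriableException was still logged.
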
{*}Enabled Detailed Logging{*}:
- Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain more insight.
- The DEBUG output contained no additional information that helped resolve the issue.

{*}Temporary Workarounds{*}:
- Set emit.checkpoints.enabled and sync.group.offsets.enabled to false to prevent the MirrorCheckpointConnector from running.
- This is not a viable long-term solution, as we need to synchronize consumer group offsets.

Resolution:

{*}Rolled Back to Kafka 3.8.1{*}:
- As a test, we downgraded our Kafka clusters back to version 3.8.1.
- After the downgrade, the error disappeared, and the MirrorCheckpointConnector functioned correctly.
- This suggests that the issue was introduced in version 3.9.0.

Analysis:

{*}Possible Relation to KAFKA-17232{*}:
- We found the JIRA issue KAFKA-17232 titled “MirrorCheckpointConnector does not generate task configs if initial consumer group load times out.”
- It appears that the changes made for that issue in Kafka 3.9.0 may have inadvertently caused our problem.
- However, our clusters are not particularly large, and the initial consumer group load should not exceed the timeouts.

Request:

{*}Assistance in Resolving the Issue{*}:
- Is there a known workaround or configuration change that can prevent this error in Kafka 3.9.0?
- Could the changes made in KAFKA-17232 have unintentionally caused this problem?
- Are there plans to address this issue in an upcoming release?

{*}Guidance on Next Steps{*}:
- Should we avoid upgrading to versions beyond 3.8.1 until this issue is resolved?
- Is it advisable to apply any patches or pull requests manually?

Thank you for your attention to this matter. Please let me know if I can provide any additional information to help resolve this issue.

Best regards,
Asker Kakhramanov

--
This message was sent by Atlassian Jira
(v8.20.10#820010)