Asker created KAFKA-18007:
-----------------------------

             Summary: MirrorCheckpointConnector fails with “Timeout while 
loading consumer groups” after upgrading to Kafka 3.9.0
                 Key: KAFKA-18007
                 URL: https://issues.apache.org/jira/browse/KAFKA-18007
             Project: Kafka
          Issue Type: Bug
          Components: mirrormaker
    Affects Versions: 3.9.0
         Environment: - Kafka Version: Upgraded sequentially from 3.6.0 to 3.9.0
- Clusters: Three clusters named A, B, and C
- Clusters A and B mirror topics to cluster C using MirrorMaker 2
- Number of Consumer Groups: Approximately 200
- Number of Topics: Approximately 2000
- Operating System: Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-135-generic x86_64)
            Reporter: Asker


After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started 
experiencing repeated errors with the MirrorCheckpointConnector in MirrorMaker 
2. The connector fails with a RetriableException stating “Timeout while loading 
consumer groups.” This issue persists despite several attempts to resolve it.
Error Message:
{code:bash}
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker 
clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to 
reconfigure connector's tasks (MirrorCheckpointConnector), retrying after 
backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]: 
org.apache.kafka.connect.errors.RetriableException: Timeout while loading 
consumer groups.
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech 
connect-mirror-maker.sh[2526630]:         at 
java.base/java.lang.Thread.run(Thread.java:840){code}
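The mailer wrapped each journald record above across several lines, which makes the trace hard to grep. A minimal sketch for rejoining the records and stripping the repeated prefix follows. It assumes every record starts with a `Mon DD HH:MM:SS host` stamp followed by a `unit[pid]:` tag, as in the excerpt above. The helper names `unwrap` and `message` are hypothetical and not part of any Kafka tooling.

```python
import re

# Journald-style record start, e.g. "Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech".
RECORD_START = re.compile(r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \S+")
# Unit tag that follows the host, e.g. "connect-mirror-maker.sh[2526630]:".
UNIT_TAG = re.compile(r"^\S+\[\d+\]:\s*")


def unwrap(lines):
    """Rejoin records that were hard-wrapped across several lines.

    Assumes wraps fell on whitespace, so fragments are joined with one space.
    A token that was split mid-word would gain a spurious space.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)          # a new record begins here
        else:
            records[-1] += " " + line     # continuation of the previous record
    return records


def message(record):
    """Drop the timestamp, host, and unit tag, leaving only the log message."""
    rest = RECORD_START.sub("", record, count=1).lstrip()
    return UNIT_TAG.sub("", rest, count=1)


if __name__ == "__main__":
    sample = [
        "Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech ",
        "connect-mirror-maker.sh[2526630]: ",
        "org.apache.kafka.connect.errors.RetriableException: Timeout while loading ",
        "consumer groups.",
        "Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech ",
        "connect-mirror-maker.sh[2526630]:         at ",
        "org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)",
    ]
    for rec in unwrap(sample):
        print(message(rec))
```

Feeding the full excerpt through `unwrap` should yield one line per log record, with the exception message and each stack frame on its own line.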
Steps to Reproduce:
1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster C.
3. Start MirrorMaker 2.
4. Observe the logs for the MirrorCheckpointConnector.

What We Tried:
{*}Checked ACLs and Authentication{*}:
 - Ensured that the mirror_maker user has the necessary permissions and can 
authenticate successfully.
 - Verified that we could list consumer groups using kafka-consumer-groups.sh 
with the mirror_maker user.

{*}Increased Timeouts{*}:
 - Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
 - Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.

{*}Enabled Detailed Logging{*}:
 - Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain 
more insights.
 - No additional information that could help resolve the issue was found.

{*}Temporary Workarounds{*}:
 - Set emit.checkpoints.enabled and sync.group.offsets.enabled to false to 
prevent the MirrorCheckpointConnector from running.
 - This is not a viable long-term solution as we need to synchronize consumer 
group offsets.

Resolution:
Rolled Back to Kafka 3.8.1:
 - As a test, we downgraded our Kafka clusters back to version 3.8.1.
 - After the downgrade, the error disappeared, and the 
MirrorCheckpointConnector functioned correctly.
 - This suggests that the issue was introduced in version 3.9.0.

Analysis:
Possible Relation to KAFKA-17232:
 - We found the JIRA issue KAFKA-17232 titled “MirrorCheckpointConnector does 
not generate task configs if initial consumer group load times out.”
 - It appears that changes introduced in Kafka 3.9.0 related to this issue may 
have inadvertently caused our problem.
 - However, our clusters are not particularly large, and the initial consumer 
group load should not exceed the timeouts.

Request:
{*}Assistance in Resolving the Issue{*}:
 - Is there a known workaround or configuration change that can prevent this 
error in Kafka 3.9.0?
 - Could the changes made in KAFKA-17232 have unintentionally caused this 
problem?
 - Are there plans to address this issue in an upcoming release?

{*}Guidance on Next Steps{*}:
 - Should we avoid upgrading to versions beyond 3.8.1 until this issue is 
resolved?
 - Is it advisable to apply any patches or pull requests manually?

Thank you for your attention to this matter. Please let me know if I can 
provide any additional information to help resolve this issue.

Best regards,
Asker Kakhramanov



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
