[ https://issues.apache.org/jira/browse/KAFKA-18007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946998#comment-17946998 ]
Mickael Maison commented on KAFKA-18007:
----------------------------------------

I had a deployment hitting this same error. It was running MirrorMaker in dedicated mode (started via connect-mirror-maker.sh), mirroring data from A to B. The B to A flow was disabled (B->A.enabled = false), but an instance of MirrorCheckpointConnector with client ID B->A was still running and throwing this exception. The trick to shut it down was to also explicitly disable the heartbeat connector for B->A by setting B->A.emit.heartbeats.enabled=false in the configuration.

This happened because emit.heartbeats.enabled is true by default, and in an A->B flow the heartbeats connector actually produces data to A (it's inverted from the other MirrorMaker connectors). This caused MirrorCheckpointConnector to try running in the B->A flow, and because that flow is disabled it failed and logged the exception.

This behavior is described in https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorMakerConfig.java#L126-L130 and you can see in this method that a flow is still created even if it's explicitly disabled, unless the heartbeats connector is also explicitly disabled.
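To make the rule concrete, here is a minimal sketch of the decision as described above: a flow is created if it is enabled or if heartbeats are enabled for it. This is not the actual MirrorMakerConfig code. The class name FlowSketch and the method createdFlows are hypothetical, and two details are assumptions: a missing x->y.enabled is treated as false, and a global emit.heartbeats.enabled is used as the fallback for the per-flow setting.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the real MirrorMakerConfig implementation.
class FlowSketch {

    // Returns the "source->target" flows that would be created for the given
    // cluster aliases and configuration properties.
    static List<String> createdFlows(List<String> clusters, Map<String, String> props) {
        // emit.heartbeats.enabled is true by default (stated in the comment above);
        // using the un-prefixed key as a global fallback is an assumption.
        String globalHeartbeats = props.getOrDefault("emit.heartbeats.enabled", "true");

        List<String> flows = new ArrayList<>();
        for (String source : clusters) {
            for (String target : clusters) {
                if (source.equals(target)) {
                    continue;
                }
                String prefix = source + "->" + target + ".";
                // Assumption: a missing "x->y.enabled" means the flow is not enabled.
                boolean enabled = Boolean.parseBoolean(props.get(prefix + "enabled"));
                boolean heartbeats = Boolean.parseBoolean(
                        props.getOrDefault(prefix + "emit.heartbeats.enabled", globalHeartbeats));
                // The flow exists if either switch is on, so "enabled = false"
                // alone does not remove it.
                if (enabled || heartbeats) {
                    flows.add(source + "->" + target);
                }
            }
        }
        return flows;
    }

    public static void main(String[] args) {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("A->B.enabled", "true");
        props.put("B->A.enabled", "false");
        // B->A is still created because heartbeats default to true.
        System.out.println(createdFlows(List.of("A", "B"), props)); // prints [A->B, B->A]

        props.put("B->A.emit.heartbeats.enabled", "false");
        // With heartbeats also disabled for B->A, only A->B remains.
        System.out.println(createdFlows(List.of("A", "B"), props)); // prints [A->B]
    }
}
```

Under these assumptions, the second call mirrors the workaround: only once both B->A.enabled and B->A.emit.heartbeats.enabled are false does the B->A flow disappear.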

> MirrorCheckpointConnector fails with “Timeout while loading consumer groups” after upgrading to Kafka 3.9.0
> -----------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-18007
> URL: https://issues.apache.org/jira/browse/KAFKA-18007
> Project: Kafka
> Issue Type: Bug
> Components: mirrormaker
> Affects Versions: 3.9.0
> Environment:
> - Kafka Version: Upgraded sequentially from 3.6.0 to 3.9.0
> - Clusters: Three clusters named A, B, and C
> - Clusters A and B mirror topics to cluster C using MirrorMaker 2
> - Number of Consumer Groups: Approximately 200
> - Number of Topics: Approximately 2000
> - Operating System: Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-135-generic x86_64)
> Reporter: Asker
> Priority: Major
>
> After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started experiencing repeated errors with the MirrorCheckpointConnector in MirrorMaker 2. The connector fails with a RetriableException stating “Timeout while loading consumer groups.” This issue persists despite several attempts to resolve it.
>
> Error Message:
> {code:bash}
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
>
> Steps to Reproduce:
> 1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
> 2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster C.
> 3. Start MirrorMaker 2.
> 4. Observe the logs for the MirrorCheckpointConnector.
>
> What We Tried:
> {*}Checked ACLs and Authentication{*}:
> - Ensured that the mirror_maker user has the necessary permissions and can authenticate successfully.
> - Verified that we could list consumer groups using kafka-consumer-groups.sh with the mirror_maker user.
> {*}Increased Timeouts{*}:
> - Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
> - Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.
> {*}Enabled Detailed Logging{*}:
> - Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain more insights.
> - No additional information that could help resolve the issue was found.
> {*}Temporary Workarounds{*}:
> - Disabled emit.checkpoints.enabled and sync.group.offsets.enabled to prevent the MirrorCheckpointConnector from running.
> - This is not a viable long-term solution as we need to synchronize consumer group offsets.
>
> Resolution:
> Rolled Back to Kafka 3.8.1:
> - As a test, we downgraded our Kafka clusters back to version 3.8.1.
> - After the downgrade, the error disappeared, and the MirrorCheckpointConnector functioned correctly.
> - This suggests that the issue was introduced in version 3.9.0.
>
> Analysis:
> Possible Relation to KAFKA-17232:
> - We found the JIRA issue KAFKA-17232 titled “MirrorCheckpointConnector does not generate task configs if initial consumer group load times out.”
> - It appears that changes introduced in Kafka 3.9.0 related to this issue may have inadvertently caused our problem.
> - However, our clusters are not particularly large, and the initial consumer group load should not exceed the timeouts.
>
> Request:
> {*}Assistance in Resolving the Issue{*}:
> - Is there a known workaround or configuration change that can prevent this error in Kafka 3.9.0?
> - Could the changes made in KAFKA-17232 have unintentionally caused this problem?
> - Are there plans to address this issue in an upcoming release?
> *Guidance on Next Steps*:
> - Should we avoid upgrading to versions beyond 3.8.1 until this issue is resolved?
> - Is it advisable to apply any patches or pull requests manually?
>
> Thank you for your attention to this matter. Please let me know if I can provide any additional information to help resolve this issue.
> Best regards,
> Asker Kakhramanov

--
This message was sent by Atlassian Jira
(v8.20.10#820010)