[ 
https://issues.apache.org/jira/browse/KAFKA-18007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17946998#comment-17946998
 ] 

Mickael Maison commented on KAFKA-18007:
----------------------------------------

I had a deployment hitting this same error. It was running MirrorMaker in 
dedicated mode (started via connect-mirror-maker.sh), mirroring data from A to 
B.
The B to A flow was disabled (B->A.enabled = false), but an instance of 
MirrorCheckpointConnector with client ID B->A was still running and throwing 
this exception. The way to shut it down was to also explicitly disable the 
heartbeat connector for B->A by setting B->A.emit.heartbeats.enabled=false in 
the configuration.
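
As a rough illustration, the relevant part of the properties file passed to 
connect-mirror-maker.sh might look like the sketch below. The cluster aliases A 
and B follow the description above; the bootstrap addresses are placeholders 
and are not taken from the actual deployment.

```properties
# Hypothetical cluster aliases and bootstrap addresses; substitute your own.
clusters = A, B
A.bootstrap.servers = a-broker:9092
B.bootstrap.servers = b-broker:9092

# Mirror data from A to B only.
A->B.enabled = true
B->A.enabled = false

# emit.heartbeats.enabled defaults to true, which keeps the B->A flow alive
# even though B->A.enabled is false. Disable it explicitly for that flow.
B->A.emit.heartbeats.enabled = false
```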

This happened because emit.heartbeats.enabled is true by default, and in an 
A->B flow the heartbeats connector actually produces data to A (the reverse of 
the other MirrorMaker connectors). As a result, MirrorCheckpointConnector tried 
to run in the B->A flow, and because that flow is disabled, it failed and 
logged the exception.

This behavior is described in 
https://github.com/apache/kafka/blob/trunk/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorMakerConfig.java#L126-L130
That method shows that a flow is still created, even when it is explicitly 
disabled, unless the heartbeats connector is also explicitly disabled.

 

> MirrorCheckpointConnector fails with “Timeout while loading consumer groups” 
> after upgrading to Kafka 3.9.0
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-18007
>                 URL: https://issues.apache.org/jira/browse/KAFKA-18007
>             Project: Kafka
>          Issue Type: Bug
>          Components: mirrormaker
>    Affects Versions: 3.9.0
>         Environment: - Kafka Version: Upgraded sequentially from 3.6.0 to 
> 3.9.0
> - Clusters: Three clusters named A, B, and C
> - Clusters A and B mirror topics to cluster C using MirrorMaker 2
> - Number of Consumer Groups: Approximately 200
> - Number of Topics: Approximately 2000
> - Operating System: Ubuntu 20.04.5 LTS (GNU/Linux 5.4.0-135-generic x86_64)
>            Reporter: Asker
>            Priority: Major
>
> After upgrading our Kafka clusters from version 3.6.0 to 3.9.0, we started 
> experiencing repeated errors with the MirrorCheckpointConnector in 
> MirrorMaker 2. The connector fails with a RetriableException stating “Timeout 
> while loading consumer groups.” This issue persists despite several attempts 
> to resolve it.
> Error Message:
> {code:bash}
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: [2024-11-11 12:21:57,342] ERROR [Worker clientId=analytics-dev->app-dev, groupId=analytics-dev-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]: org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.Worker.connectorTaskConfigs(Worker.java:398)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnector(DistributedHerder.java:2243)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.reconfigureConnectorTasksWithExponentialBackoffRetries(DistributedHerder.java:2183)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.lambda$null$47(DistributedHerder.java:2199)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.runRequest(DistributedHerder.java:2402)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.tick(DistributedHerder.java:498)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:383)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
> Nov 11 12:21:57 kafka-analytics-2a.dev.vm.tech connect-mirror-maker.sh[2526630]:         at java.base/java.lang.Thread.run(Thread.java:840)
> {code}
> Steps to Reproduce:
> 1. Upgrade Kafka clusters sequentially from 3.6.0 to 3.9.0.
> 2. Configure MirrorMaker 2 to mirror topics from clusters A and B to cluster 
> C.
> 3. Start MirrorMaker 2.
> 4. Observe the logs for the MirrorCheckpointConnector.
> What We Tried:
> *Checked ACLs and Authentication*:
>  - Ensured that the mirror_maker user has the necessary permissions and can 
> authenticate successfully.
>  - Verified that we could list consumer groups using kafka-consumer-groups.sh 
> with the mirror_maker user.
> *Increased Timeouts*:
>  - Increased admin.timeout.ms to 300000 (5 minutes) and even higher values.
>  - Adjusted admin.request.timeout.ms and admin.retry.backoff.ms accordingly.
> *Enabled Detailed Logging*:
>  - Set the logging level to DEBUG for org.apache.kafka.connect.mirror to gain 
> more insights.
>  - No additional information that could help resolve the issue was found.
> *Temporary Workarounds*:
>  - Disabled emit.checkpoints.enabled and sync.group.offsets.enabled to 
> prevent the MirrorCheckpointConnector from running.
>  - This is not a viable long-term solution as we need to synchronize consumer 
> group offsets.
> Resolution:
> Rolled Back to Kafka 3.8.1:
>  - As a test, we downgraded our Kafka clusters back to version 3.8.1.
>  - After the downgrade, the error disappeared, and the 
> MirrorCheckpointConnector functioned correctly.
>  - This suggests that the issue was introduced in version 3.9.0.
> Analysis:
> Possible Relation to KAFKA-17232:
>  - We found the JIRA issue KAFKA-17232 titled “MirrorCheckpointConnector does 
> not generate task configs if initial consumer group load times out.”
>  - It appears that changes introduced in Kafka 3.9.0 related to this issue 
> may have inadvertently caused our problem.
>  - However, our clusters are not particularly large, and the initial consumer 
> group load should not exceed the timeouts.
> Request:
> *Assistance in Resolving the Issue*:
>  - Is there a known workaround or configuration change that can prevent this 
> error in Kafka 3.9.0?
>  - Could the changes made in KAFKA-17232 have unintentionally caused this 
> problem?
>  - Are there plans to address this issue in an upcoming release?
> *Guidance on Next Steps*:
>  - Should we avoid upgrading to versions beyond 3.8.1 until this issue is 
> resolved?
>  - Is it advisable to apply any patches or pull requests manually?
> Thank you for your attention to this matter. Please let me know if I can 
> provide any additional information to help resolve this issue.
> Best regards,
> Asker Kakhramanov



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
