[ 
https://issues.apache.org/jira/browse/KAFKA-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17897590#comment-17897590
 ] 

Asker commented on KAFKA-17232:
-------------------------------

Hello,
I believe I'm experiencing the issue described in KAFKA-17232 with the 
MirrorCheckpointConnector.

*Environment*:
 - *Kafka version:* Upgraded sequentially from 3.6.0 to 3.9.0
 - *Clusters:* We have three clusters (A, B, C). Clusters A and B mirror 
topics to cluster C using MirrorMaker 2.

*Problem*:
After upgrading to Kafka 3.9.0, we're seeing the following error repeatedly in 
our logs:
{code:bash}
[2024-11-12 16:22:21,230] ERROR [Worker clientId=A->C, groupId=A-mm2] Failed to reconfigure connector's tasks (MirrorCheckpointConnector), retrying after backoff. (org.apache.kafka.connect.runtime.distributed.DistributedHerder:2195)
org.apache.kafka.connect.errors.RetriableException: Timeout while loading consumer groups.
at org.apache.kafka.connect.mirror.MirrorCheckpointConnector.taskConfigs(MirrorCheckpointConnector.java:138)
…{code}
This error seems to align with the issue described in KAFKA-17232, where 
`knownConsumerGroups` remains `null` due to a timeout in 
`loadInitialConsumerGroups()`, causing the connector to fail to generate task 
configurations.
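
To make the suspected failure mode concrete, here is a minimal, self-contained model of it. It is a sketch, not the actual MirrorCheckpointConnector source: the class and method names are hypothetical, the exception type is a local stand-in for org.apache.kafka.connect.errors.RetriableException, and it assumes the exception in the log is raised whenever the group set is still unset at the time task configs are requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified model of the failure mode; NOT the actual connector source.
// All class, field, and method names here are hypothetical stand-ins.
class NullGroupsSketch {

    // Local stand-in for org.apache.kafka.connect.errors.RetriableException.
    static class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    // Remains null until the initial consumer group load has succeeded.
    private Set<String> knownConsumerGroups = null;

    // Models the initial load finishing successfully.
    void initialLoadSucceeded(Set<String> groups) {
        knownConsumerGroups = new TreeSet<>(groups);
    }

    // Models task config generation: refuse while the initial load is missing,
    // otherwise spread the groups round-robin over at most maxTasks tasks.
    List<String> taskConfigs(int maxTasks) {
        if (knownConsumerGroups == null) {
            throw new RetriableException("Timeout while loading consumer groups.");
        }
        int numTasks = Math.min(maxTasks, knownConsumerGroups.size());
        List<String> configs = new ArrayList<>();
        if (numTasks <= 0) {
            return configs;
        }
        List<List<String>> buckets = new ArrayList<>();
        for (int t = 0; t < numTasks; t++) {
            buckets.add(new ArrayList<>());
        }
        int next = 0;
        for (String group : knownConsumerGroups) {
            buckets.get(next % numTasks).add(group);
            next++;
        }
        for (List<String> bucket : buckets) {
            configs.add(String.join(",", bucket));
        }
        return configs;
    }

    public static void main(String[] args) {
        NullGroupsSketch connector = new NullGroupsSketch();
        try {
            connector.taskConfigs(2); // initial load has not completed yet
        } catch (RetriableException e) {
            System.out.println("would retry after backoff: " + e.getMessage());
        }
        connector.initialLoadSucceeded(Set.of("group-a", "group-b", "group-c"));
        System.out.println(connector.taskConfigs(2));
    }
}
```

In this model the retry loop can only end once the initial load succeeds at least once, which would match the repeated error in our logs if the load keeps timing out.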

*What we've tried*:
- Increased admin.timeout.ms: Set it to a higher value (e.g., 300000), but the 
error persists.
- Adjusted admin.request.timeout.ms and admin.retry.backoff.ms: Also increased 
these values without success.
- Limited consumer groups: Since we have only about 20 consumer groups, we 
believe the initial load should not take too long.
- Checked ACLs and authentication: Ensured that the `mirror_maker` user has the 
necessary permissions and can authenticate successfully.
- Enabled detailed logging: No additional insights were gained.
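
For readers following along, the three settings named above might appear in a dedicated-mode MirrorMaker 2 properties file roughly as follows. This is only a sketch: the `A->C.` flow prefix is assumed from the clientId in the log, and every value except 300000 is a placeholder rather than the value actually used.

```properties
# Illustrative fragment only; key placement and most values are assumptions.
# Timeout raised as described above (value taken from the list).
A->C.admin.timeout.ms = 300000
# Placeholder values: the exact numbers tried are not stated above.
A->C.admin.request.timeout.ms = 120000
A->C.admin.retry.backoff.ms = 1000
```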

*Questions*:
- Is there a known workaround for this issue until a fix is released?

Any guidance or assistance would be greatly appreciated, as this issue is 
impacting our ability to synchronize consumer group offsets using MirrorMaker 2.

Thank you!

Best regards,
Asker Kakhramanov

> MirrorCheckpointConnector does not generate task configs if initial consumer 
> group load times out
> -------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-17232
>                 URL: https://issues.apache.org/jira/browse/KAFKA-17232
>             Project: Kafka
>          Issue Type: Bug
>          Components: mirrormaker
>    Affects Versions: 3.9.0
>            Reporter: Greg Harris
>            Assignee: TengYao Chi
>            Priority: Major
>             Fix For: 3.9.0
>
>
> The MirrorCheckpointConnector has two operations that read the source 
> consumer groups:
>  * loadInitialConsumerGroups
>  * refreshConsumerGroups
> loadInitialConsumerGroups blocks the start() method of the connector, while 
> refreshConsumerGroups is asynchronous and runs periodically while the 
> connector is running.
> loadInitialConsumerGroups may take a long time to execute, and may exceed the 
> configured "admin.timeout.ms" used by the Scheduler. This timeout is logged 
> and the start() method returns normally. If this happens, the framework will 
> generate task configs immediately after start(), before 
> loadInitialConsumerGroups can finish, and will generate an empty set of task 
> configs: 
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L118-L121].
> Later, when loadInitialConsumerGroups completes, it will not request task 
> reconfiguration, believing it is the initial load operation.
> Later still, when refreshConsumerGroups completes, it will not request task 
> reconfiguration, as the set of consumer groups has not changed since the 
> initial load: 
> [https://github.com/apache/kafka/blob/e2494e6ffb89f8288ed2aeb9b5596c755210bffd/connect/mirror/src/main/java/org/apache/kafka/connect/mirror/MirrorCheckpointConnector.java#L173-L180]
>  
> This leads to a situation where the MirrorCheckpointConnector believes it has 
> converged with nothing to update, but actually has consumer groups that are 
> not allocated to tasks.
> This happens particularly for large, stable Kafka clusters with many consumer 
> groups that are not being actively created or deleted.
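
The sequence in the quoted description can be modelled with a small, self-contained sketch. It is a simplified illustration under stated assumptions, not the connector's real code: every class, field, and method name is hypothetical, and the timeout is represented simply by asking for task configs before the initial load has stored its result.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Simplified model of the sequence in the issue description; NOT the actual
// connector source. All names are hypothetical stand-ins.
class StaleTaskConfigSketch {

    // null until a load has completed.
    private Set<String> knownConsumerGroups = null;
    int groupsAssignedToTasks = 0;
    int reconfigurationRequests = 0;

    // The framework asks for task configs right after start() returns.
    void generateTaskConfigs() {
        Set<String> groups = knownConsumerGroups == null
                ? Collections.<String>emptySet()
                : knownConsumerGroups;
        groupsAssignedToTasks = groups.size();
    }

    // The late initial load stores its result but does not request
    // reconfiguration, because it treats itself as the initial load.
    void completeInitialLoad(Set<String> found) {
        knownConsumerGroups = new HashSet<>(found);
    }

    // The periodic refresh requests reconfiguration only if the set changed.
    void refresh(Set<String> found) {
        if (!found.equals(knownConsumerGroups)) {
            knownConsumerGroups = new HashSet<>(found);
            reconfigurationRequests++;
        }
    }

    int knownGroupCount() {
        return knownConsumerGroups == null ? 0 : knownConsumerGroups.size();
    }

    public static void main(String[] args) {
        Set<String> onSource = Set.of("g1", "g2", "g3");
        StaleTaskConfigSketch connector = new StaleTaskConfigSketch();
        connector.generateTaskConfigs();          // start() returned after the timeout
        connector.completeInitialLoad(onSource);  // load finishes late
        connector.refresh(onSource);              // nothing changed since the load
        System.out.println("known groups: " + connector.knownGroupCount());
        System.out.println("groups assigned to tasks: " + connector.groupsAssignedToTasks);
        System.out.println("reconfiguration requests: " + connector.reconfigurationRequests);
    }
}
```

In this model the connector knows about every group yet never hands any of them to a task, and only a later change in the group set (a group created or deleted) would trigger reconfiguration, which is consistent with the note that stable clusters are the ones affected.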



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
