George Yang created KAFKA-18386:
-----------------------------------

             Summary: Mirror Maker2 Pod CrashLoopBackoff When one DC is powered off
                 Key: KAFKA-18386
                 URL: https://issues.apache.org/jira/browse/KAFKA-18386
             Project: Kafka
          Issue Type: Bug
          Components: mirrormaker
    Affects Versions: 3.7.1
            Reporter: George Yang

When MirrorMaker v3.7.1 is deployed on Kubernetes with one Kafka node in each data center (DC1 and DC2) and DC1 is powered off, the MirrorMaker pod in DC2 goes into CrashLoopBackOff. This issue is different from the one described in KAFKA-17784.

The log from the failing pod is below:

```log
[2025-01-01 08:05:53,432] WARN [AdminClient clientId=dc64->dc88] Connection to node -1 (/192.168.2.88:13399) could not be established. Node may not be available. (org.apache.kafka.clients.NetworkClient:830)[kafka-admin-client-thread | dc64->dc88]
[2025-01-01 08:05:55,652] INFO [AdminClient clientId=dc64->dc88] Metadata update failed (org.apache.kafka.clients.admin.internals.AdminMetadataManager:267)[kafka-admin-client-thread | dc64->dc88]
org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata
[2025-01-01 08:05:55,653] INFO App info kafka.admin.client for dc64->dc88 unregistered (org.apache.kafka.common.utils.AppInfoParser:88)[kafka-admin-client-thread | dc64->dc88]
[2025-01-01 08:05:55,653] INFO [AdminClient clientId=dc64->dc88] Metadata update failed (org.apache.kafka.clients.admin.internals.AdminMetadataManager:267)[kafka-admin-client-thread | dc64->dc88]
org.apache.kafka.common.errors.TimeoutException: Timed out waiting to send the call. Call: fetchMetadata
[2025-01-01 08:05:55,653] INFO [AdminClient clientId=dc64->dc88] Timed out 1 remaining operation(s) during close. (org.apache.kafka.clients.admin.KafkaAdminClient:1450)[kafka-admin-client-thread | dc64->dc88]
[2025-01-01 08:05:55,657] INFO Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics:684)[kafka-admin-client-thread | dc64->dc88]
[2025-01-01 08:05:55,658] INFO Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics:688)[kafka-admin-client-thread | dc64->dc88]
[2025-01-01 08:05:55,658] INFO Metrics reporters closed (org.apache.kafka.common.metrics.Metrics:694)[kafka-admin-client-thread | dc64->dc88]
[2025-01-01 08:05:55,658] ERROR Stopping due to error (org.apache.kafka.connect.mirror.MirrorMaker:360)[main]
org.apache.kafka.connect.errors.ConnectException: Failed to connect to and describe Kafka cluster. Check worker's broker connection and security properties.
	at org.apache.kafka.connect.runtime.WorkerConfig.lookupKafkaClusterId(WorkerConfig.java:305)
	at org.apache.kafka.connect.runtime.WorkerConfig.lookupKafkaClusterId(WorkerConfig.java:285)
	at org.apache.kafka.connect.runtime.WorkerConfig.kafkaClusterId(WorkerConfig.java:415)
	at org.apache.kafka.connect.mirror.MirrorMaker.addHerder(MirrorMaker.java:252)
	at java.base/java.lang.Iterable.forEach(Unknown Source)
	at org.apache.kafka.connect.mirror.MirrorMaker.<init>(MirrorMaker.java:158)
	at org.apache.kafka.connect.mirror.MirrorMaker.<init>(MirrorMaker.java:170)
	at org.apache.kafka.connect.mirror.MirrorMaker.<init>(MirrorMaker.java:174)
	at org.apache.kafka.connect.mirror.MirrorMaker.main(MirrorMaker.java:347)
Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: listNodes
	at java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
	at java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)
	at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:165)
	at org.apache.kafka.connect.runtime.WorkerConfig.lookupKafkaClusterId(WorkerConfig.java:299)
	... 8 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node assignment. Call: listNodes
[2025-01-01 08:05:55,687] INFO Stopped http_8083@6705fb02{HTTP/1.1, (http/1.1)}{0.0.0.0:8083} (org.eclipse.jetty.server.AbstractConnector:383)[JettyShutdownThread]
```

The MirrorMaker configuration is:

```
clusters = dc64, dc88
dc64.bootstrap.servers = 192.168.2.64:13399
dc88.bootstrap.servers = 192.168.2.88:13399
dc64->dc88.enabled = true
dc64->dc88.topics = .*
dc88->dc64.enabled = true
dc88->dc64.topics = .*
replication.factor=1
tasks.max=6
emit.checkpoints.interval.seconds=5
dc64.producer.acks=all
dc64.producer.batch.size=50000
dc64.consumer.auto.offset.reset=latest
dc88.consumer.auto.offset.reset=latest
dc64.consumer.max.poll.interval.ms=20000
dc88.consumer.max.poll.interval.ms=20000
refresh.topics.enabled=true
refresh.topics.interval.seconds=5
refresh.groups.enabled=true
refresh.groups.interval.seconds=5
dedicated.mode.enable.internal.rest = true
dc64.scheduled.rebalance.max.delay.ms=20000
dc88.scheduled.rebalance.max.delay.ms=20000
checkpoints.topic.replication.factor=1
heartbeats.topic.replication.factor=1
offset-syncs.topic.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
config.storage.replication.factor=1
```
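
For anyone reproducing this, it may help to confirm which cluster is unreachable from the pod before MirrorMaker starts. The stack trace shows startup aborting in `WorkerConfig.lookupKafkaClusterId` when the cluster cannot be described. The following is only a minimal sketch, not part of Kafka or MirrorMaker: it assumes a plain TCP connect is an adequate stand-in for "bootstrap server reachable", and the helper names `parse_bootstrap_servers` and `is_reachable` are hypothetical.

```python
import socket

SUFFIX = ".bootstrap.servers"


def parse_bootstrap_servers(props_text):
    """Map each cluster alias to its list of (host, port) bootstrap servers.

    Only '<alias>.bootstrap.servers = host:port[,host:port]' lines are read;
    every other property is ignored.
    """
    clusters = {}
    for raw in props_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if not key.endswith(SUFFIX):
            continue
        alias = key[: -len(SUFFIX)]
        servers = []
        for entry in value.split(","):
            host, _, port = entry.strip().rpartition(":")
            servers.append((host, int(port)))
        clusters[alias] = servers
    return clusters


def is_reachable(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s.

    This checks network reachability only; it does not prove the broker
    is healthy or that Kafka is answering requests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Bootstrap entries copied from the configuration in this report.
    props = """
    clusters = dc64, dc88
    dc64.bootstrap.servers = 192.168.2.64:13399
    dc88.bootstrap.servers = 192.168.2.88:13399
    """
    for alias, servers in parse_bootstrap_servers(props).items():
        up = any(is_reachable(host, port) for host, port in servers)
        print(f"{alias}: {'reachable' if up else 'UNREACHABLE'}")
```

Run from inside the surviving data center, this should report one cluster as reachable and the other as unreachable, which matches the `Connection to node -1 (/192.168.2.88:13399) could not be established` warning in the log.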