Viktor Somogyi-Vass created KAFKA-15161:
-------------------------------------------
             Summary: InvalidReplicationFactorException at connect startup
                 Key: KAFKA-15161
                 URL: https://issues.apache.org/jira/browse/KAFKA-15161
             Project: Kafka
          Issue Type: Improvement
          Components: clients, KafkaConnect
    Affects Versions: 3.6.0
            Reporter: Viktor Somogyi-Vass


h2. Problem description

In our system test environment, Connect may in certain cases fail to start up due to a very specific timing issue in the start/restart of the Kafka cluster and Connect. If a consumer in Connect starts up and asks for topic metadata while the broker doesn't have metadata yet, the broker returns the following exception and Connect fails:
{noformat}
[2023-07-07 13:56:47,994] ERROR [Worker clientId=connect-1, groupId=connect-cluster] Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
org.apache.kafka.common.KafkaException: Unexpected error fetching metadata for topic connect-offsets
	at org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:130)
	at org.apache.kafka.clients.consumer.internals.TopicMetadataFetcher.getTopicMetadata(TopicMetadataFetcher.java:66)
	at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:2001)
	at org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1969)
	at org.apache.kafka.connect.util.KafkaBasedLog.start(KafkaBasedLog.java:251)
	at org.apache.kafka.connect.storage.KafkaOffsetBackingStore.start(KafkaOffsetBackingStore.java:242)
	at org.apache.kafka.connect.runtime.Worker.start(Worker.java:230)
	at org.apache.kafka.connect.runtime.AbstractHerder.startServices(AbstractHerder.java:151)
	at org.apache.kafka.connect.runtime.distributed.DistributedHerder.run(DistributedHerder.java:363)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor is below 1 or larger than the number of available brokers.
{noformat}
Due to this error the Connect node stops and has to be manually restarted (and of course it fails the test scenarios as well).

h2. Reproduction

In my test scenario I had:
- 1 broker
- 1 distributed Connect node
- a patch applied on the broker to make sure it doesn't have metadata

Steps to reproduce:
# start up a ZooKeeper-based broker without the patch
# put a breakpoint here: https://github.com/apache/kafka/blob/1d8b07ed6435568d3daf514c2d902107436d2ac8/clients/src/main/java/org/apache/kafka/clients/consumer/internals/TopicMetadataFetcher.java#L94
# start up a distributed Connect node
# restart the Kafka broker with the patch to make sure there is no metadata
# once the broker is started, release the debugger in Connect

Connect should run into the error cited above and shut down. This is not desirable: the Connect cluster should retry to ensure its continuous operation, or the broker should handle this case differently, for instance by returning a RetriableException. The earliest version I've tried this on is 2.8, but I think earlier (and later) versions are affected as well.
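For illustration only, below is a minimal sketch of the retry behaviour suggested above. It is not the actual Connect code; the class and method names ({{PartitionMetadataWaiter}}, {{waitForPartitions}}) and the timeout/backoff parameters are made up. The idea is that the caller polls {{Consumer#partitionsFor()}} and treats a {{KafkaException}} caused by {{InvalidReplicationFactorException}} as a transient "broker has no metadata yet" condition instead of letting it kill the herder thread.
{code:java}
import java.time.Duration;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.errors.InvalidReplicationFactorException;

// Hypothetical helper, only to illustrate the retry idea from this ticket.
public final class PartitionMetadataWaiter {

    /**
     * Polls partitionsFor(topic) until the broker returns usable metadata or the
     * deadline expires. A KafkaException caused by InvalidReplicationFactorException
     * (the startup race described above) is swallowed and retried; everything else
     * is rethrown so real failures still surface.
     */
    public static List<PartitionInfo> waitForPartitions(Consumer<?, ?> consumer,
                                                        String topic,
                                                        Duration timeout,
                                                        Duration backoff) throws InterruptedException {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            try {
                List<PartitionInfo> partitions = consumer.partitionsFor(topic);
                if (partitions != null && !partitions.isEmpty())
                    return partitions;
            } catch (KafkaException e) {
                if (!(e.getCause() instanceof InvalidReplicationFactorException))
                    throw e;
            }
            if (System.nanoTime() >= deadline)
                throw new KafkaException("Timed out waiting for metadata of topic " + topic);
            TimeUnit.MILLISECONDS.sleep(backoff.toMillis());
        }
    }

    private PartitionMetadataWaiter() {
    }
}
{code}
A KafkaBasedLog-style caller could then use something like {{waitForPartitions(consumer, "connect-offsets", Duration.ofMinutes(5), Duration.ofSeconds(1))}} at startup instead of a single {{partitionsFor()}} call.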