Hello, I am trying to set up a small Kafka cluster on VMs before promoting it to dedicated hardware in a cloud environment. I'm running into an issue when starting Connect in distributed mode. My understanding is that starting Connect in distributed mode will cause it to create a REST server, which is then used to add connectors. It will also attempt to automatically create topics for storing offset, configuration, and status information. The problem I am encountering is that shortly after startup (and after I've confirmed the REST server is responding), an exception is thrown in the "herder work thread" stating that the position for partition connect-offsets-0 could not be determined. The relevant logs I can find are below; the series starting with "Updated cluster metadata version..." repeats many times before this section.
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG Updated cluster metadata version 584 to Cluster(id = 9xi4YtlwQnyeAGC15jx5UA, nodes = [192.168.56.102:9092 (id: 0 rack: null)], parti>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Received FindCoordinator response ClientResponse(receivedTimeMs=153893051564>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Group coordinator lookup failed: The coordinator is not available. (org.apac>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,728] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,729] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition connect-offsets-0 could be determined

I can confirm the Connect instance is creating the offsets topic automatically with the correct replication and partition options.
It is not creating the config or status topics. I can also confirm that ZooKeeper and the Kafka broker are talking to each other happily: I can create topics from the command line and produce and consume messages using the command-line producers and consumers. I have not registered any connectors with Connect yet, as the instance dies after the timeout occurs. It doesn't appear to me that there are any problems with the connections anywhere, unless the herder process connects to the cluster differently than the main process and the problem is there; I don't know how I would go about diagnosing that, though.

The test cluster I'm creating consists of three VMs: one running ZooKeeper, one running Kafka and Kafka Connect, and one acting as a log sink for the first two. All programs are run as systemd services. I'm using the 2.0 Kafka distribution.

I'm out of ideas on where else to start looking for leads. There are a lot of configuration options, so if this is a configuration problem I'd appreciate any insight there; otherwise any and all help is appreciated.

Daniel Wilson-Thomas

Configuration and other output follows.

Output of kafkacat:

$ kafkacat -L -b 192.168.56.102:9092
Metadata for all topics (from broker 0: 192.168.56.102:9092/0):
 1 brokers:
  broker 0 at 192.168.56.102:9092
 2 topics:
  topic "a-new-topic" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "connect-offsets" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0

No special user config for Zookeeper.
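For completeness, the command-line checks mentioned above were along these lines. The topic name a-new-topic matches the kafkacat output; the ZooKeeper address 192.168.56.101:2181 is a stand-in for my actual ZooKeeper VM, and the scripts are the stock ones from the 2.0 distribution's bin/ directory:

```shell
# Create a test topic on the single broker
# (the 2.0 kafka-topics.sh still talks to ZooKeeper directly)
bin/kafka-topics.sh --zookeeper 192.168.56.101:2181 --create \
  --topic a-new-topic --partitions 1 --replication-factor 1

# Produce a message, then read it back through the broker
echo "hello" | bin/kafka-console-producer.sh \
  --broker-list 192.168.56.102:9092 --topic a-new-topic
bin/kafka-console-consumer.sh --bootstrap-server 192.168.56.102:9092 \
  --topic a-new-topic --from-beginning --max-messages 1
```

All of these complete normally, which is why I believe the basic produce/consume path to the broker is fine.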
Kafka user configuration: bumps up the ZooKeeper-related timeouts (I can probably get rid of those now) and configures the listeners for the dynamically assigned VM IP address:

listeners=PLAINTEXT://${ip}:9092
zookeeper.connection.timeout.ms=600000
zookeeper.session.timeout.ms=12000

Kafka JVM options:

"-server" "-Xmx512M" "-Xms128M" "-XX:+UseCompressedOops" "-XX:+UseParNewGC"
"-XX:+UseConcMarkSweepGC" "-XX:+CMSClassUnloadingEnabled" "-XX:+CMSScavengeBeforeRemark"
"-XX:+DisableExplicitGC" "-Djava.awt.headless=true" "-Djava.net.preferIPv4Stack=true"

Connect user configuration: sets the bootstrap servers to point to the dynamically assigned Kafka instance, sets the converters, and sets the topic information. There is only one broker, so all replication and partition values are 1.

bootstrap.servers = ${bootstrapServers}
key.converter = org.apache.kafka.connect.json.JsonConverter
value.converter = org.apache.kafka.connect.json.JsonConverter
group.id = connect-cluster
rest.port = 8080
config.storage.topic = connect-config
config.storage.replication.factor = 1
offset.storage.topic = connect-offsets
offset.storage.replication.factor = 1
offset.storage.partitions=1
status.storage.topic = connect-status
status.storage.replication.factor = 1
status.storage.partitions=1

Connect JVM options:

"-Xms128M" "-Xmx512M"

Full debug logs are available here: https://gist.github.com/RocketPuppy/5e0ee2fc6379325458d84fc875a90938. If you are familiar with Nix and NixOps, you can create an environment to replicate this by using the files here: https://gist.github.com/RocketPuppy/018fc5f60b05e557de0c37e0749232a5.
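For reference, when I say the REST server is responding, I mean the Connect REST API answers basic requests on the configured rest.port (8080). Roughly:

```shell
# Root endpoint reports the worker's version information
curl -s http://192.168.56.102:8080/

# Connector list; empty, since nothing has been registered yet
curl -s http://192.168.56.102:8080/connectors
```

Both return promptly right up until the herder thread dies.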