To anyone else struggling with this: it turned out to be caused by the default values of the `offsets.topic.replication.factor` and `transaction.state.log.replication.factor` broker properties, which both default to 3. Since a replication factor higher than the number of brokers in your cluster doesn't work, and I have a single-node cluster, I set these properties to 1. The Connect workers are staying up now.
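For reference, this is roughly the change in the broker's server.properties. These are standard Kafka broker properties; the values below assume a single-broker cluster (I've also included `transaction.state.log.min.isr`, which defaults to 2 and can cause a similar problem on one node):

```
# Both default to 3; with a single broker, creating the internal topics
# fails because the requested replication factor exceeds the broker count.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1

# Defaults to 2; the min ISR cannot exceed the replication factor on a
# single-node cluster.
transaction.state.log.min.isr=1
```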
On Sun, Oct 7, 2018 at 2:25 PM Daniel Wilson <dan...@mojotech.com> wrote:

> Hello,
>
> I am trying to set up a small Kafka cluster on VMs before promoting it to
> dedicated hardware in a cloud environment. I'm running into an issue when
> starting Connect in distributed mode. First, my understanding is that
> starting Connect in distributed mode will cause it to create a REST server
> which is then used to add connectors. Additionally, it will also attempt
> to automatically create topics for storing offset, configuration, and
> status information. The problem I am encountering is that shortly after
> starting, and after I've confirmed the REST server is responding, an
> exception is thrown in the "herder work thread" stating that the position
> for partition connect-offsets-0 could not be determined. These are some of
> the relevant logs I can find. The series starting with "Updated cluster
> metadata version..." repeats quite a lot before this section.
>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG Updated cluster metadata version 584 to Cluster(id = 9xi4YtlwQnyeAGC15jx5UA, nodes = [192.168.56.102:9092 (id: 0 rack: null)], parti>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Received FindCoordinator response ClientResponse(receivedTimeMs=153893051564>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Group coordinator lookup failed: The coordinator is not available. (org.apac>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,728] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,729] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition connect-offsets-0 could be determined
>
> I can confirm the Connect instance is creating the offsets topic
> automatically with the correct replication and partition options. It is
> not creating the config or status topics. I can also confirm that
> Zookeeper and the Kafka broker are talking to each other happily. I can
> create topics from the command line and produce and consume messages
> using the command-line producers and consumers. I have not registered any
> connectors with Connect yet, as the instance dies after the timeout
> occurs. It doesn't appear to me that there are any problems with the
> connections anywhere, unless the herder process connects to the cluster
> differently than the main process and the problem is there. I don't know
> how I would go about diagnosing that, though.
>
> The test cluster I'm creating consists of three VMs: one is running
> Zookeeper, one is running Kafka and Kafka Connect, and the other is
> acting as a log sink for the first two. All programs are being run as
> systemd services. I'm using the 2.0 Kafka distribution.
>
> I'm out of ideas on where else to start looking for leads. I'd appreciate
> any insight on that matter. There are a lot of configuration options, so
> if this is a configuration problem I'd appreciate any insight there as
> well. Otherwise, any and all other help is appreciated.
>
> Daniel Wilson-Thomas
>
> Configuration and other output follows:
>
> Output of kafkacat:
>
> $ kafkacat -L -b 192.168.56.102:9092
> Metadata for all topics (from broker 0: 192.168.56.102:9092/0):
>  1 brokers:
>   broker 0 at 192.168.56.102:9092
>  2 topics:
>   topic "a-new-topic" with 1 partitions:
>     partition 0, leader 0, replicas: 0, isrs: 0
>   topic "connect-offsets" with 1 partitions:
>     partition 0, leader 0, replicas: 0, isrs: 0
>
> No special user config for Zookeeper.
>
> Kafka user configuration:
> Bumps up the Zookeeper-related timeouts (I can probably get rid of those
> now) and configures the listeners for the dynamically assigned VM IP
> address.
>
> listeners=PLAINTEXT://${ip}:9092
> zookeeper.connection.timeout.ms=600000
> zookeeper.session.timeout.ms=12000
>
> Kafka JVM options:
> "-server"
> "-Xmx512M"
> "-Xms128M"
> "-XX:+UseCompressedOops"
> "-XX:+UseParNewGC"
> "-XX:+UseConcMarkSweepGC"
> "-XX:+CMSClassUnloadingEnabled"
> "-XX:+CMSScavengeBeforeRemark"
> "-XX:+DisableExplicitGC"
> "-Djava.awt.headless=true"
> "-Djava.net.preferIPv4Stack=true"
>
> Connect user configuration:
> Sets up bootstrap servers to point to the dynamically assigned Kafka
> instance. Sets converters. Sets topic information. Only one broker, so
> all replication and partition values are 1.
>
> bootstrap.servers = ${bootstrapServers}
> key.converter = org.apache.kafka.connect.json.JsonConverter
> value.converter = org.apache.kafka.connect.json.JsonConverter
> group.id = connect-cluster
> rest.port = 8080
> config.storage.topic = connect-config
> config.storage.replication.factor = 1
> offset.storage.topic = connect-offsets
> offset.storage.replication.factor = 1
> offset.storage.partitions=1
> status.storage.topic = connect-status
> status.storage.replication.factor = 1
> status.storage.partitions=1
>
> Connect JVM options:
> "-Xms128M"
> "-Xmx512M"
>
> Full debug logs are available here:
> https://gist.github.com/RocketPuppy/5e0ee2fc6379325458d84fc875a90938
>
> If you are familiar with Nix and NixOps, you can create an environment to
> replicate this by using the files here:
> https://gist.github.com/RocketPuppy/018fc5f60b05e557de0c37e0749232a5