Hello, I am trying to set up a small Kafka cluster on VMs before promoting it to dedicated hardware in a cloud environment. I'm running into an issue when starting Connect in distributed mode. My understanding is that starting Connect in distributed mode will cause it to create a REST server, which is then used to add connectors. It will also attempt to automatically create topics for storing offset, configuration, and status information. The problem I am encountering is that shortly after startup (and after I've confirmed the REST server is responding), an exception is thrown in the "herder work thread" stating that the position for partition connect-offsets-0 could not be determined. The relevant logs I can find are below; the series starting with "Updated cluster metadata version..." repeats many times before this section.
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG Updated cluster metadata version 584 to Cluster(id = 9xi4YtlwQnyeAGC15jx5UA, nodes = [192.168.56.102:9092 (id: 0 rack: null)], parti>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Received FindCoordinator response ClientResponse(receivedTimeMs=153893051564>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Group coordinator lookup failed: The coordinator is not available. (org.apac>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,728] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,729] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition connect-offsets-0 could be determined

I can confirm the Connect instance is creating the offsets topic automatically with the correct replication and partition options.
It is not creating the config or status topics. I can also confirm that ZooKeeper and the Kafka broker are talking to each other happily: I can create topics from the command line and produce and consume messages using the command-line producers and consumers. I have not registered any connectors with Connect yet, as the instance dies after the timeout occurs. It doesn't appear to me that there are any problems with the connections anywhere, unless the herder process connects to the cluster differently than the main process and the problem is there; I don't know how I would go about diagnosing that, though.

The test cluster I'm creating consists of three VMs: one running ZooKeeper, one running Kafka and Kafka Connect, and one acting as a log sink for the first two. All programs are run as systemd services. I'm using the 2.0 Kafka distribution.

I'm out of ideas on where else to start looking for leads. There are a lot of configuration options, so if this is a configuration problem I'd appreciate any insight there; otherwise any and all help is appreciated.

Daniel Wilson-Thomas

Configuration and other output follows.

Output of kafkacat:

$ kafkacat -L -b 192.168.56.102:9092
Metadata for all topics (from broker 0: 192.168.56.102:9092/0):
 1 brokers:
  broker 0 at 192.168.56.102:9092
 2 topics:
  topic "a-new-topic" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0
  topic "connect-offsets" with 1 partitions:
    partition 0, leader 0, replicas: 0, isrs: 0

No special user config for Zookeeper.
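For completeness, the command-line checks mentioned above were along these lines. The topic name a-new-topic matches the kafkacat output; the ZooKeeper address 192.168.56.101:2181 is a stand-in for my actual ZooKeeper VM, and the scripts are the stock ones from the 2.0 distribution's bin/ directory:

```shell
# Create a test topic on the single broker
# (the 2.0 kafka-topics.sh still talks to ZooKeeper directly)
bin/kafka-topics.sh --zookeeper 192.168.56.101:2181 --create \
  --topic a-new-topic --partitions 1 --replication-factor 1

# Produce a message, then read it back through the broker
echo "hello" | bin/kafka-console-producer.sh \
  --broker-list 192.168.56.102:9092 --topic a-new-topic
bin/kafka-console-consumer.sh --bootstrap-server 192.168.56.102:9092 \
  --topic a-new-topic --from-beginning --max-messages 1
```

All of these complete normally, which is why I believe the basic produce/consume path to the broker is fine.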
Kafka user configuration: bumps up the ZooKeeper-related timeouts (I can probably get rid of those now) and configures the listeners for the dynamically assigned VM IP address:

listeners=PLAINTEXT://${ip}:9092
zookeeper.connection.timeout.ms=600000
zookeeper.session.timeout.ms=12000

Kafka JVM options:

"-server" "-Xmx512M" "-Xms128M" "-XX:+UseCompressedOops" "-XX:+UseParNewGC"
"-XX:+UseConcMarkSweepGC" "-XX:+CMSClassUnloadingEnabled" "-XX:+CMSScavengeBeforeRemark"
"-XX:+DisableExplicitGC" "-Djava.awt.headless=true" "-Djava.net.preferIPv4Stack=true"

Connect user configuration: sets the bootstrap servers to point to the dynamically assigned Kafka instance, sets the converters, and sets the topic information. There is only one broker, so all replication and partition values are 1.

bootstrap.servers = ${bootstrapServers}
key.converter = org.apache.kafka.connect.json.JsonConverter
value.converter = org.apache.kafka.connect.json.JsonConverter
group.id = connect-cluster
rest.port = 8080
config.storage.topic = connect-config
config.storage.replication.factor = 1
offset.storage.topic = connect-offsets
offset.storage.replication.factor = 1
offset.storage.partitions=1
status.storage.topic = connect-status
status.storage.replication.factor = 1
status.storage.partitions=1

Connect JVM options:

"-Xms128M" "-Xmx512M"

Full debug logs are available here: https://gist.github.com/RocketPuppy/5e0ee2fc6379325458d84fc875a90938. If you are familiar with Nix and NixOps, you can create an environment to replicate this by using the files here: https://gist.github.com/RocketPuppy/018fc5f60b05e557de0c37e0749232a5.
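For reference, when I say the REST server is responding, I mean the Connect REST API answers basic requests on the configured rest.port (8080). Roughly:

```shell
# Root endpoint reports the worker's version information
curl -s http://192.168.56.102:8080/

# Connector list; empty, since nothing has been registered yet
curl -s http://192.168.56.102:8080/connectors
```

Both return promptly right up until the herder thread dies.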