To anyone else struggling with this: it turned out to be caused by the default values of the `offsets.topic.replication.factor` and `transaction.state.log.replication.factor` broker properties, which both default to 3. Since a replication factor higher than the number of brokers in your cluster doesn't work, and I have a single-node cluster, I set these properties to 1. The Connect workers are staying up now.
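For reference, this is roughly the change in the broker's server.properties. These are standard Kafka broker properties; the values below assume a single-broker cluster (I've also included `transaction.state.log.min.isr`, which defaults to 2 and can cause a similar problem on one node):

```
# Both default to 3; with a single broker, creating the internal topics
# fails because the requested replication factor exceeds the broker count.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1

# Defaults to 2; the min ISR cannot exceed the replication factor on a
# single-node cluster.
transaction.state.log.min.isr=1
```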
On Sun, Oct 7, 2018 at 2:25 PM Daniel Wilson <dan...@mojotech.com> wrote:

> Hello,
>
> I am trying to set up a small Kafka cluster on VMs before promoting it to
> dedicated hardware in a cloud environment. I'm running into an issue when
> starting Connect in distributed mode. First, my understanding is that
> starting Connect in distributed mode will cause it to create a REST server
> which is then used to add connectors. Additionally, it will also attempt
> to automatically create topics for storing offset, configuration, and
> status information. The problem I am encountering is that shortly after
> starting, and after I've confirmed the REST server is responding, an
> exception is thrown in the "herder work thread" stating that the position
> for partition connect-offsets-0 could not be determined. These are some of
> the relevant logs I can find. The series starting with "Updated cluster
> metadata version..." repeats quite a lot before this section.
>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG Updated cluster metadata version 584 to Cluster(id = 9xi4YtlwQnyeAGC15jx5UA, nodes = [192.168.56.102:9092 (id: 0 rack: null)], parti>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,647] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Received FindCoordinator response ClientResponse(receivedTimeMs=153893051564>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Group coordinator lookup failed: The coordinator is not available. (org.apac>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,648] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Coordinator discovery failed, refreshing metadata (org.apache.kafka.clients.>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,728] DEBUG [Consumer clientId=consumer-1, groupId=connect-cluster] Sending FindCoordinator request to broker 192.168.56.102:9092 (id: 0 rack: n>
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: [KAFKA-CONNECT] [2018-10-07 16:41:55,729] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder)
> Oct 07 16:41:55 kafka-0 apache-kafka-connect-start[5184]: org.apache.kafka.common.errors.TimeoutException: Timeout of 60000ms expired before the position for partition connect-offsets-0 could be determined
>
> I can confirm the Connect instance is creating the offsets topic
> automatically with the correct replication and partition options. It is
> not creating the config or status topics. I can also confirm that
> Zookeeper and the Kafka broker are talking to each other happily. I can
> create topics from the command line and produce and consume messages
> using the command-line producers and consumers. I have not registered any
> connectors with Connect yet, as the instance dies after the timeout
> occurs. It doesn't appear to me that there are any problems with the
> connections anywhere, unless the herder process connects to the cluster
> differently than the main process and the problem is there. I don't know
> how I would go about diagnosing that, though.
>
> The test cluster I'm creating consists of three VMs: one is running
> Zookeeper, one is running Kafka and Kafka Connect, and the other is
> acting as a log sink for the first two. All programs are being run as
> systemd services. I'm using the 2.0 Kafka distribution.
>
> I'm out of ideas on where else to start looking for leads. I'd appreciate
> any insight on that matter. There are a lot of configuration options, so
> if this is a configuration problem I'd appreciate any insight there as
> well. Otherwise, any and all other help is appreciated.
>
> Daniel Wilson-Thomas
>
> Configuration and other output follows:
>
> Output of kafkacat:
>
> $ kafkacat -L -b 192.168.56.102:9092
> Metadata for all topics (from broker 0: 192.168.56.102:9092/0):
>  1 brokers:
>   broker 0 at 192.168.56.102:9092
>  2 topics:
>   topic "a-new-topic" with 1 partitions:
>     partition 0, leader 0, replicas: 0, isrs: 0
>   topic "connect-offsets" with 1 partitions:
>     partition 0, leader 0, replicas: 0, isrs: 0
>
> No special user config for Zookeeper.
>
> Kafka user configuration:
> Bumps up the Zookeeper-related timeouts (I can probably get rid of those
> now) and configures the listeners for the dynamically assigned VM IP
> address.
>
> listeners=PLAINTEXT://${ip}:9092
> zookeeper.connection.timeout.ms=600000
> zookeeper.session.timeout.ms=12000
>
> Kafka JVM options:
> "-server"
> "-Xmx512M"
> "-Xms128M"
> "-XX:+UseCompressedOops"
> "-XX:+UseParNewGC"
> "-XX:+UseConcMarkSweepGC"
> "-XX:+CMSClassUnloadingEnabled"
> "-XX:+CMSScavengeBeforeRemark"
> "-XX:+DisableExplicitGC"
> "-Djava.awt.headless=true"
> "-Djava.net.preferIPv4Stack=true"
>
> Connect user configuration:
> Sets up bootstrap servers to point to the dynamically assigned Kafka
> instance. Sets converters. Sets topic information. Only one broker, so
> all replication and partition values are 1.
>
> bootstrap.servers = ${bootstrapServers}
> key.converter = org.apache.kafka.connect.json.JsonConverter
> value.converter = org.apache.kafka.connect.json.JsonConverter
> group.id = connect-cluster
> rest.port = 8080
> config.storage.topic = connect-config
> config.storage.replication.factor = 1
> offset.storage.topic = connect-offsets
> offset.storage.replication.factor = 1
> offset.storage.partitions=1
> status.storage.topic = connect-status
> status.storage.replication.factor = 1
> status.storage.partitions=1
>
> Connect JVM options:
> "-Xms128M"
> "-Xmx512M"
>
> Full debug logs are available here:
> https://gist.github.com/RocketPuppy/5e0ee2fc6379325458d84fc875a90938
>
> If you are familiar with Nix and NixOps, you can create an environment to
> replicate this by using the files here:
> https://gist.github.com/RocketPuppy/018fc5f60b05e557de0c37e0749232a5