For the last few days I have been seeing a problem I cannot explain.
For months I have been successfully running Kafka/Zookeeper under
Docker, and my application has worked fine. Lately, when I run Kafka
under either docker-compose on my developer system, or 'docker stack
deploy' on a Docker Swarm on AWS, here is what I am seeing:
According to the logs, Zookeeper/Kafka seem to start okay, and the 3
brokers I have configured seem to find each other. The logs look pretty
normal. Then I start my application, and my application logs show that
it has connected to the Kafka cluster okay and that it has created the
topics okay. However, there is nothing in the Kafka logs to show any
kind of connection from my application, let alone topics being created.
Sure enough, when I rerun my application, it cannot find the topics,
tries to create them again, and again gets a successful response from
the Kafka Admin Client. Nope, they were not created.
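To illustrate what I mean by "a successful response from the Kafka
Admin Client", here is a minimal sketch (not my actual code) of the
kind of check I am doing, assuming the plain Java AdminClient from
0.11.0.0 and the 127.0.0.1:9094-9096 ports published by the compose
file below. The topic name, partition count, and replication factor
are just illustrative; note that createTopics() only surfaces errors
if you block on the futures it returns:

import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Same external ports that the compose file below publishes.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "127.0.0.1:9094,127.0.0.1:9095,127.0.0.1:9096");

        AdminClient admin = AdminClient.create(props);
        try {
            // 3 partitions, replication factor 3 -- illustrative values only.
            NewTopic topic = new NewTopic("example-topic", 3, (short) 3);

            // createTopics() is asynchronous; errors only become visible
            // when we block on the returned futures with get().
            admin.createTopics(Collections.singleton(topic)).all().get();

            // Cross-check: ask the cluster which topics it actually knows about.
            Set<String> names = admin.listTopics().names().get();
            System.out.println("Broker reports topics: " + names);
        } finally {
            admin.close();
        }
    }
}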
When I shut down Kafka, the logs show the shutdown sequence for all the
brokers and Zookeeper. I cannot understand why the Kafka client library
is not reporting any errors when the Kafka logs show no connections or
operations at all.
I tried both Kafka 0.11.0.0 and 0.10.2.1 -- same problem.
I have been trying to figure out this problem all morning, bashing my
head against the wall.
*Then I go to lunch*, and a couple of hours later I try one more time.
Behold, suddenly the Kafka logs report that the brokers have created
the topics my application requested. But now I am stuck with the
infamous org.apache.kafka.common.errors.NotLeaderForPartitionException
problem again. This is another problem that has only started recently.
Unfortunately, having wasted hours and hours fighting the first
problem, I have not been able to dig into this one.
What could possibly be the explanation for this not working, and then
working again after a few hours?
It seems insanely difficult to operate a Kafka cluster in any kind of
stable configuration that does not fail randomly.
Can anyone offer any kind of advice on what the problem might be?
Is it better to just give up trying to operate our own Kafka cluster
and use Kinesis instead?
Cheers, Eric
# Compose a collection of Docker containers used by the Spacejam/Madlands server
# See the README in this directory
# Creates a network so the Docker containers can reach each other by hostname
# https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
# https://docs.docker.com/compose/networking/
# https://docs.google.com/document/d/1isfM3HI-Rxbal9l_v2dyU6pl7CZMpQ_r2irkiMag2vE/edit#heading=h.krkqmakfnk6n
version: '3.2'

services:

  mysql:
    image: iggcanada/mysql
    ports:
      - "3306:3306"
    environment:
      MYSQL_ROOT_PASSWORD: igg
      MYSQL_USER: igg
      MYSQL_PASSWORD: igg
    networks:
      - madlands

  redis:
    image: redis:latest
    ports:
      - "6379:6379"
    networks:
      - madlands
# Zookeeper is basically the distributed database and name server for Kafka brokers.
# The brokers use this to find each other, and coordinate information amongst themselves.
# In a single developer environment we only start one Zookeeper agent as we don't really need
# multiple agents. In a production environment we would run as many agents as we could.
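# For example (hypothetical hostnames, not part of this stack): with a three-node ensemble in
# production, every broker would list all of the Zookeeper members, e.g.
#   KAFKA_ZOOKEEPER_CONNECT: zk-1:2181,zk-2:2181,zk-3:2181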
  zookeeper:
    image: wurstmeister/zookeeper
    depends_on:
      - redis
      - mysql
    ports:
      - "2181:2181"
    networks:
      - madlands
# Kafka is our message broker which provides a durable PubSub service. In a single developer
# environment we don't really need to start 3 brokers, but in this case it's useful to better
# simulate the behavior of the cluster in a production environment with multiple partitions
# and replication coordinated between multiple brokers. Because we are in a single Docker
# stack, we use the ports 9094, 9095, and 9096 via Docker port mapping to reach the brokers
# in each container. For example, in your application.conf file you should have
#   kafka.connect.bootstrap.servers = "127.0.0.1:9094,127.0.0.1:9095,127.0.0.1:9096"
# https://kafka.apache.org/documentation/#configuration
# https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.1/bk_kafka-component-guide/content/prepare-kafka-env.html
# https://cwiki.apache.org/confluence/display/KAFKA/KIP-103%3A+Separation+of+Internal+and+External+traffic
# http://wurstmeister.github.io/kafka-docker
# Note: this document is out of date wrt Listener Configuration, and trying to follow that can
# lead to frustration and misery. The configuration below is the result of extensive trial and
# error, testing, and more testing.
# If you want to customise any Kafka parameters, simply add them as environment variables.
# For example: delete.topic.enable=true becomes KAFKA_DELETE_TOPIC_ENABLE: "true"
# KAFKA_LISTENERS
#   "Listener List - Comma-separated list of URIs we will listen on and the listener names.
#   If the listener name is not a security protocol, listener.security.protocol.map must also
#   be set. Specify hostname as 0.0.0.0 to bind to all interfaces. Leave hostname empty to bind
#   to default interface. Examples of legal listener lists:
#   PLAINTEXT://myhost:9092,SSL://:9091 CLIENT://0.0.0.0:9092,REPLICATION://localhost:9093"
# KAFKA_ADVERTISED_LISTENERS
#   "Listeners to publish to ZooKeeper for clients to use, if different than the listeners
#   above. In IaaS environments, this may need to be different from the interface to which the
#   broker binds. If this is not set, the value for `listeners` will be used."
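#   For reference: with the settings below, and assuming the image's start script expands the
#   $(exec hostname) command substitution at container start, each broker advertises
#   PLAINTEXT://<its container hostname>:9092 inside the stack and OUTSIDE://0.0.0.0:9094 on
#   the OUTSIDE listener.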
# KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
#   "Map between listener names and security protocols. This must be defined for the same
#   security protocol to be usable in more than one port or IP. For example, we can separate
#   internal and external traffic even if SSL is required for both. Concretely, we could define
#   listeners with names INTERNAL and EXTERNAL and this property as: `INTERNAL:SSL,EXTERNAL:SSL`.
#   As shown, key and value are separated by a colon and map entries are separated by commas.
#   Each listener name should only appear once in the map."
# KAFKA_INTER_BROKER_LISTENER_NAME
#   "Name of listener used for communication between brokers. If this is unset, the listener
#   name is defined by security.inter.broker.protocol. It is an error to set this and
#   security.inter.broker.protocol properties at the same time."
# KAFKA_INTER_BROKER_PROTOCOL_VERSION
#   "Specify which version of the inter-broker protocol will be used. This is typically bumped
#   after all brokers were upgraded to a new version. Example of some valid values are: 0.8.0,
#   0.8.1, 0.8.1.1, 0.8.2, 0.8.2.0, 0.8.2.1, 0.9.0.0, 0.9.0.1 Check ApiVersion for the full list."
# KAFKA_DELETE_TOPIC_ENABLE
#   We use this so that the Kafka support in the server has complete control over topic
#   management. The philosophy is that it's better to automate this in the server, than document
#   it as a manual DevOps process. An argument could be made that allowing the server to delete
#   topics is too dangerous to leave to the discretion of software developers.
# KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS
#   There is a bug in the version of Kafka we are using, in that the configuration system is not
#   able to resolve the default value, so we have to explicitly set it.
# KAFKA_LOG_RETENTION_BYTES
#   We don't really need to set this as we are using the default value, but we expose it here
#   as a reminder to consider setting it.
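#   (For reference: log.retention.bytes defaults to -1, which means no size-based limit, so
#   retention is then governed purely by the time-based settings.)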
# KAFKA_LOG_RETENTION_DAYS
#   We change this from the default of 7, because we really don't expect to need to retain
#   messages for 7 days.
# The hostnames kafka-1, kafka-2, kafka-3, etc. are only visible within the Docker stack, as
# Docker has its own embedded DNS. These names will also be shared via Zookeeper with the
# brokers. However, these names are not resolvable outside the stack.
# By default, PLAINTEXT is the 'inside' listener on 9092. In some production situations we
# might also define a BROKER listener on hostname:9093 for inter-broker communication, such as
# replication. The listener name REPLICATION is also used for this purpose. If you use this
# extra listener, then you also need to set the property KAFKA_INTER_BROKER_LISTENER_NAME to
# BROKER or REPLICATION. We define the listener OUTSIDE to listen on 0.0.0.0 (all interfaces),
# which means externally we can connect to Kafka at 127.0.0.1 in our developer environment.
# In a production environment, we might use SASL and/or SSL on the OUTSIDE listener.
# Note: PLAINTEXT is used as both a listener name and a security protocol name, which can seem
# confusing. The KAFKA_LISTENER_SECURITY_PROTOCOL_MAP here is overkill, but includes the
# default setting, plus OUTSIDE.
# We do not use KAFKA_ADVERTISED_HOST_NAME as that property is now deprecated. Also, using it
# required that we add "127.0.0.1 localhost kafka-1 kafka-2 kafka-3" to our /etc/hosts file.
# It would be less confusing if wurstmeister would remove all references to deprecated Kafka
# properties from his documentation and examples.
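# To summarise the two paths described above: a client running inside the madlands network
# would bootstrap against kafka-1:9092,kafka-2:9092,kafka-3:9092, while a client on the
# developer host uses 127.0.0.1:9094,127.0.0.1:9095,127.0.0.1:9096 via the published ports
# defined below.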
  kafka-1:
    image: iggcanada/kafka
    depends_on:
      - zookeeper
      - redis
      - mysql
    ports:
      - target: 9094
        published: 9094
        protocol: tcp
        mode: host
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://:9092,OUTSIDE://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://$$(exec hostname):9092,OUTSIDE://0.0.0.0:9094
      # KAFKA_INTER_BROKER_PROTOCOL_VERSION: 0.11.0.0
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "false"
      KAFKA_DELETE_TOPIC_ENABLE: "true"
      KAFKA_LOG_RETENTION_BYTES: -1
      KAFKA_LOG_RETENTION_DAYS: 2
      KAFKA_LOG4J_ROOT_LOGLEVEL: "DEBUG"
      KAFKA_SESSION_TIMEOUT_MS: 50000
      # Required because of bugs in Kafka 0.11.0.0
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 3000
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - madlands
  kafka-2:
    image: iggcanada/kafka
    depends_on:
      - zookeeper
      - redis
      - mysql
    ports:
      - target: 9094
        published: 9095
        protocol: tcp
        mode: host
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://:9092,OUTSIDE://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://$$(exec hostname):9092,OUTSIDE://0.0.0.0:9094
      # KAFKA_INTER_BROKER_PROTOCOL_VERSION: 0.11.0.0
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "false"
      KAFKA_DELETE_TOPIC_ENABLE: "true"
      KAFKA_LOG_RETENTION_BYTES: -1
      KAFKA_LOG_RETENTION_DAYS: 2
      KAFKA_LOG4J_ROOT_LOGLEVEL: "DEBUG"
      KAFKA_SESSION_TIMEOUT_MS: 50000
      # Required because of bugs in Kafka 0.11.0.0
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 3000
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - madlands
  kafka-3:
    image: iggcanada/kafka
    depends_on:
      - zookeeper
      - redis
      - mysql
    ports:
      - target: 9094
        published: 9096
        protocol: tcp
        mode: host
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,SSL:SSL,SASL_PLAINTEXT:SASL_PLAINTEXT,SASL_SSL:SASL_SSL,OUTSIDE:PLAINTEXT
      KAFKA_LISTENERS: PLAINTEXT://:9092,OUTSIDE://0.0.0.0:9094
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://$$(exec hostname):9092,OUTSIDE://0.0.0.0:9094
      # KAFKA_INTER_BROKER_PROTOCOL_VERSION: 0.11.0.0
      KAFKA_AUTO_CREATE_TOPICS_ENABLE: "false"
      KAFKA_DELETE_TOPIC_ENABLE: "true"
      KAFKA_LOG_RETENTION_BYTES: -1
      KAFKA_LOG_RETENTION_DAYS: 2
      KAFKA_LOG4J_ROOT_LOGLEVEL: "DEBUG"
      KAFKA_SESSION_TIMEOUT_MS: 50000
      # Required because of bugs in Kafka 0.11.0.0
      KAFKA_GROUP_INITIAL_REBALANCE_DELAY_MS: 3000
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - madlands
# Create our named network of type bridge
# For some reason could not get server to connect to Kafka using
# the default bridge network or host network. Not sure why only
# custom bridge network works? EK
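# (A likely explanation: Docker's embedded DNS only resolves service/container names on
# user-defined networks; the default bridge network does not offer name-based discovery, so
# the advertised kafka-N hostnames would not resolve there.)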
networks:
  madlands:
    driver: bridge