Some more detail on this issue:
On a hunch I tried restarting my docker-compose stack a few more times.
Still the same problem: my application, using the Kafka Client APIs,
claims it is talking to Kafka, but the Kafka logs disagree.
So I restarted the stack once more. With 'docker-compose up' this is a
very clean start. Then I waited about 10 minutes until I saw
kafka-1_1 | [2017-09-13 15:00:47,214] INFO [Group Metadata Manager on
Broker 1001]: Removed 0 expired offsets in 0 milliseconds.
(kafka.coordinator.GroupMetadataManager)
kafka-3_1 | [2017-09-13 15:00:47,595] INFO [Group Metadata Manager on
Broker 1002]: Removed 0 expired offsets in 0 milliseconds.
(kafka.coordinator.GroupMetadataManager)
kafka-2_1 | [2017-09-13 15:00:47,771] INFO [Group Metadata Manager on
Broker 1003]: Removed 0 expired offsets in 0 milliseconds.
(kafka.coordinator.GroupMetadataManager)
in the log. When I reran my application, the Kafka logs suddenly came
alive with indications that topics were being created, and so on.
I have been using Kafka for a couple of months now, and this is new
behavior I had not seen until about a week ago. I am used to being
able to run my application immediately after the Kafka stack comes up
in Docker. Operationally, it now seems I have to wait about 10 minutes
after starting Kafka.
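As a workaround, I am considering having my application poll the cluster
with the 0.11.0.0 AdminClient until all 3 brokers are visible, rather than
sleeping blindly. A rough sketch of what I have in mind (the bootstrap
address and broker count here are placeholders for my actual setup):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

import java.util.Properties;

public class WaitForBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address; substitute the real advertised listeners.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");

        AdminClient admin = AdminClient.create(props);
        try {
            // Poll until all 3 brokers have registered, instead of waiting 10 minutes.
            while (admin.describeCluster().nodes().get().size() < 3) {
                System.out.println("Cluster not fully up yet, waiting 5 seconds...");
                Thread.sleep(5000);
            }
            System.out.println("All brokers visible; starting the application.");
        } finally {
            admin.close();
        }
    }
}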
Of course I am still dealing with the NotLeaderForPartitionException
problem, which is also new, and breaks my application, but at least I
seem to have a repeatable path to that problem.
Cheers, Eric
On 2017-09-12 2:43 PM, Eric Kolotyluk wrote:
For the last few days I have been seeing a problem I do not know how
to explain.
For months I have been successfully running Kafka/Zookeeper under
Docker, and my application seems to work fine. Lately, when I run
Kafka under either docker-compose on my development system or 'docker
stack deploy' on a Docker Swarm on AWS, here is what I am seeing:
According to the logs, Zookeeper/Kafka seem to start okay, and the 3
brokers I have configured seem to find each other. The logs look
pretty normal. Then I start my application, and my application logs
show that it has connected to the Kafka cluster and created the
topics okay. However, there is nothing in the Kafka logs to show any
kind of connection from my application, let alone topics being
created. Sure enough, when I rerun my application, it cannot find the
topics, so it tries to create them again and again gets a successful
response from the Kafka Admin Client. Nope, they were not created.
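For reference, this is roughly the shape of topic creation I would expect
to work with the 0.11.0.0 AdminClient (a simplified sketch; the topic
name, partition/replica counts, and bootstrap address are placeholders),
including a listTopics cross-check:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;
import java.util.Set;

public class CreateAndVerifyTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap address.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");

        AdminClient admin = AdminClient.create(props);
        try {
            // Block on the result; get() should throw if the broker rejected the request.
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();

            // Cross-check by asking the cluster which topics it actually knows about.
            Set<String> names = admin.listTopics().names().get();
            System.out.println("Broker reports topics: " + names);
        } finally {
            admin.close();
        }
    }
}

Blocking on all().get() is supposed to surface any broker-side failure as
an exception rather than silently succeeding.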
When I shut down Kafka, the logs show the shutdown sequence for all
the brokers and Zookeeper. I cannot understand why the Kafka client
library shows no errors when the Kafka logs show no connections or
operations at all.
I tried both Kafka 0.11.0.0 and 0.10.2.1 -- same problem.
I have been trying to figure out this problem all morning, bashing my
head against the wall.
*Then I go to lunch*, and a couple of hours later I try one more time.
Behold, suddenly I can see the Kafka logs reporting they have created
the topics my application requested. But now I am stuck with the
infamous org.apache.kafka.common.errors.NotLeaderForPartitionException
problem again. This is another new problem that has started recently.
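For what it is worth, my understanding is that
NotLeaderForPartitionException is a retriable error, so I am planning to
give the producer more room to retry while it refreshes metadata, along
these lines (a sketch; the retry values and bootstrap address are just
guesses, not tuned settings):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class RetryingProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap address.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka-1:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // NotLeaderForPartitionException is retriable, so allow the producer
        // to refresh metadata and retry against the newly elected leader.
        props.put(ProducerConfig.RETRIES_CONFIG, 10);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 1000);

        Producer<String, String> producer = new KafkaProducer<>(props);
        // ... send records as usual ...
        producer.close();
    }
}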
Unfortunately, I have wasted hours and hours fighting the first
problem, so I have not been able to dig into this one.
What could possibly be the explanation for this not working, and then
working again after a few hours?
It seems insanely difficult to operate a Kafka cluster in any kind of
stable configuration that does not fail randomly.
Can anyone offer any kind of advice on what the problem might be?
Is it better to just give up trying to operate our own Kafka cluster
and use Kinesis instead?
Cheers, Eric