Hi all,
I ran into a problem with the Kafka producer when attempting to replace all the nodes in a 0.8.0 Beta1 Release Kafka cluster with 0.8.0 Release nodes. I started a producer/consumer test program to measure the cluster's performance during the process, added new brokers, ran kafka-reassign-partitions.sh, and removed the old brokers. When I removed the old brokers, my producer failed.
The simplest scenario I could come up with that still shows this behavior is the following. Using version 0.8.0 Release, we have a 1-partition topic with 2 replicas on 2 brokers, broker A and broker B. Broker A is taken down. A producer is started with only broker B in metadata.broker.list. Broker A is brought back up. We let topic.metadata.refresh.interval.ms pass. Broker B is taken down, and we get kafka.common.FailedToSendMessageException after all the (many) retries have failed.
During my experimentation I made sure that the producer fetches metadata before the old broker is taken down. I also made sure that enough retries with enough backoff time were configured so that the producer would not give up prematurely.
The documentation for the producer config metadata.broker.list suggests to me that this list of brokers is only used at startup: "This is for bootstrapping and the producer will only use it for getting metadata (topics, partitions and replicas)". And when I read about topic.metadata.refresh.interval.ms and retry.backoff.ms, I learn that metadata is indeed fetched again at later times. Based on this documentation, I assumed that the producer would learn about any new brokers when new metadata is fetched.
I also want to point out that the cluster itself seems to work just fine during this process; the problem only seems to affect the producer. Between all these steps I run kafka-list-topic.sh and try the console producer and consumer, and everything is as expected.
I also found another interesting thing when experimenting with running kafka-preferred-replica-election.sh before taking down the old broker. This script only causes changes when the leader and the preferred replica are different. In the scenario where they are in fact different, and the new broker takes over the role of leader from the old broker, the producer does NOT fail.
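For reference, the producer settings discussed above could be sketched roughly as follows. This is only an illustration, not my actual test program: the host name and all numeric values are placeholders, and message.send.max.retries is the 0.8 producer setting I take "retries" to mean. The sketch only builds the java.util.Properties object and estimates how long the producer spends backing off before it gives up.

```java
import java.util.Properties;

public class ProducerConfigSketch {

    // Builds 0.8-style producer properties for the scenario above.
    // The host name and the numeric values are placeholders (assumptions).
    public static Properties buildProps() {
        Properties props = new Properties();
        // Bootstrap list that contains only broker B (hypothetical host:port).
        props.put("metadata.broker.list", "broker-b.example.com:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // How often the producer refreshes topic metadata on its own.
        props.put("topic.metadata.refresh.interval.ms", "10000");
        // Number of retries and the pause before each retry.
        props.put("message.send.max.retries", "20");
        props.put("retry.backoff.ms", "500");
        return props;
    }

    // Rough estimate of the time spent backing off before the producer gives up:
    // number of retries multiplied by the backoff per retry.
    public static long retryWindowMs(Properties props) {
        long retries = Long.parseLong(props.getProperty("message.send.max.retries"));
        long backoffMs = Long.parseLong(props.getProperty("retry.backoff.ms"));
        return retries * backoffMs;
    }

    public static void main(String[] args) {
        Properties props = buildProps();
        System.out.println("retry window (ms): " + retryWindowMs(props));
        // With the Kafka 0.8 jars on the classpath, these properties would be
        // wrapped in kafka.producer.ProducerConfig and handed to
        // kafka.javaapi.producer.Producer.
    }
}
```

With these placeholder values the producer would spend about 10 seconds in backoff across all retries, which should leave time for a leader change to complete before it gives up.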
This makes me think that perhaps the producer only keeps metadata about topic leaders, and not about all replicas as the documentation suggests to me.
It is clear that I am making a lot of assumptions here, and I am relatively new to Kafka, so I could very well be missing something important. The way I see it, there are a few possibilities:
1. Broker discovery is supposed to be a producer feature, and it has a bug.
2. Broker discovery is not a producer feature, in which case I think many people might benefit from clearer documentation.
3. I am doing something dumb, e.g. forgetting about some important configuration.
Please let me know what you make of this.
Thanks,
Christofer Hedbrandh