Thanks again Guozhang. There are still some details here that are unclear to me, but if what I am describing is not a bug, do you think it is reasonable to file this as a feature request? Do we agree that it is not ideal to have to keep "at least one broker in the list" alive when replacing a cluster, i.e. migrating from one set of brokers to another?
Christofer

On Wed, Feb 26, 2014 at 9:16 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> kafka-preferred-replica-election.sh is only used to move leaders between brokers. As long as a broker in metadata.broker.list, i.e. the second broker list I mentioned in my previous email, is still alive, the producer can learn about the leader change from it.
>
> In terms of broker discovery, I think it depends on how you "define" the future. For example, say there are originally 3 brokers 1, 2, 3, and you start the producer with metadata list = {1,2,3}; if later another three brokers 4, 5, 6 are added, the producer can still find these newly added brokers. It is just that if all the brokers in the metadata list, i.e. 1, 2, 3, are gone, then the producer will not be able to refresh its metadata.
>
> Guozhang
>
> On Wed, Feb 26, 2014 at 11:04 AM, Christofer Hedbrandh <christo...@knewton.com> wrote:
>
> > Thanks for your response Guozhang.
> >
> > I did make sure that new metadata was fetched before taking out the old broker. I set topic.metadata.refresh.interval.ms to something very low, and I confirmed in the producer log that new metadata is actually fetched after the new broker is brought up and before the old broker is taken down. Does this not mean that the dynamic current-brokers list would hold the new broker at this point?
> >
> > If you are saying that the dynamic current-brokers list is never used for fetching metadata, this does not explain why the producer does NOT fail when kafka-preferred-replica-election.sh makes the new broker become the leader.
> >
> > Lastly, if broker discovery is not a producer feature in the 0.8.0 Release, and I have to "make sure at least one broker in the list is alive during the rolling bounce", is this a feature you are considering for future versions?
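The producer behavior discussed in this thread — a fixed bootstrap list used for metadata refreshes, separate from the dynamic list of current brokers — can be sketched as a toy model. This is an illustration of the described behavior only, not Kafka's actual implementation; the class and method names are made up.

```python
# Toy model of the two broker lists the 0.8 producer keeps, as described
# in this thread. Illustration only; not Kafka's actual code.

class ToyProducer:
    def __init__(self, metadata_broker_list):
        # Fixed bootstrap list: read from metadata.broker.list at startup
        # and never changed afterwards.
        self.bootstrap = list(metadata_broker_list)
        # Dynamic list of current brokers, learned from metadata responses.
        self.current = []

    def refresh_metadata(self, live_brokers):
        """Metadata requests go only to the fixed bootstrap list; the
        dynamic list is never consulted when refreshing metadata."""
        for broker in self.bootstrap:
            if broker in live_brokers:
                self.current = sorted(live_brokers)
                return True
        # No bootstrap broker is alive: the refresh fails even if brokers
        # in self.current are still up.
        return False

# Walk through the scenario from the thread: producer started with only
# broker B in its bootstrap list; broker A comes up; B is then taken down.
producer = ToyProducer(["B"])
assert producer.refresh_metadata({"A", "B"})   # refresh via B succeeds
assert producer.current == ["A", "B"]          # producer has learned about A
assert not producer.refresh_metadata({"A"})    # B gone: refresh fails anyway
```

This matches the reported symptom: the producer "knows" about the new broker from earlier metadata responses, yet cannot refresh metadata once every bootstrap-list broker is down.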
> >
> > On Wed, Feb 26, 2014 at 12:04 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > Hello Chris,
> > >
> > > The metadata.broker.list, once read in at startup time, will not be changed. In other words, during the lifetime of a producer it has two lists of brokers:
> > >
> > > 1. The current brokers in the cluster, as returned in the metadata request response; this list is dynamic.
> > >
> > > 2. The broker list used for bootstrapping; this is read from metadata.broker.list and is fixed. This list could, for example, be a VIP, with a hardware load balancer behind it distributing the metadata requests to the brokers.
> > >
> > > So in your case, the metadata list only has broker B, and once it is taken out and the producer fails to send a message to it and hence tries to refresh its metadata, it has no broker to go to.
> > >
> > > Therefore, when you are doing a rolling bounce of the cluster to, for example, perform an in-place upgrade, you need to make sure at least one broker in the list is alive during the rolling bounce.
> > >
> > > Hope this helps.
> > >
> > > Guozhang
> > >
> > > On Wed, Feb 26, 2014 at 8:19 AM, Christofer Hedbrandh <christo...@knewton.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I ran into a problem with the Kafka producer when attempting to replace all the nodes in a 0.8.0 Beta1 Release Kafka cluster with 0.8.0 Release nodes. I started a producer/consumer test program to measure the cluster's performance during the process, I added new brokers, I ran kafka-reassign-partitions.sh, and I removed the old brokers. When I removed the old brokers, my producer failed.
> > > >
> > > > The simplest scenario that I could come up with where I still see this behavior is this.
> > > > Using the 0.8.0 Release, we have a 1-partition topic with 2 replicas on 2 brokers, broker A and broker B. Broker A is taken down. A producer is started with only broker B in metadata.broker.list. Broker A is brought back up. We let topic.metadata.refresh.interval.ms amount of time pass. Broker B is taken down, and we get kafka.common.FailedToSendMessageException after all the (many) retries have failed.
> > > >
> > > > During my experimentation I have made sure that the producer fetches metadata before the old broker is taken down. And I have made sure that enough retries with enough backoff time were used for the producer not to give up prematurely.
> > > >
> > > > The documentation for the producer config metadata.broker.list suggests to me that this list of brokers is only used at startup: "This is for bootstrapping and the producer will only use it for getting metadata (topics, partitions and replicas)". And when I read about topic.metadata.refresh.interval.ms and retry.backoff.ms I learn that metadata is indeed fetched at later times. Based on this documentation, I make the assumption that the producer would learn about any new brokers when new metadata is fetched.
> > > >
> > > > I also want to point out that the cluster seems to work just fine during this process; it only seems to be a problem with the producer. Between all these steps I run kafka-list-topic.sh, I try the console producer and consumer, and everything is as expected.
> > > >
> > > > I also found another interesting thing when experimenting with running kafka-preferred-replica-election.sh before taking down the old broker.
> > > > This script only causes any change when the leader and the preferred replica are different. In the scenario where they are in fact different, and the new broker takes the role of leader from the old broker, the producer does NOT fail. This makes me think that perhaps the producer only keeps metadata about topic leaders, and not about all replicas as the documentation suggests to me.
> > > >
> > > > It is clear that I am making a lot of assumptions here, and I am relatively new to Kafka, so I could very well be missing something important. The way I see it, there are a few possibilities:
> > > >
> > > > 1. Broker discovery is a supposed producer feature, and it has a bug.
> > > > 2. Broker discovery is not a producer feature, in which case I think many people might benefit from clearer documentation.
> > > > 3. I am doing something dumb, e.g. forgetting about some important configuration.
> > > >
> > > > Please let me know what you make of this.
> > > >
> > > > Thanks,
> > > > Christofer Hedbrandh
> > >
> > > --
> > > -- Guozhang
>
> --
> -- Guozhang
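For reference, the failing scenario described in the thread corresponds to producer settings along these lines. The property names come from the 0.8 producer config documentation quoted above; the host name and values are illustrative only:

```properties
# Fixed bootstrap list, read once at startup. Listing only broker B
# reproduces the failure; listing several brokers (or a VIP in front of
# the cluster) keeps metadata refresh possible during a rolling bounce.
metadata.broker.list=brokerB:9092

# Low refresh interval so the producer fetches new metadata (and learns
# about broker A) before broker B is taken down.
topic.metadata.refresh.interval.ms=1000

# Generous retries and backoff so the producer does not give up
# prematurely, as described in the experiment.
message.send.max.retries=10
retry.backoff.ms=1000
```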