On Mon, Mar 3, 2014 at 4:00 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> Hi Chris,
>
> In 0.9 we will have just one "broker list", i.e. the list of brokers read from the config file will be updated during bootstrapping and by all future metadata refresh operations. This feature should lift the limit you are describing: for example, if your broker list in the config is {1,2,3}, and later on you extend the cluster to {1,2,3,4,5,6}, you can then shut down 1,2,3 all at once.

But if your producer or consumer ever restarts and only knows about {1,2,3}, the problem still exists, no? This is why I bootstrap off of zk and expect to have to maintain an accurate list of zk nodes to all processes (rough sketch at the bottom of this mail).

> Guozhang
>
> On Mon, Mar 3, 2014 at 1:35 PM, Christofer Hedbrandh <christo...@knewton.com> wrote:
>
> > Thanks again Guozhang.
> >
> > There are still some details here that are unclear to me, but if what I am describing is not a bug, do you think it is reasonable to file this as a feature request? We agree that it is not ideal to have to keep "at least one broker in the list alive" when replacing a cluster, i.e. migrating from one set of brokers to another?
> >
> > Christofer
> >
> > On Wed, Feb 26, 2014 at 9:16 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> >
> > > kafka-preferred-replica-election.sh is only used to move leaders between brokers. As long as a broker in the broker.metadata.list, i.e. the second broker list I mentioned in my previous email, is still alive, the producer can learn about the leader change from it.
> > >
> > > In terms of broker discovery, I think it depends on how you "define" the future. For example, say there are originally 3 brokers 1,2,3 and you start the producer with metadata list = {1,2,3}; if later on another three brokers 4,5,6 are added, the producer can still find these newly added brokers. It is just that if all the brokers in the metadata list, i.e. 1,2,3, are gone, then the producer will not be able to refresh its metadata.
> > >
> > > Guozhang
> > >
> > > On Wed, Feb 26, 2014 at 11:04 AM, Christofer Hedbrandh <christo...@knewton.com> wrote:
> > >
> > > > Thanks for your response Guozhang.
> > > >
> > > > I did make sure that new metadata is fetched before taking out the old broker. I set topic.metadata.refresh.interval.ms to something very low, and I confirmed in the producer log that new metadata is actually fetched after the new broker is brought up and before the old broker is taken down. Does this not mean that the dynamic current-brokers list would hold the new broker at this point?
> > > >
> > > > If you are saying that the dynamic current-brokers list is never used for fetching metadata, this does not explain how the producer does NOT fail when kafka-preferred-replica-election.sh makes the new broker become the leader.
> > > >
> > > > Lastly, if broker discovery is not a producer feature in the 0.8.0 Release, and I have to "make sure at least one broker in the list is alive during the rolling bounce", is this a feature you are considering for future versions?
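(Interjecting with a concrete example, since the same questions keep coming up: the setup being discussed looks roughly like the sketch below, against the 0.8 high-level producer API. The broker host, class name, and tuning values are placeholders of mine, not the actual test configuration.)

    import java.util.Properties;

    import kafka.javaapi.producer.Producer;
    import kafka.producer.ProducerConfig;

    public class ProducerSetup {
        public static Producer<String, String> create() {
            Properties props = new Properties();
            // Fixed bootstrap list: consulted for metadata requests, never updated.
            props.put("metadata.broker.list", "broker-b:9092");
            props.put("serializer.class", "kafka.serializer.StringEncoder");
            // Refresh metadata frequently so new leaders are picked up quickly.
            props.put("topic.metadata.refresh.interval.ms", "10000");
            // Retry generously so the producer does not give up prematurely.
            props.put("message.send.max.retries", "10");
            props.put("retry.backoff.ms", "1000");
            return new Producer<String, String>(new ProducerConfig(props));
        }
    }

Note that metadata.broker.list is only read once, when the ProducerConfig is constructed; lowering topic.metadata.refresh.interval.ms changes how often metadata is re-fetched, but not which brokers the refresh request can be sent to.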
> > > > On Wed, Feb 26, 2014 at 12:04 PM, Guozhang Wang <wangg...@gmail.com> wrote:
> > > >
> > > > > Hello Chris,
> > > > >
> > > > > The broker.metadata.list, once read in at startup time, will not be changed. In other words, during the lifetime of a producer it has two lists of brokers:
> > > > >
> > > > > 1. The current brokers in the cluster, which are returned in the metadata request response; this list is dynamic.
> > > > >
> > > > > 2. The broker list used for bootstrapping; this is read from broker.metadata.list and is fixed. This list could, for example, be a VIP, with a hardware load balancer behind it distributing the metadata requests to the brokers.
> > > > >
> > > > > So in your case, the metadata list only has broker B, and once it is taken out and the producer fails to send messages to it and hence tries to refresh its metadata, it has no broker to go to.
> > > > >
> > > > > Therefore, when you are doing a rolling bounce of the cluster to, for example, perform an in-place upgrade, you need to make sure at least one broker in the list is alive during the rolling bounce.
> > > > >
> > > > > Hope this helps.
> > > > >
> > > > > Guozhang
> > > > >
> > > > > On Wed, Feb 26, 2014 at 8:19 AM, Christofer Hedbrandh <christo...@knewton.com> wrote:
> > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > I ran into a problem with the Kafka producer when attempting to replace all the nodes in a 0.8.0 Beta1 Release Kafka cluster with 0.8.0 Release nodes. I started a producer/consumer test program to measure the cluster's performance during the process, I added new brokers, I ran kafka-reassign-partitions.sh, and I removed the old brokers. When I removed the old brokers my producer failed.
> > > > > >
> > > > > > The simplest scenario I could come up with where I still see this behavior is this. Using version 0.8.0 Release, we have a 1-partition topic with 2 replicas on 2 brokers, broker A and broker B. Broker A is taken down. A producer is started with only broker B in the metadata.broker.list. Broker A is brought back up. We let topic.metadata.refresh.interval.ms amount of time pass. Broker B is taken down, and we get kafka.common.FailedToSendMessageException after all the (many) retries have failed.
> > > > > >
> > > > > > During my experimentation I have made sure that the producer fetches metadata before the old broker is taken down. And I have made sure that enough retries with enough backoff time were used for the producer not to give up prematurely.
> > > > > >
> > > > > > The documentation for the producer config metadata.broker.list suggests to me that this list of brokers is only used at startup: "This is for bootstrapping and the producer will only use it for getting metadata (topics, partitions and replicas)". And when I read about topic.metadata.refresh.interval.ms and retry.backoff.ms I learn that metadata is indeed fetched at later times. Based on this documentation, I make the assumption that the producer would learn about any new brokers when new metadata is fetched.
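(Again interjecting: the failure described here is easy to see with a loop like the following, reusing the configuration sketched earlier in this thread. The topic name is a placeholder of mine; kafka.common.FailedToSendMessageException is the exception named in the report, thrown once all retries are exhausted.)

    import kafka.common.FailedToSendMessageException;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;

    public class BounceTest {
        public static void main(String[] args) throws InterruptedException {
            // ProducerSetup.create() is the sketch from earlier in this thread.
            Producer<String, String> producer = ProducerSetup.create();
            try {
                while (true) {
                    try {
                        producer.send(new KeyedMessage<String, String>("test-topic", "hello"));
                    } catch (FailedToSendMessageException e) {
                        // Surfaces after broker B (the only bootstrap broker) is taken
                        // down and every retry fails, even though broker A is back up.
                        System.err.println("send failed: " + e.getMessage());
                    }
                    Thread.sleep(1000);
                }
            } finally {
                producer.close();
            }
        }
    }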
> > > > > > I also want to point out that the cluster seems to work just fine during this process; it only seems to be a problem with the producer. Between all these steps I run kafka-list-topic.sh, I try the console producer and consumer, and everything is as expected.
> > > > > >
> > > > > > I also found another interesting thing when experimenting with running kafka-preferred-replica-election.sh before taking down the old broker. This script only causes any changes when the leader and the preferred replica are different. In the scenario where they are in fact different, and the new broker takes over the role of leader from the old broker, the producer does NOT fail. This makes me think that perhaps the producer only keeps metadata about topic leaders, and not about all replicas as the documentation suggests to me.
> > > > > >
> > > > > > It is clear that I am making a lot of assumptions here, and I am relatively new to Kafka, so I could very well be missing something important. The way I see it, there are a few possibilities:
> > > > > >
> > > > > > 1. Broker discovery is a supposed producer feature, and it has a bug.
> > > > > > 2. Broker discovery is not a producer feature, in which case I think many people might benefit from clearer documentation.
> > > > > > 3. I am doing something dumb, e.g. forgetting about some important configuration.
> > > > > >
> > > > > > Please let me know what you make of this.
> > > > > >
> > > > > > Thanks,
> > > > > > Christofer Hedbrandh
> > > > >
> > > > > --
> > > > > -- Guozhang
> > >
> > > --
> > > -- Guozhang
>
> --
> -- Guozhang
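To make the zk bootstrapping I mentioned above concrete, here is roughly what I do before constructing a producer. This is a sketch only: it uses the plain org.apache.zookeeper client, assumes the 0.8 broker registry layout (/brokers/ids/<id> holding JSON with "host" and "port" fields), and substitutes crude string scanning for a proper JSON parser.

    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkBrokerList {
        // Returns "host1:port1,host2:port2,..." built from the live broker registry.
        public static String currentBrokerList(String zkConnect) throws Exception {
            final CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper(zkConnect, 10000, new Watcher() {
                public void process(WatchedEvent event) {
                    if (event.getState() == Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    }
                }
            });
            try {
                connected.await();
                StringBuilder list = new StringBuilder();
                List<String> ids = zk.getChildren("/brokers/ids", false);
                for (String id : ids) {
                    String json = new String(zk.getData("/brokers/ids/" + id, false, null), "UTF-8");
                    // Crude field extraction; real code should use a JSON parser.
                    String host = between(json, "\"host\":\"", "\"");
                    String port = digitsAfter(json, "\"port\":");
                    if (list.length() > 0) list.append(",");
                    list.append(host).append(":").append(port);
                }
                return list.toString();
            } finally {
                zk.close();
            }
        }

        private static String between(String s, String open, String close) {
            int i = s.indexOf(open) + open.length();
            return s.substring(i, s.indexOf(close, i));
        }

        private static String digitsAfter(String s, String key) {
            int i = s.indexOf(key) + key.length();
            int j = i;
            while (j < s.length() && Character.isDigit(s.charAt(j))) j++;
            return s.substring(i, j);
        }
    }

The returned string is passed as metadata.broker.list each time a producer is (re)created, so a restarted process only depends on the zk ensemble addresses staying accurate.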