That should be fine since the old socket in the producer will no longer be usable after a broker is restarted.
Thanks,

Jun

On Mon, Jun 24, 2013 at 9:50 PM, Jason Rosenberg <j...@squareup.com> wrote:
> What about a non-controlled shutdown, and a restart, but the producer never attempts to send anything during the time the broker was down? That could have caused a leader change, but without the producer knowing to refresh its metadata, no?
>
>
> On Mon, Jun 24, 2013 at 9:05 PM, Jun Rao <jun...@gmail.com> wrote:
>
> > Other than controlled shutdown, the only other case that can cause the leader to change while the underlying broker is alive is when the broker expires its ZK session (likely due to GC), which should be rare. That being said, forwarding in the broker may not be a bad idea. Could you file a jira to track this?
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Mon, Jun 24, 2013 at 2:50 PM, Jason Rosenberg <j...@squareup.com> wrote:
> >
> > > Yeah,
> > >
> > > I see that with ack=0, the producer will be in a bad state anytime the leader for its partition has changed while the broker that it thinks is the leader is still up. So this is a problem in general, not only for controlled shutdown, but even for the case where you've restarted a server (without controlled shutdown), which in and of itself can force a leader change. If the producer doesn't attempt to send a message during the time the broker was down, it will never get a connection failure, will never get fresh metadata, and will subsequently start sending messages to the non-leader.
> > >
> > > Thus, I'd say this is a problem with ack=0, regardless of controlled shutdown. Any time there's a leader change, the producer will send messages into the ether. I think this is actually a severe condition that could be considered a bug. How hard would it be to have the receiving broker forward on to the leader, in this case?
> > >
> > > Jason
> > >
> > >
> > > On Mon, Jun 24, 2013 at 8:44 AM, Joel Koshy <jjkosh...@gmail.com> wrote:
> > >
> > > > I think Jason was suggesting quiescent time as a possibility only if the broker did request forwarding if it is not the leader.
> > > >
> > > > On Monday, June 24, 2013, Jun Rao wrote:
> > > >
> > > > > Jason,
> > > > >
> > > > > The quiescence time that you proposed won't work. The reason is that with ack=0, the producer starts losing data silently from the moment the leader is moved (by controlled shutdown) until the broker is shut down. So, the sooner that you can shut down the broker, the better. What we realized is that if you can use a larger batch size, ack=1 can still deliver very good throughput.
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > > On Mon, Jun 24, 2013 at 12:22 AM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > >
> > > > > > Yeah, I am using ack = 0, so that makes sense. I'll need to rethink that, it would seem. It would be nice, wouldn't it, in this case, for the broker to realize this and just forward the messages to the correct leader. Would that be possible?
> > > > > >
> > > > > > Also, it would be nice to have a second option for the controlled shutdown (e.g. controlled.shutdown.quiescence.ms), to allow the broker to wait a prescribed amount of time after the controlled shutdown before actually shutting down the server. Then I could set this value to something a little greater than the producer's 'topic.metadata.refresh.interval.ms'. This would help with hitless rolling restarts too. Currently, every producer gets a very loud "Connection Reset" with a tall stack trace each time I restart a broker. It would be nicer to have the producers still be able to produce until the metadata refresh interval expires, then get the word that the leader has moved due to the controlled shutdown, and then start producing to the new leader, all before the shutting-down server actually shuts down. Does that seem feasible?
> > > > > >
> > > > > > Jason
> > > > > >
> > > > > >
> > > > > > On Sun, Jun 23, 2013 at 8:23 PM, Jun Rao <jun...@gmail.com> wrote:
> > > > > >
> > > > > > > Jason,
> > > > > > >
> > > > > > > Are you using ack = 0 in the producer? This mode doesn't work well with controlled shutdown (this is explained in the FAQ in https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#).
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Jun 23, 2013 at 1:45 AM, Jason Rosenberg <j...@squareup.com> wrote:
> > > > > > >
> > > > > > > > I'm working on having seamless rolling restarts for my kafka servers, running 0.8. I have it so that each server will be restarted sequentially. Each server takes itself out of the load balancer (e.g. sets a status that the lb will recognize, and then waits more than long enough for the lb to stop sending meta-data requests to that server). Then I initiate the shutdown (with controlled.shutdown.enable=true). This seems to work well; however, I occasionally see warnings like this in the log from the server, after restart:
> > > > > > > >
> > > > > > > > 2013-06-23 08:28:46,770 WARN [kafka-request-handler-2] server.KafkaApis - [KafkaApi-508818741] Produce request with correlation id 7136261 from client on partition [mytopic,0] failed due to Leader not local for partition [mytopic,0] on broker 508818741
> > > > > > > >
> > > > > > > > This WARN seems to persistently repeat, until the producer client initiates a new meta-data request (e.g. every 10 minutes, by default). However, the producer doesn't log any errors/exceptions when the server is logging this WARN.
> > > > > > > >
> > > > > > > > What's happening here? Is the message silently being forwarded on to the correct leader for the partition? Is the message dropped? Are these WARNs particularly useful?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Jason
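A minimal sketch of the producer-side change discussed above (ack=1 with a larger async batch, plus the metadata refresh interval Jason mentions), assuming the 0.8 Scala/Java producer. The broker list, topic name, class name, and batch size are placeholder values, not taken from the thread; the property names are the standard 0.8 producer settings.

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class Ack1ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker list; substitute your own brokers.
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        // ack=1: wait for the leader's acknowledgement, so a send to a broker
        // that has lost leadership fails (and triggers a metadata refresh and
        // retry) instead of disappearing silently, as happens with ack=0.
        props.put("request.required.acks", "1");
        // Recover throughput by batching asynchronously, per Jun's suggestion.
        props.put("producer.type", "async");
        props.put("batch.num.messages", "500"); // placeholder batch size
        // Default is 10 minutes; with ack=0 this is how long the producer can
        // keep writing to a stale leader before it ever notices.
        props.put("topic.metadata.refresh.interval.ms", "600000");
        props.put("serializer.class", "kafka.serializer.StringEncoder");

        Producer<String, String> producer =
                new Producer<String, String>(new ProducerConfig(props));
        try {
            producer.send(new KeyedMessage<String, String>("mytopic", "hello"));
        } finally {
            producer.close();
        }
    }
}

The batch size here is only illustrative; the trade-off Jun describes is between batching enough to offset the per-request acknowledgement and how much data you are willing to buffer in the async queue.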
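On the broker side, the only setting the rolling-restart procedure above relies on is controlled.shutdown.enable. A minimal server.properties sketch follows; the two retry settings are companions in the 0.8 broker config shown here as assumptions with typical values, since the thread itself does not mention them. The controlled.shutdown.quiescence.ms option discussed above is only a proposal in this thread, not an actual broker setting.

# Move partition leadership off this broker before it stops (from the thread).
controlled.shutdown.enable=true
# Assumed companion settings (not discussed in the thread); check your broker's
# KafkaConfig for the exact names and defaults in your build.
controlled.shutdown.max.retries=3
controlled.shutdown.retry.backoff.ms=5000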