Filed https://issues.apache.org/jira/browse/KAFKA-955
On Mon, Jun 24, 2013 at 10:14 PM, Jason Rosenberg <j...@squareup.com> wrote:

Jun,

To be clear, this whole discussion was started because I am clearly seeing "failed due to Leader not local" on the last broker restarted, after all the controlled shutting down has completed and all brokers have restarted.

This leads me to believe that a client made a metadata request and found out that server A was the leader for its partition, then server A was restarted, and then the client makes repeated producer requests to server A without encountering a broken socket. Thus, I'm not sure it's correct that the socket is invalidated in that case after a restart.

Alternatively, could it be that the client (which sends messages to multiple topics) gets metadata updates for multiple topics, but doesn't attempt to send a message to topicX until after the leader has changed and server A has been restarted? In this case, if it's the first time the producer sends to topicX, does it only then create a new socket?

Jason

On Mon, Jun 24, 2013 at 10:00 PM, Jun Rao <jun...@gmail.com> wrote:

That should be fine, since the old socket in the producer will no longer be usable after a broker is restarted.

Thanks,

Jun

On Mon, Jun 24, 2013 at 9:50 PM, Jason Rosenberg <j...@squareup.com> wrote:

What about a non-controlled shutdown and a restart, where the producer never attempts to send anything during the time the broker was down? That could have caused a leader change, but without the producer knowing to refresh its metadata, no?

On Mon, Jun 24, 2013 at 9:05 PM, Jun Rao <jun...@gmail.com> wrote:

Other than controlled shutdown, the only other case that can cause the leader to change while the underlying broker is alive is when the broker expires its ZK session (likely due to GC), which should be rare. That being said, forwarding in the broker may not be a bad idea. Could you file a jira to track this?

Thanks,

Jun

On Mon, Jun 24, 2013 at 2:50 PM, Jason Rosenberg <j...@squareup.com> wrote:

Yeah,

I see that with ack=0, the producer will be in a bad state any time the leader for its partition has changed while the broker that it thinks is the leader is still up. So this is a problem in general, not only for controlled shutdown, but even for the case where you've restarted a server (without controlled shutdown), which in and of itself can force a leader change. If the producer doesn't attempt to send a message during the time the broker was down, it will never get a connection failure, never get fresh metadata, and will subsequently start sending messages to the non-leader.

Thus, I'd say this is a problem with ack=0, regardless of controlled shutdown. Any time there's a leader change, the producer will send messages into the ether. I think this is actually a severe condition that could be considered a bug. How hard would it be to have the receiving broker forward on to the leader, in this case?

Jason
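To make the ack=0 failure mode described above concrete, here is a minimal sketch of the 0.8 producer settings in play. The broker list is a placeholder and the values shown are the defaults as I understand them, not taken from Jason's actual setup:

    # 0.8 producer config (illustrative values only)
    metadata.broker.list=broker1:9092,broker2:9092,broker3:9092
    # acks=0 is fire-and-forget: the broker returns no response, so a stale
    # leader is only noticed on a socket failure or a periodic metadata refresh
    request.required.acks=0
    # default 600000 ms (10 minutes); until it elapses, the producer keeps
    # sending to whichever broker it last learned was the leader
    topic.metadata.refresh.interval.ms=600000

As I understand the 0.8 producer, with request.required.acks=1 the "not leader" error code comes back to the producer, which can then refresh its metadata and retry; with acks=0 the error is only logged on the broker side and the message is lost.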
On Mon, Jun 24, 2013 at 8:44 AM, Joel Koshy <jjkosh...@gmail.com> wrote:

I think Jason was suggesting quiescent time as a possibility only if the broker did request forwarding when it is not the leader.

On Monday, June 24, 2013, Jun Rao wrote:

Jason,

The quiescence time that you proposed won't work. The reason is that with ack=0, the producer starts losing data silently from the moment the leader is moved (by controlled shutdown) until the broker is shut down. So, the sooner that you can shut down the broker, the better. What we realized is that if you can use a larger batch size, ack=1 can still deliver very good throughput.

Thanks,

Jun

On Mon, Jun 24, 2013 at 12:22 AM, Jason Rosenberg <j...@squareup.com> wrote:

Yeah, I am using ack = 0, so that makes sense. I'll need to rethink that, it would seem. It would be nice, wouldn't it, in this case, for the broker to realize this and just forward the messages to the correct leader. Would that be possible?

Also, it would be nice to have a second option for the controlled shutdown (e.g. controlled.shutdown.quiescence.ms), to allow the broker to wait a prescribed amount of time after the controlled shutdown before actually shutting down the server. Then I could set this value to something a little greater than the producer's topic.metadata.refresh.interval.ms. This would help with hitless rolling restarts too. Currently, every producer gets a very loud "Connection Reset" with a tall stack trace each time I restart a broker. It would be nicer to have the producers still be able to produce until the metadata refresh interval expires, then get word that the leader has moved due to the controlled shutdown, and then start producing to the new leader, all before the shutting-down server actually shuts down. Does that seem feasible?

Jason
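Jun's suggestion of ack=1 with a larger batch size maps roughly onto producer settings like the following. This is a sketch with assumed values (the thread doesn't specify numbers, and the batching settings apply only to the async producer):

    # 0.8 producer config (illustrative values only)
    metadata.broker.list=broker1:9092,broker2:9092,broker3:9092
    # acks=1: the leader acknowledges each request, so a "not leader" error
    # comes back to the producer instead of the message being dropped silently
    request.required.acks=1
    # async batching to win back the throughput lost to waiting for acks
    producer.type=async
    batch.num.messages=200
    queue.buffering.max.ms=100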
On Sun, Jun 23, 2013 at 8:23 PM, Jun Rao <jun...@gmail.com> wrote:

Jason,

Are you using ack = 0 in the producer? This mode doesn't work well with controlled shutdown (this is explained in the FAQ in https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools).

Thanks,

Jun

On Sun, Jun 23, 2013 at 1:45 AM, Jason Rosenberg <j...@squareup.com> wrote:

I'm working on having seamless rolling restarts for my kafka servers, running 0.8. I have it so that each server will be restarted sequentially. Each server takes itself out of the load balancer (e.g. it sets a status that the lb will recognize, and then waits more than long enough for the lb to stop sending metadata requests to that server). Then I initiate the shutdown (with controlled.shutdown.enable=true). This seems to work well; however, I occasionally see warnings like this in the log from the server, after restart:

2013-06-23 08:28:46,770 WARN [kafka-request-handler-2] server.KafkaApis - [KafkaApi-508818741] Produce request with correlation id 7136261 from client on partition [mytopic,0] failed due to Leader not local for partition [mytopic,0] on broker 508818741

This WARN seems to repeat persistently, until the producer client initiates a new metadata request (e.g. every 10 minutes, by default). However, the producer doesn't log any errors/exceptions while the server is logging this WARN.

What's happening here? Is the message silently being forwarded on to the correct leader for the partition? Is the message dropped? Are these WARNs particularly useful?

Thanks,

Jason
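For reference, a sketch of the broker-side knobs involved in the rolling-restart procedure described above. The retry values shown are the 0.8 defaults as I recall them, and the quiescence setting proposed earlier in the thread is hypothetical, not an existing config:

    # 0.8 broker config (server.properties), illustrative
    # hand leadership off to other replicas before shutting down
    controlled.shutdown.enable=true
    # these existing knobs only govern how the leadership hand-off is retried,
    # not how long the broker lingers afterwards
    controlled.shutdown.max.retries=3
    controlled.shutdown.retry.backoff.ms=5000
    # controlled.shutdown.quiescence.ms -- proposed in this thread only; not a real setting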