Do you happen to have broker logs and state-change logs from the controlled shutdown attempt?
In theory, the producer should not really see a disconnect - it should get a
NotLeaderForPartition error (because leadership is re-assigned before the
shutdown), which causes it to refresh its metadata. My guess is that
leadership actually *was* transferred, but the broker that answered the
metadata request had not yet gotten the news. In 0.8.2 we had some bugs
around how leadership information is propagated to all nodes. This was
resolved in 0.9.0.0, so perhaps an upgrade will help.

Gwen
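As a rough illustration of the retry path described above (a sketch only; the
hosts, topic name, and values below are placeholders, not taken from this
thread), the Java producer will refresh its metadata and resend on a
not-leader error as long as retries are enabled:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RetryOnLeaderChange {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder hosts
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all");
            // With retries > 0, a NotLeaderForPartition error on a produce request
            // triggers a metadata refresh and a resend to the new leader instead of
            // failing the send. Values here are illustrative.
            props.put("retries", "5");
            props.put("retry.backoff.ms", "200");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            }
        }
    }

Whether the resend actually reaches the new leader depends on the refreshed
metadata being accurate, which is exactly the gap in question here.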
On Wed, Jan 13, 2016 at 11:22 AM, Luke Steensen <
luke.steen...@braintreepayments.com> wrote:

> Yes, that was my intention and we have both of those configs turned on.
> For some reason, however, the controlled shutdown wasn't transferring
> leadership of all partitions, which caused the issues I described in my
> initial email.
>
> On Wed, Jan 13, 2016 at 12:05 AM, Ján Koščo <3k.stan...@gmail.com> wrote:
>
> > Not sure, but should the combination of auto.leader.rebalance.enable=true
> > and controlled.shutdown.enable=true sort this out for you?
> >
> > 2016-01-13 1:13 GMT+01:00 Scott Reynolds <sreyno...@twilio.com>:
> >
> > > We use 0.9.0.0 and it is working fine. Not all the features work, and
> > > a few things make assumptions about how zookeeper is used. But as a
> > > tool for provisioning, expanding, and failure recovery it is working
> > > fine so far.
> > >
> > > *knocks on wood*
> > >
> > > On Tue, Jan 12, 2016 at 4:08 PM, Luke Steensen <
> > > luke.steen...@braintreepayments.com> wrote:
> > >
> > > > Ah, that's a good idea. Do you know if kafka-manager works with
> > > > kafka 0.9 by chance? That would be a nice improvement over the cli
> > > > tools.
> > > >
> > > > Thanks,
> > > > Luke
> > > >
> > > > On Tue, Jan 12, 2016 at 4:53 PM, Scott Reynolds <sreyno...@twilio.com>
> > > > wrote:
> > > >
> > > > > Luke,
> > > > >
> > > > > We practice the same immutable pattern on AWS. To decommission a
> > > > > broker, we use partition reassignment first to move the partitions
> > > > > off of the node, followed by a preferred leadership election. To
> > > > > do this with a web UI, so that you can handle it on lizard brain
> > > > > at 3 am, we have the Yahoo Kafka Manager running on the broker
> > > > > hosts.
> > > > >
> > > > > https://github.com/yahoo/kafka-manager
> > > > >
> > > > > On Tue, Jan 12, 2016 at 2:50 PM, Luke Steensen <
> > > > > luke.steen...@braintreepayments.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > We've run into a bit of a head-scratcher with a new kafka
> > > > > > deployment and I'm curious if anyone has any ideas.
> > > > > >
> > > > > > A little bit of background: this deployment uses "immutable
> > > > > > infrastructure" on AWS, so instead of configuring the host
> > > > > > in-place, we stop the broker, tear down the instance, and
> > > > > > replace it wholesale. My understanding was that controlled
> > > > > > shutdown combined with producer retries would allow this
> > > > > > operation to be zero-downtime. Unfortunately, things aren't
> > > > > > working quite as I expected.
> > > > > >
> > > > > > After poring over the logs, I pieced together the following
> > > > > > chain of events:
> > > > > >
> > > > > > 1. our operations script stops the broker process and proceeds
> > > > > > to terminate the instance
> > > > > > 2. our producer application detects the disconnect and requests
> > > > > > updated metadata from another node
> > > > > > 3. updated metadata is returned successfully, but the downed
> > > > > > broker is still listed as leader for a single partition of the
> > > > > > given topic
> > > > > > 4. on the next produce request bound for that partition, the
> > > > > > producer attempts to initiate a connection to the downed host
> > > > > > 5. because the instance has been terminated, the node is now in
> > > > > > the "connecting" state until the system-level tcp timeout
> > > > > > expires (2-3 minutes)
> > > > > > 6. during this time, all produce requests to the given
> > > > > > partition sit in the record accumulator until they expire and
> > > > > > are immediately failed without retries
> > > > > > 7. the tcp timeout finally fires, the node is recognized as
> > > > > > disconnected, more metadata is fetched, and things return to
> > > > > > sanity
> > > > > >
> > > > > > I was able to work around the issue by waiting 60 seconds
> > > > > > between shutting down the broker and terminating that instance,
> > > > > > as well as raising request.timeout.ms on the producer to 2x our
> > > > > > zookeeper timeout. This gives the producer a much quicker
> > > > > > "connection refused" error instead of the connection timeout
> > > > > > and seems to give enough time for normal failure detection and
> > > > > > leader election to kick in before requests are timed out.
> > > > > >
> > > > > > So two questions really: (1) are there any known issues that
> > > > > > would cause a controlled shutdown to fail to release leadership
> > > > > > of all partitions? and (2) should the producer be timing out
> > > > > > connection attempts more proactively?
> > > > > >
> > > > > > Thanks,
> > > > > > Luke
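For reference, a minimal sketch of the producer-side half of the workaround
Luke describes in his original message, with a send callback that should
surface the batch-expiry failure from step 6 as a TimeoutException. The
hosts, topic, and timeout values are illustrative placeholders; 12000 ms
simply stands in for "2x the ZooKeeper timeout" mentioned above.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.TimeoutException;

    public class ExpiryAwareProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // placeholder host
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            // Give failure detection and leader election time to finish before
            // batches expire; 12000 ms is a stand-in for 2x the ZooKeeper timeout.
            props.put("request.timeout.ms", "12000");
            props.put("retries", "3");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"),
                    (metadata, exception) -> {
                        if (exception instanceof TimeoutException) {
                            // Batch expired in the record accumulator (step 6 above).
                            System.err.println("Record expired before send: " + exception);
                        } else if (exception != null) {
                            System.err.println("Send failed: " + exception);
                        }
                    });
                producer.flush();
            }
        }
    }

The operational half of the workaround, waiting before terminating the
instance, is still needed so that connection attempts fail fast with
"connection refused" rather than hanging until the OS-level TCP timeout.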