Luke,

We practice the same immutable pattern on AWS. To decommission a broker, we
first use partition reassignment to move the partitions off the node, then
run a preferred replica leader election. To do this from a web UI, so that
you can handle it on lizard brain at 3 am, we have the Yahoo Kafka Manager
(https://github.com/yahoo/kafka-manager) running on the broker hosts.
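
If you'd rather script it than click through the UI, the same two steps can
be driven with the stock command line tools. A rough sketch, where the
broker ids, topic, replica lists, and zookeeper address are placeholders for
your own layout and broker 3 is the one being retired:

reassign.json, listing new replica assignments that exclude broker 3:

{"version":1,"partitions":[
  {"topic":"my-topic","partition":0,"replicas":[1,2]},
  {"topic":"my-topic","partition":1,"replicas":[2,1]}
]}

# move the partitions off the retiring broker, then check progress:
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassign.json --execute
bin/kafka-reassign-partitions.sh --zookeeper zk1:2181 \
  --reassignment-json-file reassign.json --verify

# once the reassignment completes, run a preferred replica leader
# election so leadership settles on the remaining brokers:
bin/kafka-preferred-replica-election.sh --zookeeper zk1:2181
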
On Tue, Jan 12, 2016 at 2:50 PM, Luke Steensen
<luke.steen...@braintreepayments.com> wrote:

> Hello,
>
> We've run into a bit of a head-scratcher with a new kafka deployment and
> I'm curious if anyone has any ideas.
>
> A little bit of background: this deployment uses "immutable
> infrastructure" on AWS, so instead of configuring the host in-place, we
> stop the broker, tear down the instance, and replace it wholesale. My
> understanding was that controlled shutdown combined with producer retries
> would allow this operation to be zero-downtime. Unfortunately, things
> aren't working quite as I expected.
>
> After poring over the logs, I pieced together the following chain of
> events:
>
> 1. our operations script stops the broker process and proceeds to
>    terminate the instance
> 2. our producer application detects the disconnect and requests updated
>    metadata from another node
> 3. updated metadata is returned successfully, but the downed broker is
>    still listed as leader for a single partition of the given topic
> 4. on the next produce request bound for that partition, the producer
>    attempts to initiate a connection to the downed host
> 5. because the instance has been terminated, the node is now in the
>    "connecting" state until the system-level tcp timeout expires (2-3
>    minutes)
> 6. during this time, all produce requests to the given partition sit in
>    the record accumulator until they expire and are immediately failed
>    without retries
> 7. the tcp timeout finally fires, the node is recognized as disconnected,
>    more metadata is fetched, and things return to sanity
>
> I was able to work around the issue by waiting 60 seconds between shutting
> down the broker and terminating that instance, as well as raising
> request.timeout.ms on the producer to 2x our zookeeper timeout. This gives
> the producer a much quicker "connection refused" error instead of the
> connection timeout, and seems to give enough time for normal failure
> detection and leader election to kick in before requests are timed out.
>
> So two questions, really: (1) are there any known issues that would cause
> a controlled shutdown to fail to release leadership of all partitions? and
> (2) should the producer be timing out connection attempts more
> proactively?
>
> Thanks,
> Luke
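
One more thing that might help with the race you describe: before the
instance is actually terminated, it is cheap to confirm that the old broker
no longer leads any partitions. A rough sketch, again with placeholder
broker id, topic, and zookeeper address:

# after the controlled shutdown, check whether broker 3 still leads
# any partitions of the topic:
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic my-topic \
  | grep "Leader: 3" \
  && echo "broker 3 still leads partitions; hold off on terminating"

If the grep finds nothing, leadership has moved off the node and the
instance can be torn down without leaving producers pointed at a dead
leader.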