Ah, that's a good idea. Do you know if kafka-manager works with Kafka 0.9, by chance? That would be a nice improvement over the CLI tools.
Thanks,
Luke

On Tue, Jan 12, 2016 at 4:53 PM, Scott Reynolds <sreyno...@twilio.com> wrote:

> Luke,
>
> We practice the same immutable pattern on AWS. To decommission a broker,
> we use partition reassignment to move the partitions off of the node,
> followed by a preferred leader election. To do this with a web UI, so that
> you can handle it on lizard brain at 3 am, we have the Yahoo Kafka Manager
> running on the broker hosts.
>
> https://github.com/yahoo/kafka-manager
>
> On Tue, Jan 12, 2016 at 2:50 PM, Luke Steensen <
> luke.steen...@braintreepayments.com> wrote:
>
> > Hello,
> >
> > We've run into a bit of a head-scratcher with a new Kafka deployment and
> > I'm curious if anyone has any ideas.
> >
> > A little bit of background: this deployment uses "immutable
> > infrastructure" on AWS, so instead of configuring the host in-place, we
> > stop the broker, tear down the instance, and replace it wholesale. My
> > understanding was that controlled shutdown combined with producer
> > retries would allow this operation to be zero-downtime. Unfortunately,
> > things aren't working quite as I expected.
> >
> > After poring over the logs, I pieced together the following chain of
> > events:
> >
> > 1. our operations script stops the broker process and proceeds to
> > terminate the instance
> > 2. our producer application detects the disconnect and requests updated
> > metadata from another node
> > 3. updated metadata is returned successfully, but the downed broker is
> > still listed as leader for a single partition of the given topic
> > 4. on the next produce request bound for that partition, the producer
> > attempts to initiate a connection to the downed host
> > 5. because the instance has been terminated, the node is now in the
> > "connecting" state until the system-level tcp timeout expires (2-3
> > minutes)
> > 6. during this time, all produce requests to the given partition sit in
> > the record accumulator until they expire and are immediately failed
> > without retries
> > 7. the tcp timeout finally fires, the node is recognized as
> > disconnected, more metadata is fetched, and things return to sanity
> >
> > I was able to work around the issue by waiting 60 seconds between
> > shutting down the broker and terminating the instance, as well as
> > raising request.timeout.ms on the producer to 2x our zookeeper timeout.
> > This gives the producer a much quicker "connection refused" error
> > instead of the connection timeout and seems to give enough time for
> > normal failure detection and leader election to kick in before requests
> > are timed out.
> >
> > So two questions, really: (1) are there any known issues that would
> > cause a controlled shutdown to fail to release leadership of all
> > partitions? and (2) should the producer be timing out connection
> > attempts more proactively?
> >
> > Thanks,
> > Luke
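
P.S. In case it helps anyone else who hits this, here is a rough sketch of
the producer-side half of the workaround described above, assuming the 0.9
Java producer (the broker list, topic name, and exact timeout values are
placeholders, not our real settings):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ResilientProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder broker list; substitute your own.
            props.put("bootstrap.servers", "broker1:9092,broker2:9092");
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            // Let sends be retried after a leader change instead of
            // failing them immediately.
            props.put("retries", "5");

            // The workaround from the thread: raise request.timeout.ms to
            // roughly 2x the zookeeper session timeout (illustrative
            // numbers: 6s zk timeout -> 12s here), so batches sit in the
            // record accumulator long enough for failure detection and
            // leader election to finish before they expire.
            props.put("request.timeout.ms", "12000");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            producer.close();
        }
    }

The other half of the workaround (waiting ~60 seconds between stopping the
broker and terminating the instance) lives in the operations script, so it
isn't shown here; the producer-side point is just that request.timeout.ms
has to comfortably exceed the time the cluster needs to notice the dead
broker and elect new leaders.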