We use 0.9.0.0 and it is working fine. Not all of the features work, and it makes a few assumptions about how ZooKeeper is used, but as a tool for provisioning, expansion, and failure recovery it has held up well so far.
*knocks on wood*

On Tue, Jan 12, 2016 at 4:08 PM, Luke Steensen <luke.steen...@braintreepayments.com> wrote:

> Ah, that's a good idea. Do you know if kafka-manager works with kafka 0.9
> by chance? That would be a nice improvement over the cli tools.
>
> Thanks,
> Luke
>
> On Tue, Jan 12, 2016 at 4:53 PM, Scott Reynolds <sreyno...@twilio.com> wrote:
>
> > Luke,
> >
> > We practice the same immutable pattern on AWS. To decommission a broker,
> > we first use partition reassignment to move the partitions off of the
> > node, then preferred leadership election. To do this with a web UI, so
> > that you can handle it on lizard brain at 3 am, we have the Yahoo Kafka
> > Manager running on the broker hosts.
> >
> > https://github.com/yahoo/kafka-manager
> >
> > On Tue, Jan 12, 2016 at 2:50 PM, Luke Steensen <luke.steen...@braintreepayments.com> wrote:
> >
> > > Hello,
> > >
> > > We've run into a bit of a head-scratcher with a new kafka deployment
> > > and I'm curious if anyone has any ideas.
> > >
> > > A little bit of background: this deployment uses "immutable
> > > infrastructure" on AWS, so instead of configuring the host in-place,
> > > we stop the broker, tear down the instance, and replace it wholesale.
> > > My understanding was that controlled shutdown combined with producer
> > > retries would allow this operation to be zero-downtime. Unfortunately,
> > > things aren't working quite as I expected.
> > >
> > > After poring over the logs, I pieced together the following chain of
> > > events:
> > >
> > > 1. our operations script stops the broker process and proceeds to
> > > terminate the instance
> > > 2. our producer application detects the disconnect and requests
> > > updated metadata from another node
> > > 3. updated metadata is returned successfully, but the downed broker
> > > is still listed as leader for a single partition of the given topic
> > > 4. on the next produce request bound for that partition, the producer
> > > attempts to initiate a connection to the downed host
> > > 5. because the instance has been terminated, the node is now in the
> > > "connecting" state until the system-level tcp timeout expires (2-3
> > > minutes)
> > > 6. during this time, all produce requests to the given partition sit
> > > in the record accumulator until they expire and are immediately failed
> > > without retries
> > > 7. the tcp timeout finally fires, the node is recognized as
> > > disconnected, more metadata is fetched, and things return to sanity
> > >
> > > I was able to work around the issue by waiting 60 seconds between
> > > shutting down the broker and terminating that instance, as well as
> > > raising request.timeout.ms on the producer to 2x our zookeeper timeout.
> > > This gives the producer a much quicker "connection refused" error
> > > instead of the connection timeout, and seems to give enough time for
> > > normal failure detection and leader election to kick in before requests
> > > are timed out.
> > >
> > > So two questions really: (1) are there any known issues that would
> > > cause a controlled shutdown to fail to release leadership of all
> > > partitions? and (2) should the producer be timing out connection
> > > attempts more proactively?
> > >
> > > Thanks,
> > > Luke
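
A minimal sketch of the producer-side settings discussed above, assuming the 0.9 Java producer. The bootstrap servers, topic name, and concrete retry/timeout values are placeholders rather than anyone's exact configuration; the callback just shows where expired batches surface as TimeoutExceptions.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerTimeoutSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Placeholder bootstrap list; point this at real brokers.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

            // Let retriable errors (e.g. a leader moving) be retried instead of
            // failing the send immediately.
            props.put(ProducerConfig.RETRIES_CONFIG, 5);

            // The workaround from the thread: raise request.timeout.ms so batches
            // are not expired out of the record accumulator before failure
            // detection and leader election have a chance to finish.
            props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);

            // Refresh metadata more often than the 5-minute default so a moved
            // leader is picked up sooner.
            props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, 30000);

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("example-topic", "key", "value");
                // Batches expired in the accumulator complete the callback with a
                // TimeoutException, which is where step 6 of the chain above
                // becomes visible to the application.
                producer.send(record, (metadata, exception) -> {
                    if (exception != null) {
                        System.err.println("send failed: " + exception);
                    }
                });
            }
        }
    }

As described in the thread, batches that expire in the accumulator are failed without going through the retry path, which is why raising request.timeout.ms (rather than retries alone) made the difference.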
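
And a rough sketch of the drain-then-elect decommissioning flow described above, as plain Java with no Kafka dependencies: a hypothetical helper that builds the reassignment JSON accepted by the stock kafka-reassign-partitions.sh tool (via --reassignment-json-file), replacing the retiring broker in each replica list. Topic names, partition counts, and broker ids are made up. Once the reassignment completes, a preferred replica election moves leadership to the first replica in each list so the retiring broker holds no leaders when it is shut down.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class DrainBrokerPlan {

        // Current replica assignment for one partition (illustrative only).
        static class PartitionAssignment {
            final String topic;
            final int partition;
            final List<Integer> replicas;
            PartitionAssignment(String topic, int partition, Integer... replicas) {
                this.topic = topic;
                this.partition = partition;
                this.replicas = Arrays.asList(replicas);
            }
        }

        // For each partition hosted on the broker being drained, substitute the
        // first surviving broker that is not already a replica, keeping the
        // replication factor the same, and render the reassignment JSON.
        static String buildPlan(List<PartitionAssignment> current, int drained, List<Integer> survivors) {
            StringBuilder sb = new StringBuilder("{\"version\":1,\"partitions\":[");
            boolean first = true;
            for (PartitionAssignment pa : current) {
                if (!pa.replicas.contains(drained)) continue; // nothing to move
                List<Integer> newReplicas = new ArrayList<>();
                for (int broker : pa.replicas) {
                    if (broker != drained) {
                        newReplicas.add(broker);
                        continue;
                    }
                    for (int candidate : survivors) {
                        if (!pa.replicas.contains(candidate) && !newReplicas.contains(candidate)) {
                            newReplicas.add(candidate);
                            break;
                        }
                    }
                }
                if (!first) sb.append(",");
                first = false;
                sb.append("{\"topic\":\"").append(pa.topic)
                  .append("\",\"partition\":").append(pa.partition)
                  .append(",\"replicas\":").append(newReplicas).append("}");
            }
            return sb.append("]}").toString();
        }

        public static void main(String[] args) {
            // Made-up assignments on a three-broker cluster; broker 3 is going away.
            List<PartitionAssignment> current = Arrays.asList(
                    new PartitionAssignment("payments", 0, 3, 1),
                    new PartitionAssignment("payments", 1, 1, 3),
                    new PartitionAssignment("payments", 2, 2, 1));
            System.out.println(buildPlan(current, 3, Arrays.asList(1, 2)));
        }
    }

In the thread both steps are driven through the Kafka Manager web UI rather than by hand; the helper above is only meant to show what the underlying reassignment plan looks like.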