Luke,

We practice the same immutable pattern on AWS. To decommission a broker, we
first use partition reassignment to move the partitions off the node, followed
by a preferred leader election. To do all of this through a web UI, so that you
can handle it on lizard brain at 3 am, we run the Yahoo Kafka Manager on the
broker hosts.

https://github.com/yahoo/kafka-manager
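
If you'd rather script those two steps than click through a UI, here is a rough
sketch using the Java AdminClient from newer client releases; the broker IDs,
topic, and address below are placeholders, not anything from your cluster:

    import java.util.*;
    import org.apache.kafka.clients.admin.*;
    import org.apache.kafka.common.ElectionType;
    import org.apache.kafka.common.TopicPartition;

    public class DecommissionSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Point at any broker that is staying in the cluster.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-2:9092");
            try (Admin admin = Admin.create(props)) {
                // Step 1: move this partition's replicas onto brokers 2 and 3,
                // dropping broker 1 (the node being decommissioned).
                TopicPartition tp = new TopicPartition("example-topic", 0);
                Map<TopicPartition, Optional<NewPartitionReassignment>> plan = new HashMap<>();
                plan.put(tp, Optional.of(new NewPartitionReassignment(Arrays.asList(2, 3))));
                admin.alterPartitionReassignments(plan).all().get();

                // Wait until no reassignments are still in flight before touching
                // leadership; the future above only confirms the request was accepted.
                while (!admin.listPartitionReassignments().reassignments().get().isEmpty()) {
                    Thread.sleep(1000);
                }

                // Step 2: preferred leader election, so leadership settles on the
                // surviving replicas before the old broker is shut down.
                admin.electLeaders(ElectionType.PREFERRED, Collections.singleton(tp)).all().get();
            }
        }
    }

The stock kafka-reassign-partitions.sh and kafka-preferred-replica-election.sh
scripts (and Kafka Manager's UI) perform the same two operations.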

On Tue, Jan 12, 2016 at 2:50 PM, Luke Steensen <
luke.steen...@braintreepayments.com> wrote:

> Hello,
>
> We've run into a bit of a head-scratcher with a new Kafka deployment and
> I'm curious if anyone has any ideas.
>
> A little bit of background: this deployment uses "immutable infrastructure"
> on AWS, so instead of configuring the host in-place, we stop the broker,
> tear down the instance, and replace it wholesale. My understanding was that
> controlled shutdown combined with producer retries would allow this
> operation to be zero-downtime. Unfortunately, things aren't working quite
> as I expected.
>
> After poring over the logs, I pieced together the following chain of events:
>
>    1. our operations script stops the broker process and proceeds to
>    terminate the instance
>    2. our producer application detects the disconnect and requests updated
>    metadata from another node
>    3. updated metadata is returned successfully, but the downed broker is
>    still listed as leader for a single partition of the given topic
>    4. on the next produce request bound for that partition, the producer
>    attempts to initiate a connection to the downed host
>    5. because the instance has been terminated, the node is now in the
>    "connecting" state until the system-level tcp timeout expires (2-3
>    minutes)
>    6. during this time, all produce requests to the given partition sit in
>    the record accumulator until they expire and are immediately failed
>    without retries
>    7. the tcp timeout finally fires, the node is recognized as
>    disconnected, more metadata is fetched, and things return to sanity
>
> I was able to work around the issue by waiting 60 seconds between shutting
> down the broker and terminating that instance, as well as raising
> request.timeout.ms on the producer to 2x our zookeeper timeout. This gives
> the producer a much quicker "connection refused" error instead of the
> connection timeout, and seems to give enough time for normal failure
> detection and leader election to kick in before requests are timed out.
>
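For reference, the producer-side settings described above come out roughly like
this with the Java client; the broker list and values are illustrative, not
taken from Luke's setup:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerTimeoutSketch {
        public static Producer<String, String> build() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092");
            // Allow retries so a request that fails during leader failover is
            // retried against the new leader instead of failing outright.
            props.put(ProducerConfig.RETRIES_CONFIG, 5);
            // Roughly 2x the ZooKeeper session timeout (example: 12s against a 6s
            // session timeout), so batches don't expire in the record accumulator
            // before failure detection and leader election have finished.
            props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 12000);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            return new KafkaProducer<>(props);
        }
    }
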
> So two questions really: (1) are there any known issues that would cause a
> controlled shutdown to fail to release leadership of all partitions? and
> (2) should the producer be timing out connection attempts more proactively?
>
> Thanks,
> Luke
>
