Hello,

We've run into a bit of a head-scratcher with a new Kafka deployment and I'm curious if anyone has any ideas.
A little background: this deployment uses "immutable infrastructure" on AWS, so instead of reconfiguring hosts in place, we stop the broker, tear down the instance, and replace it wholesale. My understanding was that controlled shutdown combined with producer retries would make this operation zero-downtime. Unfortunately, things aren't working quite as I expected. After poring over the logs, I pieced together the following chain of events:

1. Our operations script stops the broker process and proceeds to terminate the instance.
2. Our producer application detects the disconnect and requests updated metadata from another node.
3. The updated metadata is returned successfully, but the downed broker is still listed as leader for a single partition of the given topic.
4. On the next produce request bound for that partition, the producer attempts to open a connection to the downed host.
5. Because the instance has been terminated, the node sits in the "connecting" state until the system-level TCP timeout expires (2-3 minutes).
6. During this time, all produce requests for that partition sit in the record accumulator until they expire, at which point they are failed immediately without retries.
7. The TCP timeout finally fires, the node is recognized as disconnected, fresh metadata is fetched, and things return to sanity.

I was able to work around the issue by waiting 60 seconds between shutting down the broker and terminating the instance, and by raising request.timeout.ms on the producer to 2x our zookeeper timeout. The delay means the producer gets a much quicker "connection refused" error instead of a connection timeout, and the longer request timeout seems to give normal failure detection and leader election enough time to kick in before requests expire.

So, two questions really: (1) are there any known issues that would cause a controlled shutdown to fail to release leadership of all partitions? And (2) should the producer be timing out connection attempts more proactively?

Thanks,
Luke
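
P.S. For concreteness, here is roughly what the producer config change looks like. The 6000 ms zookeeper session timeout below is just an illustrative value (ours differs); the point is that request.timeout.ms ends up at roughly 2x whatever your session timeout is:

    # producer config sketch -- values are illustrative, not our actual settings
    # assuming a zookeeper session timeout of 6000 ms, 2x gives:
    request.timeout.ms=12000
    # retries so the producer re-sends once leadership has moved
    retries=3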