Hello,

We've run into a bit of a head-scratcher with a new Kafka deployment and I'm curious if anyone has any ideas.
A little background: this deployment uses "immutable infrastructure" on AWS, so instead of reconfiguring hosts in place, we stop the broker, tear down the instance, and replace it wholesale. My understanding was that controlled shutdown combined with producer retries would make this operation zero-downtime. Unfortunately, things aren't working quite as I expected. After poring over the logs, I pieced together the following chain of events:

1. Our operations script stops the broker process and proceeds to terminate the instance.
2. Our producer application detects the disconnect and requests updated metadata from another node.
3. The updated metadata is returned successfully, but the downed broker is still listed as leader for a single partition of the given topic.
4. On the next produce request bound for that partition, the producer attempts to open a connection to the downed host.
5. Because the instance has been terminated, the node sits in the "connecting" state until the system-level TCP timeout expires (2-3 minutes).
6. During this time, all produce requests for that partition sit in the record accumulator until they expire, at which point they are failed immediately without retries.
7. The TCP timeout finally fires, the node is recognized as disconnected, fresh metadata is fetched, and things return to sanity.

I was able to work around the issue by waiting 60 seconds between shutting down the broker and terminating the instance, and by raising request.timeout.ms on the producer to 2x our zookeeper timeout. The delay means the producer gets a much quicker "connection refused" error instead of a connection timeout, and the longer request timeout seems to give normal failure detection and leader election enough time to kick in before requests expire.

So, two questions really: (1) are there any known issues that would cause a controlled shutdown to fail to release leadership of all partitions? And (2) should the producer be timing out connection attempts more proactively?

Thanks,
Luke
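
P.S. For concreteness, here is roughly what the producer config change looks like. The 6000 ms zookeeper session timeout below is just an illustrative value (ours differs); the point is that request.timeout.ms ends up at roughly 2x whatever your session timeout is:

    # producer config sketch -- values are illustrative, not our actual settings
    # assuming a zookeeper session timeout of 6000 ms, 2x gives:
    request.timeout.ms=12000
    # retries so the producer re-sends once leadership has moved
    retries=3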