Hi! I help operate a Kafka cluster with a large number of clients (2-10k established connections to the Kafka port at any given time on a given broker).
When we take one of these brokers down for maintenance using a controlled shutdown, all is fine. Bringing it back online is similarly fine - the broker re-joins the cluster and gets back in sync quickly. However, when initiating a preferred replica election, a minority of producers get stuck in a state where they cannot produce to the restored broker, emitting errors like this:

    Expiring <N> record(s) for <topic>-<partition>: <m> ms has passed since batch creation plus linger time

I realize that there are multiple ways of approaching this problem (client vs. broker-side, app-level vs. system-level tuning, etc.), but our current situation is that we have a lot more control over the brokers than the clients, so I'm interested in focusing on what can be done broker-side to make these events less impactful to clients. As an aside: our brokers are currently on Kafka 0.10.2.1 (I know, we're working towards an upgrade, but it's a ways out still), and most clients are on that same version of the client libs.

To that end, I've been trying to understand what happens broker-side on the restored broker immediately following the replica election, and I've found two clues:

* The TcpExtListenOverflows and TcpExtListenDrops counters both spike briefly
* The broker starts sending TCP SYN cookies to clients

Based on my reading (primarily this article: https://blog.cloudflare.com/syn-packet-handling-in-the-wild/), it sounds like these symptoms indicate that the SYN and/or accept queues are overflowing. The sizes of those queues appear to be controlled via the backlog argument to the listen() call that configures the listening socket, and that backlog doesn't appear to be configurable in Kafka. I don't have evidence of this, but I'm speculating that the silent dropping of SYN and/or ACK packets by brokers might be a triggering cause for the hangs seen by some producers.

All of this makes me wonder two things:

* Has anyone else seen issues with preferred replica elections causing producers to get stuck like this? If so, what remediations have folks put in place for these issues?
* Is there a reason that the TCP accept backlog size in the brokers is not configurable? It looks like it just inherits the default of 50 from the JVM (see the sketch below my sign-off). It seems like bumping this queue size is the 'standard' advice for handling legitimate connection bursts more gracefully.

Thanks!
Ben