Hi!

I help operate a Kafka cluster with a large number of clients (2-10k
established connections to the Kafka port at any given time on a given
broker).

When we take one of these brokers down for maintenance using a controlled
shutdown, all is fine. Bringing it back online is similarly fine - the
broker re-joins the cluster and gets back in sync quickly. However, when we
initiate a preferred replica election, a minority of producers get stuck
in a state where they cannot produce to the restored broker, emitting
errors like this:

Expiring <N> record(s) for <topic>-<partition>: <m> ms has passed since
batch creation plus linger time
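
For reference (we don't control most of the clients, so this is mostly to
show which knobs are in play rather than something we can roll out), the
client-side settings involved look roughly like the sketch below. All values
and the broker address are illustrative, not what our producers actually run:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class ProducerTimeoutSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            // As far as I understand the 0.10.x producer, batches sitting in the
            // accumulator are expired once request.timeout.ms has elapsed beyond
            // batch creation plus linger time, which is where the "Expiring N
            // record(s)" error above comes from.
            props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000"); // illustrative
            props.put(ProducerConfig.LINGER_MS_CONFIG, "5");              // illustrative
            props.put(ProducerConfig.RETRIES_CONFIG, "3");                // illustrative
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.close();
        }
    }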

I realize that there are multiple ways of approaching this problem (client
vs. broker-side, app-level vs. system-level tuning, etc.), but our current
situation is that we've got a lot more control over the brokers than the
clients, so I'm interested in focusing on what can be done broker-side to
make these events less impactful to clients. As an aside: our brokers are
currently on Kafka 0.10.2.1 (I know, we're working towards an upgrade, but
it's a ways out still), and most clients are on that same version of the
client libs.

To that end, I've been trying to understand what happens broker-side on the
restored broker immediately following the replica election, and I've found
two clues:

* The TcpExtListenOverflows and TcpExtListenDrops counters both spike
briefly
* The broker starts sending TCP SYN cookies to clients
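
In case it's useful, those counters come from /proc/net/netstat (nstat and
netstat -s report the same values, with nstat adding the TcpExt prefix).
Here's a minimal sketch of reading them directly from that file:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;

    public class ListenQueueCounters {
        public static void main(String[] args) throws IOException {
            // /proc/net/netstat holds pairs of lines per group: a header line
            // ("TcpExt: SyncookiesSent ... ListenOverflows ListenDrops ...")
            // followed by a matching line of values.
            List<String> lines = Files.readAllLines(Paths.get("/proc/net/netstat"));
            String header = null;
            String values = null;
            for (String line : lines) {
                if (line.startsWith("TcpExt:")) {
                    if (header == null) {
                        header = line;
                    } else {
                        values = line;
                        break;
                    }
                }
            }
            if (header == null || values == null) {
                return;
            }
            String[] names = header.split("\\s+");
            String[] vals = values.split("\\s+");
            List<String> wanted =
                    Arrays.asList("SyncookiesSent", "ListenOverflows", "ListenDrops");
            for (int i = 1; i < names.length && i < vals.length; i++) {
                if (wanted.contains(names[i])) {
                    System.out.println(names[i] + " = " + vals[i]);
                }
            }
        }
    }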

Based on my reading (primarily this article:
https://blog.cloudflare.com/syn-packet-handling-in-the-wild/), it sounds
like these symptoms indicate that the SYN and/or accept queues are
overflowing. The sizes of those queues are governed largely by the backlog
argument passed to the listen() call on the listening socket, and that
backlog doesn't appear to be configurable in Kafka.
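
For illustration, here's a minimal standalone sketch (not Kafka's actual
SocketServer code, just the JDK API it ultimately goes through) of how the
backlog gets set at bind time:

    import java.net.InetSocketAddress;
    import java.nio.channels.ServerSocketChannel;

    public class BacklogSketch {
        public static void main(String[] args) throws Exception {
            // Binding without an explicit backlog: the JDK falls back to its
            // default (50 in the implementations I've looked at), which becomes
            // the backlog argument to listen(2). Port 0 is used here only so
            // the sketch runs anywhere.
            ServerSocketChannel defaultBacklog = ServerSocketChannel.open();
            defaultBacklog.socket().bind(new InetSocketAddress(0));

            // Binding with an explicit backlog: the accept queue can be sized
            // for connection bursts (the kernel still caps the value at
            // net.core.somaxconn).
            ServerSocketChannel largerBacklog = ServerSocketChannel.open();
            largerBacklog.socket().bind(new InetSocketAddress(0), 1024);

            defaultBacklog.close();
            largerBacklog.close();
        }
    }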

I don't have evidence of this, but I'm speculating that the silent dropping
of SYN and/or ACK packets by brokers might be a triggering cause for the
hangs seen by some producers.

All of this makes me wonder two things:

* Has anyone else seen issues with preferred replica elections causing
producers to get stuck like this? If so, what remediations have folks put in
place for these issues?
* Is there a reason that the TCP accept backlog size in the brokers is not
configurable? It looks like it just inherits the default of 50 from the
JVM. It seems like bumping this queue size is the 'standard' advice given
for handling legitimate connection bursts more gracefully.

Thanks!
Ben
