Makes sense, thanks Ewen. Is this something we could consider fixing in Kafka itself? I don't think the producer is necessarily doing anything wrong, but the end result is certainly very surprising behavior. It would also be nice not to have to coordinate request timeouts, retries, and the max block configuration with system-level configs.
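
To make the coordination concrete, here's a rough sketch of the client-side knobs in play (broker addresses, topic, and the values themselves are just illustrative, not recommendations):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class LowVolumeProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder hosts
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            // None of these bound the initial TCP connect, which is governed
            // by the OS connection timeout rather than the client:
            props.put("request.timeout.ms", "30000");  // applies once a request is in flight
            props.put("retries", "3");                 // only fires after a send actually fails
            props.put("max.block.ms", "60000");        // bounds send()/metadata waits

            // The workaround from this thread: refresh metadata aggressively
            // so stale broker addresses are replaced before a connect is tried.
            props.put("metadata.max.age.ms", "5000");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            }
        }
    }

The frustrating part is that the last setting is the only one that helps here, and only indirectly.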
On Sat, Dec 17, 2016 at 6:55 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:

> Without having dug back into the code to check, this sounds right.
> Connection management just fires off a request to connect and then
> subsequent poll() calls will handle any successful/failed connections.
> The timeouts wrt requests are handled somewhat differently (the
> connection request isn't explicitly tied to the request that triggered
> it, so when the latter times out, we don't follow up and time out the
> connection request either).
>
> So yes, you currently will have connection attempts tied to your
> underlying TCP connection timeout. This tends to be much more of a
> problem in public clouds, where the handshake request will be silently
> dropped due to firewall rules.
>
> metadata.max.age.ms is a workable solution, but agreed that it's not
> great. If possible, reducing the default TCP connection timeout isn't
> unreasonable either -- the defaults are set for WAN connections (and
> arguably for WAN connections of long ago), so much more aggressive
> timeouts are reasonable for Kafka clusters.
>
> -Ewen
>
> On Fri, Dec 16, 2016 at 1:41 PM, Luke Steensen
> <luke.steensen@braintreepayments.com> wrote:
>
> > Hello,
> >
> > Is it correct that producers do not fail new connection establishment
> > when it exceeds the request timeout?
> >
> > Running on AWS, we've encountered a problem where certain very
> > low-volume producers end up with metadata that's sufficiently stale
> > that they attempt to establish a connection to a broker instance that
> > has already been terminated as part of a maintenance operation. I
> > would expect this to fail and be retried normally, but it appears to
> > hang until the system-level TCP connection timeout is reached (2-3
> > minutes), with the writes themselves being expired before even a
> > single attempt is made to send them.
> >
> > We've worked around the issue by setting `metadata.max.age.ms`
> > extremely low, such that these producers are requesting new metadata
> > much faster than our maintenance operations are terminating instances.
> > While this does work, it seems like an unfortunate workaround for some
> > very surprising behavior.
> >
> > Thanks,
> > Luke
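
P.S. For what it's worth, the 2-3 minutes we observed lines up with the Linux default of net.ipv4.tcp_syn_retries=6: each SYN retransmit doubles the wait, so an unanswered connect() gives up after roughly 1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 seconds. Assuming Linux hosts, dropping tcp_syn_retries to 3 via sysctl would cap connection attempts at about 15 seconds, which seems like the kind of more aggressive timeout Ewen is suggesting.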