Yes, this is something we could consider fixing in Kafka itself. Pretty
much all of these timeouts can be customized when the OS/network defaults
are larger than makes sense for the system. And given how large some of
those default values are, we probably don't want to rely on them.
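
As a concrete illustration, here is roughly what tightening the client-side
timeouts looks like. The broker addresses and values are made up for
illustration, not recommendations, and note the caveat from this thread:
none of these settings bound the TCP connect itself today (that's exactly
the gap) -- they just make everything the client *does* control fail fast:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;

    public class TightTimeoutsProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Hypothetical broker addresses.
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                    "org.apache.kafka.common.serialization.StringSerializer");
            // Fail in-flight requests quickly (does NOT bound the TCP connect).
            props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "15000");
            // Bound how long send()/partitionsFor() may block waiting on metadata.
            props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "30000");
            // Refresh metadata aggressively so terminated brokers age out
            // quickly (Luke's workaround below).
            props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "30000");
            props.put(ProducerConfig.RETRIES_CONFIG, "5");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.close();
        }
    }

A fix in Kafka itself would amount to tracking an explicit deadline on the
non-blocking connect. A minimal sketch of the idea (an assumption about the
shape of such a fix, not the NetworkClient's actual code, which would use a
Selector rather than a polling loop):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.SocketChannel;

    public class BoundedConnect {
        // Connect to host:port, failing after connectTimeoutMs instead of
        // waiting for the OS-level TCP connection timeout.
        static SocketChannel connect(String host, int port, long connectTimeoutMs)
                throws IOException, InterruptedException {
            SocketChannel ch = SocketChannel.open();
            ch.configureBlocking(false);
            ch.connect(new InetSocketAddress(host, port));
            long deadline = System.currentTimeMillis() + connectTimeoutMs;
            while (!ch.finishConnect()) { // real code would use a Selector
                if (System.currentTimeMillis() > deadline) {
                    ch.close();
                    throw new IOException("connect to " + host + ":" + port + " timed out");
                }
                Thread.sleep(10);
            }
            return ch;
        }
    }

And if you go the OS route discussed below, on Linux the relevant knob is
net.ipv4.tcp_syn_retries: the default of 6 gives up after roughly two
minutes, while 3 fails a dead-host connect in about 15 seconds.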

-Ewen

On Mon, Dec 19, 2016 at 8:23 AM, Luke Steensen <luke.steensen@braintreepayments.com> wrote:

> Makes sense, thanks Ewen.
>
> Is this something we could consider fixing in Kafka itself? I don't think
> the producer is necessarily doing anything wrong, but the end result is
> certainly very surprising behavior. It would also be nice not to have to
> coordinate request timeouts, retries, and the max block configuration with
> system-level configs.
>
>
> On Sat, Dec 17, 2016 at 6:55 PM, Ewen Cheslack-Postava <e...@confluent.io> wrote:
>
> > Without having dug back into the code to check, this sounds right.
> > Connection management just fires off a connect request, and subsequent
> > poll() calls handle any successful/failed connections. Request timeouts
> > are handled somewhat differently: the connection attempt isn't
> > explicitly tied to the request that triggered it, so when the latter
> > times out, we don't follow up and time out the connection attempt
> > either.
> >
> > So yes, connection attempts are currently bound only by your underlying
> > TCP connection timeout. This tends to be much more of a problem in
> > public clouds, where the handshake packets can be silently dropped by
> > firewall rules.
> >
> > Using metadata.max.age.ms is a workable solution, but agreed that it's
> > not great. If possible, reducing the default TCP connection timeout
> > isn't unreasonable either -- the defaults are set for WAN connections
> > (and arguably for the WANs of long ago), so much more aggressive
> > timeouts are reasonable for Kafka clusters.
> >
> > -Ewen
> >
> > On Fri, Dec 16, 2016 at 1:41 PM, Luke Steensen <luke.steensen@braintreepayments.com> wrote:
> >
> > > Hello,
> > >
> > > Is it correct that producers do not fail new connection establishment
> > > when it exceeds the request timeout?
> > >
> > > Running on AWS, we've encountered a problem where certain very
> > > low-volume producers end up with metadata stale enough that they
> > > attempt to establish a connection to a broker instance that has
> > > already been terminated as part of a maintenance operation. I would
> > > expect this to fail and be retried normally, but it appears to hang
> > > until the system-level TCP connection timeout is reached (2-3
> > > minutes), with the writes themselves being expired before even a
> > > single attempt is made to send them.
> > >
> > > We've worked around the issue by setting `metadata.max.age.ms`
> > > extremely low, so that these producers request new metadata much
> > > faster than our maintenance operations terminate instances. While
> > > this does work, it seems like an unfortunate workaround for some very
> > > surprising behavior.
> > >
> > > Thanks,
> > > Luke
> > >
> >
>
