Yes, this lines up with the behavior I'm seeing. I'll wait for the patch to be released and then retest. Thanks!
On Tue, Nov 4, 2014 at 2:36 PM, Guozhang Wang <wangg...@gmail.com> wrote:

> Actually I think this issue has just been resolved:
>
> https://issues.apache.org/jira/browse/KAFKA-1733
>
> Guozhang
>
> On Tue, Nov 4, 2014 at 11:22 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
> > Hello Solon,
> >
> > request.timeout.ms only controls the produce request timeout value, when
> > the producer's first produce request gets timed out, it will try to
> > re-fresh its metadata by sending metadata request. But when this
> > non-produce request hits the broker whose connectivity has been disabled
> > (i.e. trying to re-connect to that broker), it will not be respecting 1 sec
> > timeout.
> >
> > I think this is indeed an issue: basically when we gets a request time out
> > from the broker, we would avoid trying to re-connect to it refreshing
> > metadata. Could you file a JIRA for this?
> >
> > Guozhang
> >
> >
> > On Tue, Nov 4, 2014 at 10:43 AM, Solon Gordon <so...@knewton.com> wrote:
> >
> >> Hi all,
> >>
> >> I've been investigating how Kafka 0.8.1.1 responds to the scenario where
> >> one broker loses connectivity (due to something like a hardware issue or
> >> network partition.) It looks like the brokers themselves adjust within a
> >> few seconds to reassign leaders and shrink ISRs. However, I see producer
> >> threads block for multiple minutes before timing out, regardless of what
> >> producer settings I use. Why would this be?
> >>
> >> Here is my test procedure:
> >> 1. Start up three brokers.
> >> 2. Create a topic with 3 partitions and replication factor 3.
> >> 3. Start up a producer with producer.type=sync, request.required.acks=1,
> >>    request.timeout.ms=1000, message.send.max.retries=0. (With this
> >>    configuration I'd expect all requests to complete or error within a
> >>    second.)
> >> 4. Make the producer send one message per second.
> >> 5. Disable connectivity for one broker via iptables.
> >>
> >> The result is that I see the producer block for almost two minutes before
> >> timing out, way more than the one second timeout I configured. Often I see
> >> that the first request to the bad broker times out after a second as
> >> expected, but a subsequent request takes minutes to time out. I've included
> >> example producer logs below.
> >>
> >> Any idea why this would happen or if there is some config option I'm
> >> missing to prevent it? We would like to be able to recover from this
> >> scenario in seconds, not minutes.
> >>
> >> Thanks,
> >> Solon
> >>
> >>
> >> First request times out after a second:
> >> 17:48:48.602 [Producer timer] DEBUG k.producer.async.DefaultEventHandler - Producer sending messages with correlation id 30 for topics [latency-measurer,0] to broker XXX on YYY:9092
> >> 17:48:49.604 [Producer timer] INFO kafka.producer.SyncProducer - Disconnecting from YYY:9092
> >> 17:48:49.617 [Producer timer] WARN k.producer.async.DefaultEventHandler - Failed to send producer request with correlation id 30 to broker XXX with data for partitions [latency-measurer,0]
> >> java.net.SocketTimeoutException: null
> >>     at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:229) ~[na:1.7.0_55]
> >>     at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[na:1.7.0_55]
> >>     at java.nio.channels.Channels$ReadableByteChannelImpl.read(Channels.java:385) ~[na:1.7.0_55]
> >>     at kafka.utils.Utils$.read(Unknown Source)
> >>     at kafka.network.BoundedByteBufferReceive.readFrom(Unknown Source)
> >>     at kafka.network.Receive$class.readCompletely(Unknown Source)
> >>     at kafka.network.BoundedByteBufferReceive.readCompletely(Unknown Source)
> >>     at kafka.network.BlockingChannel.receive(Unknown Source)
> >>     at kafka.producer.SyncProducer.liftedTree1$1(Unknown Source)
> >>     at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(Unknown Source)
> >>     ...
> >>
> >> The next takes over two minutes:
> >> 17:48:50.602 [Producer timer] DEBUG k.producer.async.DefaultEventHandler - Producer sending messages with correlation id 35 for topics [latency-measurer,0] to broker XXX on YYY:9092
> >> 17:50:57.793 [Producer timer] ERROR kafka.producer.SyncProducer - Producer connection to YYY:9092 unsuccessful
> >> java.net.ConnectException: Connection timed out
> >>     at sun.nio.ch.Net.connect0(Native Method) ~[na:1.7.0_55]
> >>     at sun.nio.ch.Net.connect(Net.java:465) ~[na:1.7.0_55]
> >>     at sun.nio.ch.Net.connect(Net.java:457) ~[na:1.7.0_55]
> >>     at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:670) ~[na:1.7.0_55]
> >>     at kafka.network.BlockingChannel.connect(Unknown Source)
> >>     at kafka.producer.SyncProducer.connect(Unknown Source)
> >>     at kafka.producer.SyncProducer.getOrMakeConnection(Unknown Source)
> >>     ...
> >>
> >
> > --
> > -- Guozhang
>
> --
> -- Guozhang
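
A note on the two log excerpts above, since they fail in different places. The first stack trace is a SocketTimeoutException raised from a read on an already-open connection, so it honors request.timeout.ms=1000. The second is a ConnectException raised from connect, where that setting does not appear to apply and the operating system decides when to give up. Below is a minimal Python sketch of the arithmetic. It assumes the producer host ran Linux with the default tcp_syn_retries=6 and a 1 second initial retransmission timeout; neither is stated in the thread, and the helper names are made up for illustration.

```python
from datetime import datetime


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two same-day log timestamps in HH:MM:SS.mmm form."""
    fmt = "%H:%M:%S.%f"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def syn_backoff_total(retries: int, initial_rto: float = 1.0) -> float:
    """Total wait for one SYN plus `retries` retransmissions,
    assuming the wait doubles after every unanswered attempt."""
    return sum(initial_rto * 2 ** i for i in range(retries + 1))


# Timestamps copied from the two log excerpts in this thread.
first_request = elapsed_seconds("17:48:48.602", "17:48:49.604")
second_request = elapsed_seconds("17:48:50.602", "17:50:57.793")

# Assumption: Linux defaults (tcp_syn_retries=6, 1 s initial timeout).
kernel_connect_timeout = syn_backoff_total(retries=6)

print(f"first request (read timeout):     {first_request:.3f} s")
print(f"second request (connect timeout): {second_request:.3f} s")
print(f"assumed kernel SYN backoff total: {kernel_connect_timeout:.0f} s")
```

If those assumptions hold, the kernel abandons a connect after 1+2+4+8+16+32+64 seconds, which is close to the gap between the DEBUG and ERROR lines in the second excerpt. That would be consistent with the reconnect attempt blocking on the OS-level connect timeout rather than on anything in the producer config.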