It would also be interesting to see a stack trace from "kafka-producer-network-thread" (the thread that should be sending the batches, but which may have gotten stuck). If the issue is reproducible in a test environment, it may also help to generate logs at TRACE level.
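
In case it helps with capturing that stack trace: running `jstack <pid>` against the producer JVM is the usual way. If attaching from outside is awkward, something like the sketch below could be called from inside the application when it hangs. It is only a minimal sketch; the class name `ThreadStacks` and the method `dump` are made up for illustration, and the only API it relies on is the standard JDK `Thread.getAllStackTraces()`.

```java
import java.util.Map;

// Hypothetical helper (not part of Kafka): collects the stack traces of all
// live threads whose name contains the given fragment.
final class ThreadStacks {

    static String dump(String nameFragment) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            if (!thread.getName().contains(nameFragment)) {
                continue; // not one of the threads we are interested in
            }
            // Header line: thread name and current state (RUNNABLE, WAITING, ...)
            out.append('"').append(thread.getName()).append('"')
               .append(" state=").append(thread.getState()).append('\n');
            // One line per stack frame, innermost frame first
            for (StackTraceElement frame : entry.getValue()) {
                out.append("    at ").append(frame).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints nothing unless a thread with this name exists in the JVM.
        System.out.print(dump("kafka-producer-network-thread"));
    }
}
```

The thread state in the header line is the interesting part: it should show whether the network thread is still running or is parked somewhere.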

On Thu, Aug 20, 2015 at 5:35 PM, Gwen Shapira <g...@confluent.io> wrote:

> Hi,
>
> I didn't see this issue during our network hiccups. You wrote you saw:
>
> Got error produce response with correlation id 17717 on topic-partition
> event.beacon-38, retrying (8 attempts left). Error: NETWORK_EXCEPTION
>
> What did you see after? Especially once the network issue was resolved?
> more retries? was there any successful sends?
> Producers blocking for a while is expected, but once the issue is resolved
> we expect the retries to success and unblock your producers. Is that what
> you saw?
>
> Gwen
>
>
> On Thu, Aug 20, 2015 at 4:56 PM, Drew Goya <d...@videoamp.com> wrote:
>
>> I've been running into an issue with the 0.8.2.1 new producer for a few
>> weeks now and I haven't been able to figure it out. Hopefully someone on
>> the list can help!
>>
>> First off my producer config looks like this:
>>
>> props.put(ProducerConfig.ACKS_CONFIG, "1")
>> props.put(ProducerConfig.RETRIES_CONFIG, "10")
>> props.put(ProducerConfig.BLOCK_ON_BUFFER_FULL_CONFIG, "true")
>> props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
>> "org.apache.kafka.common.serialization.ByteArraySerializer")
>> props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
>> "org.apache.kafka.common.serialization.StringSerializer")
>> props.put(ProducerConfig.TIMEOUT_CONFIG, "5000")
>> props.put(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG, "5000")
>>
>> During network hiccups between my senders and the brokers I start seeing
>> these log messages as expected:
>>
>> 2015-08-20 20:30:12,231 [kafka-producer-network-thread | producer-1] WARN
>> org.apache.kafka.common.network.Selector - Error in I/O with
>> <host>/<ip-address>
>> java.io.IOException: Connection timed out
>> at sun.nio.ch.FileDispatcherImpl.$$YJP$$read0(Native Method)
>>
>> followed by:
>>
>> Got error produce response with correlation id 17717 on topic-partition
>> event.beacon-38, retrying (8 attempts left). Error: NETWORK_EXCEPTION
>>
>> The problem is that even when network connectivity is restored the whole
>> app hangs. Gathering a heap dump and looking through the RecordAccumulator
>> I can see that the buffer is full and my producers are blocked
>> indefinitely.
>>
>> Any ideas?
>>
>
>