The producer did not recover on its own. The app froze for upwards of 10 minutes, long after network connectivity was restored. I could telnet to port 9092 on the broker named in the error logs. After collecting the heap dump I had to kill and restart the app.

The stack trace of the producer-network-thread also looks unremarkable. It doesn't copy/paste cleanly and I don't want to clutter up the list, so I have it in a gist:

https://gist.github.com/hiloboy0119/caddf76c2601549908cf

I've also dumped the RecordAccumulator state into a Google Sheet if anyone wants to double-check how I read it as full:

https://docs.google.com/spreadsheets/d/1Gd3OqctmOEiiwe5WKsLZLsrBaEdDQFa77NVBW9DeXLE

The fields I've highlighted tell me it's full.

I'll work on trying to simulate the failure.

Thanks Gwen!

On Thu, Aug 20, 2015, 5:43 PM Gwen Shapira <g...@confluent.io> wrote:

> It will also be interesting to see stack trace from
> "kafka-producer-network-thread" (which is the one that should be sending
> the batches but maybe got stuck), and if this issue is reproducible for you
> in a test environment - maybe generate logs in TRACE level.
>
> On Thu, Aug 20, 2015 at 5:35 PM, Gwen Shapira <g...@confluent.io> wrote:
>
> > Hi,
> >
> > I didn't see this issue during our network hiccups. You wrote you saw:
> >
> > Got error produce response with correlation id 17717 on topic-partition
> > event.beacon-38, retrying (8 attempts left). Error: NETWORK_EXCEPTION
> >
> > What did you see after? Especially once the network issue was resolved?
> > more retries? was there any successful sends?
> > Producers blocking for a while is expected, but once the issue is resolved
> > we expect the retries to succeed and unblock your producers. Is that what
> > you saw?
> >
> > Gwen
> >
> >
> > On Thu, Aug 20, 2015 at 4:56 PM, Drew Goya <d...@videoamp.com> wrote:
> >
> >> I've been running into an issue with the 0.8.2.1 new producer for a few
> >> weeks now and I haven't been able to figure it out. Hopefully someone on
> >> the list can help!
> >>
> >> First off my producer config looks like this:
> >>
> >> props.put(ProducerConfig.ACKS_CONFIG, "1")
> >> props.put(ProducerConfig.RETRIES_CONFIG, "10")
> >> props.put(ProducerConfig.BLOCK_ON_BUFFER_FULL_CONFIG, "true")
> >> props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
> >> "org.apache.kafka.common.serialization.ByteArraySerializer")
> >> props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
> >> "org.apache.kafka.common.serialization.StringSerializer")
> >> props.put(ProducerConfig.TIMEOUT_CONFIG, "5000")
> >> props.put(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG, "5000")
> >>
> >> During network hiccups between my senders and the brokers I start seeing
> >> these log messages as expected:
> >>
> >> 2015-08-20 20:30:12,231 [kafka-producer-network-thread | producer-1] WARN
> >> org.apache.kafka.common.network.Selector - Error in I/O with
> >> <host>/<ip-address>
> >> java.io.IOException: Connection timed out
> >> at sun.nio.ch.FileDispatcherImpl.$$YJP$$read0(Native Method)
> >>
> >> followed by:
> >>
> >> Got error produce response with correlation id 17717 on topic-partition
> >> event.beacon-38, retrying (8 attempts left). Error: NETWORK_EXCEPTION
> >>
> >> The problem is that even when network connectivity is restored the whole
> >> app hangs. Gathering a heap dump and looking through the RecordAccumulator
> >> I can see that the buffer is full and my producers are blocked
> >> indefinitely.
> >>
> >> Any ideas?
> >>