The producer did not recover on its own.  The app stayed frozen for upwards
of 10 minutes, long after network connectivity was restored.  I could telnet
to port 9092 on the broker named in the error logs.  After collecting the
heap dump I had to kill and restart the app.

The stack trace of the producer-network-thread also looks unremarkable.  It
doesn't copy/paste cleanly and I don't want to clutter up the list, so I
have it in a gist:

https://gist.github.com/hiloboy0119/caddf76c2601549908cf

I've also dumped the RecordAccumulator state into a Google Sheet if anyone
wants to double-check how I concluded that it is full:

https://docs.google.com/spreadsheets/d/1Gd3OqctmOEiiwe5WKsLZLsrBaEdDQFa77NVBW9DeXLE

The fields I've highlighted tell me it's full.

I'll work on simulating the failure.
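
A minimal sketch of what such a simulation could look like without a broker,
assuming a counting semaphore is a fair stand-in for the producer's bounded
buffer.  BufferFullSketch, tryAppend and release are hypothetical names, not
Kafka APIs.  The timeout is there only so the sketch terminates; as I
understand it, with block.on.buffer.full=true the real producer waits with no
timeout at all.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for the producer's bounded record buffer; not Kafka code. */
public class BufferFullSketch {
    // One permit per free byte in the simulated buffer.
    private final Semaphore free;

    public BufferFullSketch(int capacityBytes) {
        this.free = new Semaphore(capacityBytes);
    }

    /** Reserve space for a record; returns false if the buffer stayed full for maxWaitMs. */
    public boolean tryAppend(int recordBytes, long maxWaitMs) throws InterruptedException {
        return free.tryAcquire(recordBytes, maxWaitMs, TimeUnit.MILLISECONDS);
    }

    /** What the simulated network thread would do once a batch completes. */
    public void release(int recordBytes) {
        free.release(recordBytes);
    }

    public int available() {
        return free.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        BufferFullSketch buffer = new BufferFullSketch(1024);

        // Nothing drains the buffer here, as if the network thread were stuck.
        int accepted = 0;
        while (buffer.tryAppend(256, 100)) {
            accepted++;
        }
        System.out.println("accepted=" + accepted + " freeBytes=" + buffer.available());

        // Once space is handed back, appends go through again.
        buffer.release(256);
        System.out.println("appendAfterRelease=" + buffer.tryAppend(256, 100));
    }
}
```

If the real sender never releases space after connectivity returns, every
caller stays parked in the equivalent of tryAppend, which would match what
the heap dump shows.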

Thanks Gwen!

On Thu, Aug 20, 2015, 5:43 PM Gwen Shapira <g...@confluent.io> wrote:

> It will also be interesting to see stack trace from
> "kafka-producer-network-thread" (which is the one that should be sending
> the batches but maybe got stuck), and if this issue is reproducible for you
> in a test environment - maybe generate logs in TRACE level.
>
> On Thu, Aug 20, 2015 at 5:35 PM, Gwen Shapira <g...@confluent.io> wrote:
>
> > Hi,
> >
> > I didn't see this issue during our network hiccups. You wrote you saw:
> >
> > Got error produce response with correlation id 17717 on topic-partition
> > event.beacon-38, retrying (8 attempts left). Error: NETWORK_EXCEPTION
> >
> > What did you see after? Especially once the network issue was resolved?
> > More retries? Were there any successful sends?
> > Producers blocking for a while is expected, but once the issue is
> > resolved we expect the retries to succeed and unblock your producers.
> > Is that what you saw?
> >
> > Gwen
> >
> >
> > On Thu, Aug 20, 2015 at 4:56 PM, Drew Goya <d...@videoamp.com> wrote:
> >
> >> I've been running into an issue with the 0.8.2.1 new producer for a few
> >> weeks now and I haven't been able to figure it out.  Hopefully someone
> >> on the list can help!
> >>
> >> First off my producer config looks like this:
> >>
> >>     props.put(ProducerConfig.ACKS_CONFIG, "1")
> >>     props.put(ProducerConfig.RETRIES_CONFIG, "10")
> >>     props.put(ProducerConfig.BLOCK_ON_BUFFER_FULL_CONFIG, "true")
> >>     props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
> >>       "org.apache.kafka.common.serialization.ByteArraySerializer")
> >>     props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
> >>       "org.apache.kafka.common.serialization.StringSerializer")
> >>     props.put(ProducerConfig.TIMEOUT_CONFIG, "5000")
> >>     props.put(ProducerConfig.METADATA_FETCH_TIMEOUT_CONFIG, "5000")
> >>
> >> During network hiccups between my senders and the brokers I start
> >> seeing these log messages as expected:
> >>
> >> 2015-08-20 20:30:12,231 [kafka-producer-network-thread | producer-1] WARN
> >>  org.apache.kafka.common.network.Selector - Error in I/O with
> >> <host>/<ip-address>
> >> java.io.IOException: Connection timed out
> >>         at sun.nio.ch.FileDispatcherImpl.$$YJP$$read0(Native Method)
> >>
> >> followed by:
> >>
> >> Got error produce response with correlation id 17717 on topic-partition
> >> event.beacon-38, retrying (8 attempts left). Error: NETWORK_EXCEPTION
> >>
> >> The problem is that even when network connectivity is restored the whole
> >> app hangs.  Gathering a heap dump and looking through the RecordAccumulator
> >> I can see that the buffer is full and my producers are blocked indefinitely.
> >>
> >> Any ideas?
> >>
> >
> >
>
