I faced the exact same problem recently. The JIRA is filed here:

Please set reconnect.backoff.ms greater than retry.backoff.ms (about 1 sec
more). I think the metadata expired, and when the producer instance tries
to fetch new metadata it tries to connect to the broker that is down.
reconnect.backoff.ms does not take effect here because retry.backoff.ms is
equal to it: by the time the retry fires, the reconnect backoff has already
expired, so the node is not blacklisted and the producer loops, fetching
metadata from the same broker.
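
In case it helps, here is a rough sketch of what I mean, using only
java.util.Properties so it runs without the Kafka client on the classpath.
The class name, the build() helper, the broker addresses, and the 100 ms /
1000 ms values are my own placeholders, not something from the JIRA; acks
and retries are taken from Erik's suggestion below. The only point is that
reconnect.backoff.ms is derived from retry.backoff.ms plus a margin, so the
two can never end up equal.

```java
import java.util.Properties;

public class ProducerBackoffConfig {

    // Hypothetical helper: keeps reconnect.backoff.ms strictly above
    // retry.backoff.ms by adding a margin (about 1 second, as suggested above).
    static Properties build(String bootstrapServers, long retryBackoffMs, long marginMs) {
        if (marginMs <= 0) {
            throw new IllegalArgumentException("marginMs must be positive");
        }
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "1");
        props.put("retries", "3");
        props.put("retry.backoff.ms", Long.toString(retryBackoffMs));
        props.put("reconnect.backoff.ms", Long.toString(retryBackoffMs + marginMs));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder broker list; replace with your own cluster.
        Properties props = build("broker1:9092,broker2:9092,broker3:9092", 100, 1000);
        System.out.println("retry.backoff.ms=" + props.getProperty("retry.backoff.ms"));
        System.out.println("reconnect.backoff.ms=" + props.getProperty("reconnect.backoff.ms"));
        // In the real application these properties (plus the serializer
        // settings) would be passed to the KafkaProducer constructor.
    }
}
```

With the values above, retry.backoff.ms comes out as 100 and
reconnect.backoff.ms as 1100.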

On Fri, Aug 28, 2015 at 8:39 AM, Helleren, Erik <erik.helle...@cmegroup.com>

> Hi Alexey,
> So, a couple things.  Your config seems to have some issues that would
> result in long wait times,
> You should try this configuration and see if you still have the issue:
> acks=1
> compression.type=snappy
> retries=3 #Retry a few times to make it so they don't get dropped when a
> broker fails, at least not right away
> batch_size= 32768
> buffer.memory=67108864
> linger.ms=1500
> metadata.fetch.timeout.ms=60000 # Default to give zookeeper a lot of time
> to return the metadata
> timeout.ms= 10000 #give kafka some time to respond before you consider it
> a failure.
> retry.backoff.ms=100 # Default. Keep this small so the producer fails
> quickly enough times to know a broker is down
> reconnect.backoff.ms=10 # Default. Same reason as above
> Hopefully the explanations of the changes make sense.  At the very least,
> I would try changing retries up to 2 first.  Also, what is your topic's
> configuration?
> -Erik
> On 8/28/15, 8:36 AM, "Alexey Sverdelov" <alexey.sverde...@googlemail.com>
> wrote:
> >Hi everyone,
> >
> >we run load tests against our web application (about 50K req/sec) and
> >every time a kafka broker dies (including on controlled shutdown), the
> >producer tries to connect to the dead broker for about 10-15 minutes.
> >During this time the application monitoring shows a constant error rate
> >(about 1/10 of all kafka writes fail).
> >
> >Our spec:
> >
> >* web-app in tomcat writes to kafka
> >* 3 node kafka cluster
> >* kafka 0.8.2
> >* new producer
> >
> >The producer config:
> >
> >acks=1
> >compression.type=snappy
> >retries=0
> >batch_size=32768
> >buffer.memory=67108864
> >linger.ms=1500
> >metadata.fetch.timeout.ms=5000
> >timeout.ms= 1500
> >retry.backoff.ms=10000
> >reconnect.backoff.ms=10000
> >
> >I can poll our Zookeeper and check if all brokers are alive, but I think
> >KafkaProducer checks it already.
> >
> >Alexey

