Steve, I don't think there is a better solution at the moment. This is an easy issue to miss in unit testing because connections to localhost are generally rejected immediately if nothing is listening on the port. If you're running in an environment where connection attempts to a down host normally hang instead of being rejected, then for now you'll need to wait for the long timeout.
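For reference, here is a rough sketch of the workaround proposed in the quoted thread below (reconnect.backoff.ms longer than N * T). It only builds a java.util.Properties object, so it doesn't need the Kafka client on the classpath. The class and method names are made up for illustration, and the TCP timeout is an assumed input that you would have to measure for your own OS; it is not something Kafka reports.

```java
import java.util.Properties;

// Illustrative sketch only: the class name, method names, and the TCP timeout
// argument are assumptions for this example, not part of the Kafka API.
final class ReconnectBackoffSketch {

    // N: number of non-empty host:port entries in a bootstrap.servers value.
    static int countNodes(String bootstrapServers) {
        int n = 0;
        for (String entry : bootstrapServers.split(",")) {
            if (!entry.trim().isEmpty()) {
                n++;
            }
        }
        return n;
    }

    // Returns a backoff strictly longer than N * T, in milliseconds.
    // assumedTcpTimeoutMs (T) is OS-dependent and must be supplied by the caller.
    static long backoffMs(String bootstrapServers, long assumedTcpTimeoutMs) {
        long nTimesT = Math.multiplyExact((long) countNodes(bootstrapServers), assumedTcpTimeoutMs);
        return Math.addExact(nTimesT, 1L);
    }

    // Builds producer properties with the computed backoff.
    // A real producer also needs key.serializer and value.serializer set.
    static Properties producerProps(String bootstrapServers, long assumedTcpTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("reconnect.backoff.ms",
                Long.toString(backoffMs(bootstrapServers, assumedTcpTimeoutMs)));
        return props;
    }
}
```

For example, producerProps("host1:9092,host2:9092,host3:9092", 21000L) uses a placeholder T of 21 seconds. Keep in mind that a backoff this long also delays reconnecting to a broker that has come back, so treat it as a stopgap rather than a recommended setting.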
https://issues.apache.org/jira/browse/KAFKA-2120 may also alleviate the problem by at least reducing the amount of time for the request to fail. Depending on how adventurous you are, you could try using a version with that patch and maybe adjust the setting lower than its default.

-Ewen

On Wed, Sep 2, 2015 at 10:46 AM, Steve Tian <steve.cs.t...@gmail.com> wrote:

> Would kafka dev kindly give us some advice on this?
>
> Cheers, Steve
>
> On Tue, Sep 1, 2015, 11:20 PM Steve Tian <steve.cs.t...@gmail.com> wrote:
>
> > Thanks, Rahul! In my environment I need to have reconnect.backoff.ms
> > longer than OS default tcp timeout so that NetworkClient can give second
> > node a try.
> >
> > I believe this is related to
> > https://issues.apache.org/jira/browse/KAFKA-2459 .
> >
> > Cheers, Steve
> >
> > On Tue, Sep 1, 2015, 5:24 PM Rahul Jain <rahul...@gmail.com> wrote:
> >
> >> We did notice something similar. When a broker node (out of 3) went down,
> >> metadata calls continued to go to the failed node and producer kept
> >> failing. We were able to make it work by increasing the
> >> reconnect.backoff.ms to 1 second.
> >>
> >> Something similar was discussed earlier -
> >> http://qnalist.com/questions/6002514/new-producer-metadata-update-problem-on-2-node-cluster
> >>
> >> On Mon, Aug 31, 2015 at 11:00 PM, Steve Tian <steve.cs.t...@gmail.com> wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > Is there any concerns to have a long reconnect.backoff.ms for new java
> >> > Kafka producer (0.8.2.0/0.8.2.1)?
> >> >
> >> > Assuming we have bootstrap.servers=host1:port1,host2:port2,host3:port3 and
> >> > host1 is *down* in the very beginning. If a newly created Kafka producer
> >> > decide to choose host1 as first node to connect for metadata update, then
> >> > that producer will keep trying on host1 *only* as default tcp timeout is
> >> > surely longer than default value of reconnect.backoff.ms, which is 10 ms.
> >> >
> >> > I am thinking to have reconnect.backoff.ms longer than N * T where N is
> >> > the number of nodes in bootstrap.servers and T is the default tcp timeout.
> >> > Is there any concerns to have a long reconnect.backoff.ms like that? Any
> >> > better solutions?
> >> >
> >> > Cheers, Steve

--
Thanks,
Ewen