Creating a new consumer instance *does not* solve this problem. Attaching the producer/consumer code that I used for testing.
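
For reference, here is a minimal sketch of the reconnect.backoff.ms change mentioned further down the thread. It is not the attached test code. The class name, helper name, and broker addresses are placeholders invented for illustration, and plain string keys are used so the snippet compiles without the Kafka client jar. In a real client these properties would be passed to the KafkaProducer constructor.

```java
import java.util.Properties;

public class ProducerConfigSketch {

    // Hypothetical helper: collects new-producer settings as plain strings.
    static Properties buildProducerProps(String bootstrapServers) {
        Properties props = new Properties();
        // Nodes contacted for the initial metadata fetch.
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Value reported in the thread as making the problem go away.
        props.put("reconnect.backoff.ms", "1000");
        return props;
    }

    public static void main(String[] args) {
        // Placeholder addresses; substitute the brokers of the cluster under test.
        Properties props = buildProducerProps("broker1:9092,broker2:9092");
        props.list(System.out);
    }
}
```

Whether a longer backoff addresses the root cause or only makes the failure less likely is not established in the thread.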

On Wed, May 6, 2015 at 6:31 AM, Ewen Cheslack-Postava <e...@confluent.io> wrote:

> I'm not sure about the old producer behavior in this same failure scenario,
> but creating a new producer instance would resolve the issue since it would
> start with the list of bootstrap nodes and, assuming at least one of them
> was up, it would be able to fetch up to date metadata.
>
> On Tue, May 5, 2015 at 5:32 PM, Jason Rosenberg <j...@squareup.com> wrote:
>
> > Can you clarify, is this issue here specific to the "new" producer? With
> > the "old" producer, we routinely construct a new producer which makes a
> > fresh metadata request (via a VIP connected to all nodes in the cluster).
> > Would this approach work with the new producer?
> >
> > Jason
> >
> > On Tue, May 5, 2015 at 1:12 PM, Rahul Jain <rahul...@gmail.com> wrote:
> >
> > > Mayuresh,
> > > I was testing this in a development environment and manually brought
> > > down a node to simulate this. So the dead node never came back up.
> > >
> > > My colleague and I were able to consistently see this behaviour several
> > > times during the testing.
> > >
> > > On 5 May 2015 20:32, "Mayuresh Gharat" <gharatmayures...@gmail.com> wrote:
> > >
> > > > I agree that to find the least Loaded node the producer should fall
> > > > back to the bootstrap nodes if its not able to connect to any nodes
> > > > in the current metadata. That should resolve this.
> > > >
> > > > Rahul, I suppose the problem went off because the dead node in your
> > > > case might have came back up and allowed for a metadata update. Can
> > > > you confirm this?
> > > >
> > > > Thanks,
> > > >
> > > > Mayuresh
> > > >
> > > > On Tue, May 5, 2015 at 5:10 AM, Rahul Jain <rahul...@gmail.com> wrote:
> > > >
> > > > > We observed the exact same error. Not very clear about the root
> > > > > cause although it appears to be related to leastLoadedNode
> > > > > implementation.
> > > > > Interestingly, the problem went away by increasing the value of
> > > > > reconnect.backoff.ms to 1000ms.
> > > > >
> > > > > On 29 Apr 2015 00:32, "Ewen Cheslack-Postava" <e...@confluent.io> wrote:
> > > > >
> > > > > > Ok, all of that makes sense. The only way to possibly recover
> > > > > > from that state is either for K2 to come back up allowing the
> > > > > > metadata refresh to eventually succeed or to eventually try some
> > > > > > other node in the cluster. Reusing the bootstrap nodes is one
> > > > > > possibility. Another would be for the client to get more metadata
> > > > > > than is required for the topics it needs in order to ensure it
> > > > > > has more nodes to use as options when looking for a node to fetch
> > > > > > metadata from. I added your description to KAFKA-1843, although
> > > > > > it might also make sense as a separate bug since fixing it could
> > > > > > be considered incremental progress towards resolving 1843.
> > > > > >
> > > > > > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy <ku...@nmsworks.co.in> wrote:
> > > > > >
> > > > > > > Hi Ewen,
> > > > > > >
> > > > > > > Thanks for the response. I agree with you, In some case we
> > > > > > > should use bootstrap servers.
> > > > > > >
> > > > > > > > If you have logs at debug level, are you seeing this message
> > > > > > > > in between the connection attempts:
> > > > > > > >
> > > > > > > > Give up sending metadata request since no node is available
> > > > > > >
> > > > > > > Yes, this log came for couple of times.
> > > > > > >
> > > > > > > > Also, if you let it continue running, does it recover after
> > > > > > > > the metadata.max.age.ms timeout?
> > > > > > >
> > > > > > > It does not reconnect. It is continuously trying to connect
> > > > > > > with dead node.
> > > > > > >
> > > > > > > -Manikumar
> > > > > >
> > > > > > --
> > > > > > Thanks,
> > > > > > Ewen
> > > >
> > > > --
> > > > -Regards,
> > > > Mayuresh R. Gharat
> > > > (862) 250-7125
>
> --
> Thanks,
> Ewen