Kris,

This may be a bit surprising, but handling bootstrap servers, broker failures/retirement, and cluster metadata properly turns out to be very hard to get right!
https://issues.apache.org/jira/browse/KAFKA-1843 explains some of the challenges. https://issues.apache.org/jira/browse/KAFKA-3068 shows the types of issues that can result from trying to recover better from failures, or from situations like yours with a graceful shutdown.

I think https://issues.apache.org/jira/browse/KAFKA-2459 might have addressed the incorrect behavior you are seeing in 0.8.2.1 -- the same bootstrap broker could be selected repeatedly due to incorrect handling of backoff/timeouts. I can't be sure without more info, but it sounds like it could be the same issue. Although part of that fix was rolled back because of KAFKA-3068, I think the relevant part, which fixes the timeout handling, should still be present in 0.9.0.1. If you can reproduce this easily, could you test whether the newer release fixes the issue for you? (A rough sketch of the producer settings involved is below the quoted message.)

-Ewen

On Mon, Feb 22, 2016 at 9:37 PM, Kris K <squareksc...@gmail.com> wrote:

> Hi All,
>
> I saw an issue today wherein the producers (new producers) started to fail
> with org.apache.kafka.common.errors.TimeoutException: Failed to update
> metadata after 60000 ms.
>
> This issue happened when we took down one of the 6 brokers (running
> version 0.8.2.1) for planned maintenance (graceful shutdown).
>
> This broker happens to be the last one in the list of 3 brokers that are
> part of bootstrap.servers.
>
> As per my understanding, the producers should have used the other two
> brokers in the bootstrap.servers list for metadata calls. But this did not
> happen.
>
> Is there any producer property that could have caused this? Any way to
> figure out which broker is being used by producers for metadata calls?
>
> Thanks,
> Kris

--
Thanks,
Ewen
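For reference, a minimal sketch of the kind of new-producer setup under discussion, assuming the Java producer shipped with 0.8.2.x; the broker hostnames, topic name, and class name are placeholders rather than anything taken from this thread. It only shows where bootstrap.servers and the metadata timeout behind the "Failed to update metadata after 60000 ms" error are configured:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BootstrapMetadataSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Three of the six brokers are listed for bootstrapping; hostnames are placeholders.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // 0.8.2.x default: send() blocks up to this long waiting for a metadata
        // update, which is where the "after 60000 ms" in the error comes from.
        props.put("metadata.fetch.timeout.ms", "60000");
        // Backoff settings involved in the timeout/backoff handling that KAFKA-2459 changed.
        props.put("reconnect.backoff.ms", "50");
        props.put("retry.backoff.ms", "100");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            // Placeholder topic; metadata for it is fetched on the first send().
            producer.send(new ProducerRecord<String, String>("test-topic", "key", "value"));
        } finally {
            producer.close();
        }
    }
}

To see which broker a producer actually uses for metadata requests, one option is to enable DEBUG logging for the client classes (for example, log4j.logger.org.apache.kafka.clients=DEBUG in the client's log4j.properties, assuming log4j 1.x); the NetworkClient debug output should show which node each metadata request is sent to.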