Kris,

This is a bit surprising, but handling bootstrap servers, broker
failures/retirement, and cluster metadata correctly turns out to be very
hard to get right!

https://issues.apache.org/jira/browse/KAFKA-1843 explains some of the
challenges. https://issues.apache.org/jira/browse/KAFKA-3068 shows the
kinds of issues that can result from trying to recover better from
failures, or from a graceful shutdown like yours.

I think https://issues.apache.org/jira/browse/KAFKA-2459 might have
addressed the incorrect behavior you are seeing in 0.8.2.1 -- the same
bootstrap broker could keep getting selected due to incorrect handling of
backoffs/timeouts. I can't be sure without more info, but it sounds like it
could be the same issue. Although part of that fix was rolled back because
of KAFKA-3068, I think the relevant part, which fixes the timeout handling,
should still be present in 0.9.0.1. If you can reproduce this easily, could
you test whether the newer release fixes the issue for you?
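
If it helps, something like the rough sketch below is what I'd use to try
to reproduce it against 0.9.0.1 -- the broker host names and topic are
placeholders for your own, and max.block.ms is the property behind the
60000 ms in that error message on 0.9.x (metadata.fetch.timeout.ms plays
that role on 0.8.2.x, if I remember right):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class MetadataRepro {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Same three bootstrap brokers you list today; gracefully shut
            // down the last one while the loop below is running.
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // How long send() will block waiting for metadata before throwing
            // the TimeoutException you saw.
            props.put("max.block.ms", "60000");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                while (true) {
                    producer.send(new ProducerRecord<>("test-topic", "key", "value")).get();
                    Thread.sleep(1000);
                }
            }
        }
    }

Bumping the client logging up to DEBUG for the org.apache.kafka.clients
package should also show which broker each metadata request is being sent
to, which would answer your second question.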

-Ewen

On Mon, Feb 22, 2016 at 9:37 PM, Kris K <squareksc...@gmail.com> wrote:

> Hi All,
>
> I saw an issue today wherein the producers (new producers) started to fail
> with org.apache.kafka.common.errors.TimeoutException: Failed to update
> metadata after 60000 ms.
>
> This issue happened when we took down one of the 6 brokers (running version
> 0.8.2.1) for planned maintenance (graceful shutdown).
>
> This broker happens to be the last one in the list of 3 brokers that are
> part of bootstrap.servers.
>
> As per my understanding, the producers should have used the other two
> brokers in the bootstrap.servers list for metadata calls. But this did not
> happen.
>
> Is there any producer property that could have caused this? Any way to
> figure out which broker is being used by producers for metadata calls?
>
> Thanks,
> Kris
>



-- 
Thanks,
Ewen
