Controlled shutdown can fail if the cluster has a nonzero under-replicated
partition count, since that means leadership may not be able to move off of
the broker being shut down. The backoff can help if the under-replication
is only temporary, for example due to a spike in traffic. Besides bugs,
this is the most common reason for failure. You can check the logs to see
why the shutdown failed.
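
To give a rough sense of how long the broker keeps trying, here is a minimal
sketch in Python. It assumes the retry count is the
controlled.shutdown.max.retries setting (3, as mentioned in the question) and
that controlled.shutdown.retry.backoff.ms is at its default, which I believe
is 5000 ms; check both against your broker config. The function name is
hypothetical and not part of Kafka.

```python
def max_backoff_ms(max_retries: int, retry_backoff_ms: int) -> int:
    """Upper bound on time spent sleeping between controlled shutdown attempts.

    Assumes at most one backoff sleep per attempt. The time each attempt
    itself takes is not included, so the real wall-clock time is longer.
    """
    if max_retries < 0 or retry_backoff_ms < 0:
        raise ValueError("settings must be non-negative")
    return max_retries * retry_backoff_ms


# Assumed values: 3 retries, 5000 ms backoff.
print(max_backoff_ms(3, 5000))  # 15000
```

So with those values the broker spends at most about 15 seconds in backoff
before giving up. Raising either setting gives a temporary under-replication
more time to clear.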

Thanks,
Neha
On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:

> I'm running into an issue where the controlled shutdown sometimes fails to
> complete after the default 3 retry attempts. In one case this ended with a
> broker undergoing an unclean shutdown, and it was in a rather bad state
> after restart. Producers would connect to the metadata vip, still think
> that this broker was the leader, fail on that leader, then reconnect to
> the metadata vip and get sent back to that same failed broker! Does that
> make sense?
>
> I'm trying to understand the conditions that cause the controlled shutdown
> to fail. There doesn't seem to be a setting for the maximum amount of time
> to wait, etc.
>
> It would be nice to specify how long to try before giving up (hopefully
> giving up in a graceful way).
>
> Instead, we have a retry count, but it's not clear what this retry count is
> really specifying, in terms of how long to keep trying, etc.
>
> Also, what are the ramifications of different settings for
> controlled.shutdown.retry.backoff.ms? Is there a reason we want to wait
> before retrying? (Again, it would be helpful to understand the reasons
> for a controlled shutdown failure.)
>
> Thanks,
>
> Jason
>
