On Fri, Oct 25, 2013 at 1:18 AM, Jason Rosenberg <j...@squareup.com> wrote:
> I'm running into an issue where sometimes, the controlled shutdown fails to
> complete after the default 3 retry attempts.  This ended up in one case,
> with a broker undergoing an unclean shutdown, and then it was in a rather
> bad state after restart.  Producers would connect to the metadata vip,
> still think that this broker was the leader, and then fail on that leader,
> and then reconnect to the metadata vip, and get sent back to that same
> failed broker!   Does that make sense?
>
> I'm trying to understand the conditions which cause the controlled shutdown
> to fail?  There doesn't seem to be a setting for max amount of time to
> wait, etc.

There is a retry interval (controlled.shutdown.retry.backoff.ms).
Controlled shutdown fails when the controller is unable to move
leadership of partitions from the broker being shut down to another
broker. This happens, for instance, when the broker being shut down is
the only in-sync replica for any partition that it leads (i.e., if the
follower replicas are out of ISR). Each attempt will report the
partitions remaining on the broker. If all retries are exhausted then
we proceed to an uncontrolled shutdown. So for the partitions for
which leadership could not be moved we will do an unclean leader
election
(https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-Whathappenswhentherearenootherreplicasinthe%22insync%22setforapartition%3F).
If your producers are getting sent back to the same broker, that's
weird (assuming you have a replication factor > 1). You will need to
look into your controller and state change logs to determine what's
going on.
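
To make the retry behaviour above concrete, here is a minimal Python
sketch. It is not Kafka's actual code. The function name
controlled_shutdown, the try_move_leadership callback, and the
injectable sleep are hypothetical names made up for illustration; only
the default of 3 retries comes from this thread.

```python
import time
from typing import Callable, Set


def controlled_shutdown(
    try_move_leadership: Callable[[], Set[str]],
    backoff_ms: int,
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical model of the retry loop, not Kafka's implementation.

    try_move_leadership() stands in for one shutdown attempt and returns
    the set of partitions still led by the broker being shut down.
    Returns True if leadership was fully moved, False if retries ran out
    (the caller would then fall back to an uncontrolled shutdown).
    """
    for attempt in range(1, max_retries + 1):
        remaining = try_move_leadership()
        if not remaining:
            return True  # nothing left to move: clean shutdown is safe
        # Each failed attempt reports the partitions remaining on the broker.
        print(f"attempt {attempt}: partitions remaining: {sorted(remaining)}")
        if attempt < max_retries:
            sleep(backoff_ms / 1000.0)  # wait for followers to re-enter ISR
    return False
```

A False return corresponds to the case where the retries are exhausted
and the partitions left behind go through unclean leader election.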

>
> Also, what are the ramifications for different settings for the
> controlled.shutdown.retry.backoff.ms?  Is there a reason we want to wait
> before retrying again (again, it would be helpful to understand the reasons
> for a controlled shutdown failure).

Yes - e.g., if a partition is under-replicated because a follower is
slow for any reason, controlled shutdown would fail but may succeed
after a short while once the follower re-enters ISR. One such reason
is a bounce: when you are doing a rolling bounce and bring up a broker
that was just shut down, the partitions assigned to it will be
under-replicated, but you will already have proceeded to a controlled
shutdown of the next broker in the sequence. That broker may also be
assigned partitions that were on the preceding broker in the sequence,
in which case those under-replicated partitions would cause controlled
shutdown to fail. However, once the previous broker is fully caught up
(in a few ms or seconds, depending on how long it was down) it will
succeed, i.e., if the retry interval times the number of retries
(controlled.shutdown.max.retries) is large enough to cover the
expected period of under-replication. There are a couple of other
scenarios outlined in the original controlled shutdown jira, which I
can look up later in the day if you are interested.
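
As a rough back-of-the-envelope check of the "retry interval times
number of retries" rule, something like the following Python sketch
could be used. The function names and the example numbers are
hypothetical, and the product is only an approximation of the real
window, since it ignores the time each attempt itself takes.

```python
def retry_window_ms(max_retries: int, backoff_ms: int) -> int:
    """Approximate total time covered by the controlled shutdown retries."""
    return max_retries * backoff_ms


def covers_underreplication(
    max_retries: int, backoff_ms: int, expected_catchup_ms: int
) -> bool:
    """True if the retry window is at least as long as the time the
    previously bounced broker is expected to need to rejoin the ISR."""
    return retry_window_ms(max_retries, backoff_ms) >= expected_catchup_ms


# Example with made-up numbers: 3 retries, 5 s apart, 10 s expected catch-up.
print(covers_underreplication(3, 5000, 10_000))
```

If the check comes out False for your expected catch-up time, either
value can be raised, or the rolling bounce can wait for the previous
broker to rejoin the ISR before moving on to the next one.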

Joel
