On Fri, Oct 25, 2013 at 1:18 AM, Jason Rosenberg <j...@squareup.com> wrote:
> I'm running into an issue where sometimes, the controlled shutdown fails to
> complete after the default 3 retry attempts. This ended up in one case,
> with a broker undergoing an unclean shutdown, and then it was in a rather
> bad state after restart. Producers would connect to the metadata vip,
> still think that this broker was the leader, and then fail on that leader,
> and then reconnect to the metadata vip, and get sent back to that same
> failed broker! Does that make sense?
>
> I'm trying to understand the conditions which cause the controlled shutdown
> to fail? There doesn't seem to be a setting for max amount of time to
> wait, etc.

There is a retry interval (controlled.shutdown.retry.backoff.ms).
Controlled shutdown fails when the controller is unable to move
leadership of partitions from the broker being shut down to another
broker. This happens, for instance, when the broker being shut down is
the only in-sync replica for any partition that it leads (i.e., the
follower replicas are out of the ISR). Each attempt reports the
partitions remaining on the broker. If all retries are exhausted, we
proceed to an uncontrolled shutdown, so for the partitions whose
leadership could not be moved we will do an unclean leader election
(https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-Whathappenswhentherearenootherreplicasinthe%22insync%22setforapartition%3F).

If your producers are getting sent back to the same broker, that's weird
(assuming you have a replication factor > 1). You will need to look into
your controller and state change logs to determine what's going on.

> Also, what are the ramifications for different settings for the
> controlled.shutdown.retry.backoff.ms? Is there a reason we want to wait
> before retrying again (again, it would be helpful to understand the reasons
> for a controlled shutdown failure).

Yes - e.g., if a partition is under-replicated because a follower is slow
for any reason, controlled shutdown would fail but may succeed after a
short while once the follower re-enters the ISR. One such reason is a
bounce: when you are doing a rolling bounce and bring up a broker that
was just shut down, the partitions assigned to it will be
under-replicated, but you will already have proceeded to a controlled
shutdown of the next broker in the sequence. That broker may also be
assigned partitions that were on the preceding broker, in which case
those under-replicated partitions would cause controlled shutdown to
fail. However, after the previous broker is fully caught up (in a few ms
or seconds, depending on how long it was down) it will succeed - i.e., if
the retry interval times the number of retries
(controlled.shutdown.max.retries) is large enough to cover the expected
period of under-replication, it will succeed.

There are a couple of other scenarios outlined in the original
controlled shutdown JIRA, which I can look up later in the day if you are
interested.

Joel
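
As a rough worked example of the "retry interval times number of retries"
point above, the sketch below computes the approximate window during which
a broker keeps retrying controlled shutdown, and the retry count needed to
cover an expected period of under-replication. The function names are made
up for illustration, the 5000 ms backoff is an assumed value (check your
broker's configuration), and the estimate ignores the time each attempt
itself takes.

```python
def retry_window_ms(max_retries: int, backoff_ms: int) -> int:
    """Approximate time spent retrying: number of retries times the backoff."""
    return max_retries * backoff_ms


def retries_needed(expected_under_replication_ms: int, backoff_ms: int) -> int:
    """Smallest retry count whose window covers the expected under-replication."""
    # Integer ceiling division, avoiding floating point.
    return -(-expected_under_replication_ms // backoff_ms)


if __name__ == "__main__":
    # 3 retries (the default mentioned in the thread), assumed 5000 ms backoff.
    print(retry_window_ms(3, 5000))
    # Hypothetical case: a bounced broker needs about 30 s to catch up.
    print(retries_needed(30_000, 5000))
```

If the restarted broker typically needs longer than this window to re-enter
the ISR, raising either the retry count or the backoff widens the window.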