Jason,

The state change log tool is described here:
https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool

I'm curious what the IOException is and whether we can improve the error
reporting. Can you send around the stack trace?

Thanks,
Neha

On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <j...@squareup.com> wrote:

> Ok,
>
> Looking at the controlled shutdown code, it appears that it can fail with
> an IOException too, in which case it won't report the remaining partitions
> to replicate, etc. (I think that might be what I'm seeing, since I never
> saw the log line for "controlled shutdown failed, X remaining partitions",
> etc.). In my case, that may be the issue (it's happening during a rolling
> restart, and the second of 3 nodes might be trying to shut down before the
> first one has completely come back up).
>
> I've heard you guys mention the controller and state change logs several
> times now, but I don't know where those live (or how to configure them).
> Please advise!
>
> Thanks,
>
> Jason
>
>
> On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
> > Controlled shutdown can fail if the cluster has a non-zero
> > under-replicated partition count, since that means the leaders may not
> > move off of the broker being shut down. The backoff might help if the
> > under-replication is just temporary due to a spike in traffic. This is
> > the most common reason it might fail, besides bugs. But you can check
> > the logs to see why the shutdown failed.
> >
> > Thanks,
> > Neha
> > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
> >
> > > I'm running into an issue where sometimes the controlled shutdown
> > > fails to complete after the default 3 retry attempts. In one case,
> > > this ended with a broker undergoing an unclean shutdown, and then it
> > > was in a rather bad state after restart. Producers would connect to
> > > the metadata vip, still think that this broker was the leader, and
> > > then fail on that leader, and then reconnect to the metadata vip, and
> > > get sent back to that same failed broker! Does that make sense?
> > >
> > > I'm trying to understand the conditions which cause the controlled
> > > shutdown to fail. There doesn't seem to be a setting for the max
> > > amount of time to wait, etc.
> > >
> > > It would be nice to specify how long to try before giving up
> > > (hopefully giving up in a graceful way).
> > >
> > > Instead, we have a retry count, but it's not clear what this retry
> > > count is really specifying, in terms of how long to keep trying, etc.
> > >
> > > Also, what are the ramifications of different settings for
> > > controlled.shutdown.retry.backoff.ms? Is there a reason we want to
> > > wait before retrying again (again, it would be helpful to understand
> > > the reasons for a controlled shutdown failure)?
> > >
> > > Thanks,
> > >
> > > Jason
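
The interaction between the retry count and the backoff discussed in this thread can be modeled roughly as below. This is only a simplified sketch under stated assumptions, not the broker's actual code: the function name, the `attempt` callable, and the printed messages are hypothetical, and the 5000 ms backoff is just an illustrative value. It shows why a retry count alone does not bound the total time (each attempt takes however long it takes), and why an I/O error leaves no "remaining partitions" figure to report.

```python
import time


def controlled_shutdown(attempt, max_retries=3, backoff_ms=5000, sleep=time.sleep):
    """Simplified model of a retry-with-backoff loop for controlled shutdown.

    `attempt` is a hypothetical callable that returns the number of
    partitions still led by this broker (0 means the shutdown succeeded)
    and may raise IOError if the connection to the controller breaks.
    Returns (succeeded, total_backoff_ms).
    """
    waited_ms = 0
    for _ in range(max_retries):
        try:
            remaining = attempt()
            if remaining == 0:
                return True, waited_ms
            print(f"controlled shutdown failed, {remaining} remaining partitions")
        except IOError as exc:
            # No partition count is available on this path, so nothing can be
            # reported about what remains; the attempt just counts as failed.
            print(f"controlled shutdown attempt hit an I/O error: {exc}")
        # Back off before the next attempt so replicas have time to catch up.
        sleep(backoff_ms / 1000.0)
        waited_ms += backoff_ms
    return False, waited_ms
```

In this model the time spent backing off is at most max_retries * backoff_ms (15 seconds for 3 retries at 5000 ms), on top of however long each attempt itself takes. A longer backoff only helps when the under-replication is temporary, as Neha describes above.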