Neha,

It looks like the StateChangeLogMergerTool takes state change logs as input. I'm not sure I know where those live? (Maybe update the doc on that wiki page to describe this!)

Thanks,

Jason


On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:

> Jason,
>
> The state change log tool is described here -
>
> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
>
> I'm curious what the IOException is and if we can improve error reporting.
> Can you send around the stack trace?
>
> Thanks,
> Neha
>
>
> On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <j...@squareup.com> wrote:
>
> > Ok,
> >
> > Looking at the controlled shutdown code, it appears that it can fail with
> > an IOException too, in which case it won't report the remaining partitions
> > to replicate, etc. (I think that might be what I'm seeing, since I never
> > saw the log line for "controlled shutdown failed, X remaining partitions",
> > etc.). In my case, that may be the issue (it's happening during a rolling
> > restart, and the second of 3 nodes might be trying to shut down before the
> > first one has completely come back up).
> >
> > I've heard you guys mention the controller and state change logs several
> > times now. But I don't know where those live (or how to configure them).
> > Please advise!
> >
> > Thanks,
> >
> > Jason
> >
> >
> > On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
> >
> > > Controlled shutdown can fail if the cluster has a non-zero
> > > under-replicated partition count, since that means the leaders may not
> > > move off of the broker being shut down. The backoff might help if the
> > > under-replication is just temporary due to a spike in traffic. This is
> > > the most common reason it might fail, besides bugs. But you can check
> > > the logs to see why the shutdown failed.
> > >
> > > Thanks,
> > > Neha
> > >
> > > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
> > >
> > > > I'm running into an issue where sometimes the controlled shutdown
> > > > fails to complete after the default 3 retry attempts. In one case,
> > > > this ended with a broker undergoing an unclean shutdown, and then it
> > > > was in a rather bad state after restart. Producers would connect to
> > > > the metadata vip, still think that this broker was the leader, fail
> > > > on that leader, and then reconnect to the metadata vip and get sent
> > > > back to that same failed broker! Does that make sense?
> > > >
> > > > I'm trying to understand the conditions which cause the controlled
> > > > shutdown to fail. There doesn't seem to be a setting for the max
> > > > amount of time to wait, etc.
> > > >
> > > > It would be nice to specify how long to try before giving up
> > > > (hopefully giving up in a graceful way).
> > > >
> > > > Instead, we have a retry count, but it's not clear what this retry
> > > > count is really specifying, in terms of how long to keep trying, etc.
> > > >
> > > > Also, what are the ramifications of different settings for
> > > > controlled.shutdown.retry.backoff.ms? Is there a reason we want to
> > > > wait before retrying again? (Again, it would be helpful to understand
> > > > the reasons for a controlled shutdown failure.)
> > > >
> > > > Thanks,
> > > >
> > > > Jason
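
On the question of what a retry count plus a backoff means in terms of time: the sketch below is a generic fixed-backoff retry loop, not Kafka's actual shutdown code. The function names and the 5000 ms value are hypothetical, chosen only for illustration; the retry count of 3 is the default mentioned in the thread. Under these assumptions, the time spent waiting between attempts is bounded by roughly retries times backoff, plus however long each attempt itself takes.

```python
import time
from typing import Callable


def retry_with_backoff(
    attempt: Callable[[], bool],
    max_retries: int,
    backoff_ms: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call attempt() up to max_retries times.

    After each failed attempt, pause for backoff_ms before trying again.
    Returns True on the first success, False if every attempt fails.
    (Illustrative only: this is an assumed model, not the broker's code.)
    """
    for _ in range(max_retries):
        if attempt():
            return True
        # Give a transient condition (e.g. temporary under-replication)
        # a chance to clear before the next attempt.
        sleep(backoff_ms / 1000.0)
    return False


def worst_case_backoff_ms(max_retries: int, backoff_ms: int) -> int:
    """Total time spent sleeping if every attempt fails (attempt time excluded)."""
    return max_retries * backoff_ms


if __name__ == "__main__":
    # Hypothetical values: 3 retries, 5000 ms backoff.
    print(worst_case_backoff_ms(3, 5000))
```

Read this way, raising the backoff gives a temporary problem more time to resolve between attempts, while raising the retry count extends how long the loop keeps trying overall.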