On Fri, Oct 25, 2013 at 3:22 PM, Jason Rosenberg <j...@squareup.com> wrote:

> It looks like when the controlled shutdown fails with an IOException, the
> exception is swallowed, and we see nothing in the logs:
>
>     catch {
>       case ioe: java.io.IOException =>
>         channel.disconnect()
>         channel = null
>         // ignore and try again
>     }
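This is not the actual broker code (which is Scala, in kafka.server.KafkaServer); it's a minimal Java sketch of the pattern being suggested — record why the attempt failed before retrying, rather than swallowing the IOException silently. All names here are hypothetical:

```java
import java.io.IOException;

// Hypothetical stand-in for the controlled-shutdown retry path.
// Only illustrates logging the failure cause instead of ignoring it.
public class ControlledShutdownSketch {
    static String lastError = null; // stand-in for a real log4j logger

    static boolean attemptShutdown(boolean simulateIoFailure) {
        try {
            if (simulateIoFailure) {
                throw new IOException("connection reset by peer");
            }
            return true; // controlled shutdown succeeded
        } catch (IOException ioe) {
            // Surface the cause before the retry, so the logs are not silent
            lastError = "Controlled shutdown attempt failed: " + ioe.getMessage();
            return false;
        }
    }

    public static void main(String[] args) {
        if (!attemptShutdown(true)) {
            System.out.println(lastError);
        }
    }
}
```

With something like this in place, a retry after an IOException would at least leave one WARN line explaining the previous failure.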
> Question.....what is the ramification of an 'unclean shutdown'? Is it no
> different than a shutdown with no controlled shutdown ever attempted? Or
> is it something more difficult to recover from?

Unclean shutdown could result in data loss, since you are moving leadership
to a replica that has fallen out of ISR - i.e., its log end offset is behind
the last committed message to this partition.

> I am still not clear on how to generate the state transition logs.....Does
> the StateChangeLogMergerTool run against the main logs for the server (and
> just collates entries there)?

Take a look at the packaged log4j.properties file. The controller's
partition/replica state machines and its channel manager (which sends/receives
LeaderAndIsr requests/responses to brokers) use a stateChangeLogger. The
replica managers on all brokers also use this logger.

> This one eventually succeeds, after a mysterious failure:
>
> 2013-10-25 00:11:53,891 INFO [Thread-13] server.KafkaServer - [Kafka
> Server 10], Starting controlled shutdown
> ....<no exceptions between these log lines><no "Remaining partitions to
> move....">....
> 2013-10-25 00:12:28,965 WARN [Thread-13] server.KafkaServer - [Kafka
> Server 10], Retrying controlled shutdown after the previous attempt
> failed...

Our logging can improve - e.g., it looks like on controller movement we could
retry without saying why.

Thanks,

Joel

> On Fri, Oct 25, 2013 at 12:51 PM, Jason Rosenberg <j...@squareup.com> wrote:
>
>> Neha,
>>
>> It looks like the StateChangeLogMergerTool takes state change logs as
>> input. I'm not sure I know where those live? (Maybe update the doc on
>> that wiki page to describe!)
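The state change logs live wherever log4j routes the stateChangeLogger; in the packaged log4j.properties it goes to a dedicated state-change.log file. A sketch of the relevant section, along the lines of the 0.8-era distribution (appender names and paths may differ in your install):

```
log4j.appender.stateChangeAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.stateChangeAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.stateChangeAppender.File=${kafka.logs.dir}/state-change.log
log4j.appender.stateChangeAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.stateChangeAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

# The controller state machines and replica managers log to this logger
log4j.logger.state.change.logger=TRACE, stateChangeAppender
log4j.additivity.state.change.logger=false
```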
>>
>> Thanks,
>>
>> Jason
>>
>>
>> On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>>
>>> Jason,
>>>
>>> The state change log tool is described here -
>>>
>>> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
>>>
>>> I'm curious what the IOException is and whether we can improve error
>>> reporting. Can you send around the stack trace?
>>>
>>> Thanks,
>>> Neha
>>>
>>>
>>> On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <j...@squareup.com> wrote:
>>>
>>> > Ok,
>>> >
>>> > Looking at the controlled shutdown code, it appears that it can fail
>>> > with an IOException too, in which case it won't report the remaining
>>> > partitions to replicate, etc. (I think that might be what I'm seeing,
>>> > since I never saw the log line for "controlled shutdown failed, X
>>> > remaining partitions", etc.). In my case, that may be the issue (it's
>>> > happening during a rolling restart, and the second of 3 nodes might be
>>> > trying to shut down before the first one has completely come back up).
>>> >
>>> > I've heard you guys mention several times now about controller and
>>> > state change logs. But I don't know where those live (or how to
>>> > configure them). Please advise!
>>> >
>>> > Thanks,
>>> >
>>> > Jason
>>> >
>>> >
>>> > On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>>> >
>>> > > Controlled shutdown can fail if the cluster has a non-zero
>>> > > under-replicated partition count, since that means the leaders may
>>> > > not move off of the broker being shut down, causing controlled
>>> > > shutdown to fail. The backoff might help if the under-replication is
>>> > > just temporary, due to a spike in traffic. This is the most common
>>> > > reason it might fail, besides bugs. But you can check the logs to
>>> > > see why the shutdown failed.
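For reference, the merge tool linked above is run via the kafka-run-class wrapper. A sketch of a typical invocation, per the wiki page — the class name and flags are from the 0.8-era tooling, so verify them against your version, and adjust the log path to wherever your log4j.properties writes state-change.log:

```
bin/kafka-run-class.sh kafka.tools.StateChangeLogMerger \
  --logs /var/kafka/logs/state-change.log \
  --topic mytopic \
  --partitions 0,1,2
```

It collates entries from the state change logs of multiple brokers into a single time-ordered view, which is what makes it useful for reconstructing a leadership transition.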
>>> > >
>>> > > Thanks,
>>> > > Neha
>>> > >
>>> > > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
>>> > >
>>> > > > I'm running into an issue where sometimes the controlled shutdown
>>> > > > fails to complete after the default 3 retry attempts. In one case,
>>> > > > this ended with a broker undergoing an unclean shutdown, and it
>>> > > > was in a rather bad state after restart. Producers would connect
>>> > > > to the metadata vip, still think that this broker was the leader,
>>> > > > fail on that leader, then reconnect to the metadata vip, and get
>>> > > > sent back to that same failed broker! Does that make sense?
>>> > > >
>>> > > > I'm trying to understand the conditions which cause the controlled
>>> > > > shutdown to fail. There doesn't seem to be a setting for the max
>>> > > > amount of time to wait, etc.
>>> > > >
>>> > > > It would be nice to specify how long to try before giving up
>>> > > > (hopefully giving up in a graceful way).
>>> > > >
>>> > > > Instead, we have a retry count, but it's not clear what this retry
>>> > > > count is really specifying, in terms of how long to keep trying,
>>> > > > etc.
>>> > > >
>>> > > > Also, what are the ramifications of different settings for
>>> > > > controlled.shutdown.retry.backoff.ms? Is there a reason we want
>>> > > > to wait before retrying again (again, it would be helpful to
>>> > > > understand the reasons for a controlled shutdown failure)?
>>> > > >
>>> > > > Thanks,
>>> > > >
>>> > > > Jason
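For reference, the knobs discussed in this thread are broker-side settings in server.properties. A sketch of the relevant fragment — the defaults shown are from the 0.8-era docs, so verify them against your version:

```
# Attempt to move leadership off the broker before shutting down
controlled.shutdown.enable=true
# Number of attempts before falling back to unclean shutdown (default 3)
controlled.shutdown.max.retries=3
# Pause between attempts, giving a temporary ISR lag time to catch up
controlled.shutdown.retry.backoff.ms=5000
```

So the retry count bounds total effort only indirectly: the worst-case wait before an unclean shutdown is roughly max.retries times the backoff, plus however long each attempt itself takes.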