It looks like when the controlled shutdown fails with an IOException, the exception is swallowed, and we see nothing in the logs:
    catch {
      case ioe: java.io.IOException =>
        channel.disconnect()
        channel = null
        // ignore and try again
    }

So, I don't really have visibility into why controlled shutdown fails when it does.....
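Even just logging the exception before the retry would make this debuggable. A minimal sketch of what I mean (KafkaServer already mixes in kafka.utils.Logging, so its warn(msg, e) helper should be in scope; this is illustrative, not a tested patch):

    catch {
      case ioe: java.io.IOException =>
        // log the cause instead of dropping it silently, then reset and retry
        warn("Controlled shutdown attempt failed with an IOException, will retry", ioe)
        channel.disconnect()
        channel = null
    }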
Below are some logging examples. The first one eventually succeeds, after a mysterious error. The second one exhibits the issue suggested ("no other replicas in ISR").

This one eventually succeeds, after a mysterious failure:

2013-10-25 00:11:53,891 INFO [Thread-13] server.KafkaServer - [Kafka Server 10], Starting controlled shutdown
....<no exceptions between these log lines><no "Remaining partitions to move....">....
2013-10-25 00:12:28,965 WARN [Thread-13] server.KafkaServer - [Kafka Server 10], Retrying controlled shutdown after the previous attempt failed...
....
2013-10-25 00:12:56,623 INFO [Thread-13] server.KafkaServer - [Kafka Server 10], Controlled shutdown succeeded

This one ultimately fails, and proceeds with an unclean shutdown:

2013-10-25 20:39:10,350 INFO [Thread-12] server.KafkaServer - [Kafka Server 11], Starting controlled shutdown
...<lots of exceptions like this><no "Remaining partitions to move....">....
2013-10-25 20:39:40,735 ERROR [kafka-request-handler-4] change.logger - Controller 11 epoch 187 encountered error while electing leader for partition [topicX,0] due to: No other replicas in ISR 11 for [topicX,0] besides current leader 11 and shutting down brokers 11.
2013-10-25 20:39:40,735 ERROR [kafka-request-handler-4] change.logger - Controller 11 epoch 187 initiated state change for partition [topicX,0] from OnlinePartition to OnlinePartition failed
kafka.common.StateChangeFailedException: encountered error while electing leader for partition [topicX,0] due to: No other replicas in ISR 11 for [topicX,0] besides current leader 11 and shutting down brokers 11.
        at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:328)
        at kafka.controller.PartitionStateMachine.kafka$controller$PartitionStateMachine$$handleStateChange(PartitionStateMachine.scala:155)
        at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:111)
        at kafka.controller.PartitionStateMachine$$anonfun$handleStateChanges$2.apply(PartitionStateMachine.scala:110)
        at scala.collection.immutable.Set$Set1.foreach(Set.scala:81)
        at kafka.controller.PartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:110)
        at kafka.controller.KafkaController$$anonfun$shutdownBroker$4$$anonfun$apply$2.apply(KafkaController.scala:188)
        at kafka.controller.KafkaController$$anonfun$shutdownBroker$4$$anonfun$apply$2.apply(KafkaController.scala:184)
        at scala.Option.foreach(Option.scala:121)
        at kafka.controller.KafkaController$$anonfun$shutdownBroker$4.apply(KafkaController.scala:184)
        at kafka.controller.KafkaController$$anonfun$shutdownBroker$4.apply(KafkaController.scala:180)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:57)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:43)
        at kafka.controller.KafkaController.shutdownBroker(KafkaController.scala:180)
        at kafka.server.KafkaApis.handleControlledShutdownRequest(KafkaApis.scala:133)
        at kafka.server.KafkaApis.handle(KafkaApis.scala:72)
        at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:42)
        at java.lang.Thread.run(Thread.java:662)
Caused by: kafka.common.StateChangeFailedException: No other replicas in ISR 11 for [topicX,0] besides current leader 11 and shutting down brokers 11
        at kafka.controller.ControlledShutdownLeaderSelector.selectLeader(PartitionLeaderSelector.scala:177)
        at kafka.controller.PartitionStateMachine.electLeaderForPartition(PartitionStateMachine.scala:304)
        ... 17 more
...
2013-10-25 20:39:45,404 WARN [Thread-12] server.KafkaServer - [Kafka Server 11], Retrying controlled shutdown after the previous attempt failed...
<lots more of the StateChangeFailedExceptions><no "Remaining partitions to move....">....
2013-10-25 20:40:20,499 WARN [Thread-12] server.KafkaServer - [Kafka Server 11], Retrying controlled shutdown after the previous attempt failed...
<lots more of the StateChangeFailedExceptions><no "Remaining partitions to move....">....
2013-10-25 20:40:55,598 WARN [Thread-12] server.KafkaServer - [Kafka Server 11], Retrying controlled shutdown after the previous attempt failed...
2013-10-25 20:40:55,598 WARN [Thread-12] server.KafkaServer - [Kafka Server 11], Proceeding to do an unclean shutdown as all the controlled shutdown attempts failed

So, this would seem to indicate the issue described previously (no other leader available for the partition, so unclean shutdown)..... I'll try adding more retries and a longer back-off delay (concrete settings below).....

Question.....what are the ramifications of an 'unclean shutdown'? Is it no different from a shutdown where controlled shutdown was never attempted? Or is it something more difficult to recover from?

I am still not clear on how to generate the state transition logs.....Does the StateChangeLogMergerTool run against the main logs for the server (and just collate entries there)?
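(On the state transition logs: my understanding is that the change.logger entries above come from the state.change.logger log4j logger, which the stock log4j.properties routes to a separate per-broker state-change.log file. If that's right, the merge tool would be run over those files, roughly like this -- the paths are made up, and the flags are worth double-checking against the wiki page:)

    bin/kafka-run-class.sh kafka.tools.StateChangeLogMergerTool \
      --logs /var/log/kafka/broker10/state-change.log,/var/log/kafka/broker11/state-change.log \
      --topic topicX --partitions 0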
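And to be concrete about "more retries and a longer back-off delay", these are the broker settings I mean (the values are just a first guess at riding out a rolling restart, not a recommendation):

    # server.properties
    controlled.shutdown.enable=true
    # default is 3 attempts
    controlled.shutdown.max.retries=6
    # default is 5000 ms between attempts
    controlled.shutdown.retry.backoff.ms=10000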
On Fri, Oct 25, 2013 at 12:51 PM, Jason Rosenberg <j...@squareup.com> wrote:

> Neha,
>
> It looks like the StateChangeLogMergerTool takes state change logs as
> input. I'm not sure I know where those live? (Maybe update the doc on
> that wiki page to describe!)
>
> Thanks,
>
> Jason
>
>
> On Fri, Oct 25, 2013 at 12:38 PM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>
>> Jason,
>>
>> The state change log tool is described here -
>> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-7.StateChangeLogMergerTool
>>
>> I'm curious what the IOException is and if we can improve error
>> reporting. Can you send around the stack trace?
>>
>> Thanks,
>> Neha
>>
>>
>> On Fri, Oct 25, 2013 at 8:26 AM, Jason Rosenberg <j...@squareup.com> wrote:
>>
>> > Ok,
>> >
>> > Looking at the controlled shutdown code, it appears that it can fail
>> > with an IOException too, in which case it won't report the remaining
>> > partitions to replicate, etc. (I think that might be what I'm seeing,
>> > since I never saw the log line for "controlled shutdown failed, X
>> > remaining partitions", etc.). In my case, that may be the issue (it's
>> > happening during a rolling restart, and the second of 3 nodes might be
>> > trying to shut down before the first one has completely come back up).
>> >
>> > I've heard you guys mention several times now the controller and state
>> > change logs. But I don't know where those live (or how to configure
>> > them). Please advise!
>> >
>> > Thanks,
>> >
>> > Jason
>> >
>> >
>> > On Fri, Oct 25, 2013 at 10:40 AM, Neha Narkhede <neha.narkh...@gmail.com> wrote:
>> >
>> > > Controlled shutdown can fail if the cluster has a non-zero
>> > > under-replicated partition count, since that means the leaders may
>> > > not move off of the broker being shut down, causing controlled
>> > > shutdown to fail. The backoff might help if the under-replication is
>> > > just temporary, due to a spike in traffic. This is the most common
>> > > reason it might fail, besides bugs. But you can check the logs to see
>> > > why the shutdown failed.
>> > >
>> > > Thanks,
>> > > Neha
>> > >
>> > > On Oct 25, 2013 1:18 AM, "Jason Rosenberg" <j...@squareup.com> wrote:
>> > >
>> > > > I'm running into an issue where sometimes the controlled shutdown
>> > > > fails to complete after the default 3 retry attempts. This ended up,
>> > > > in one case, with a broker undergoing an unclean shutdown, and it
>> > > > was in a rather bad state after restart. Producers would connect to
>> > > > the metadata vip, still think that this broker was the leader, fail
>> > > > on that leader, and then reconnect to the metadata vip, and get sent
>> > > > back to that same failed broker! Does that make sense?
>> > > >
>> > > > I'm trying to understand the conditions which cause the controlled
>> > > > shutdown to fail. There doesn't seem to be a setting for the max
>> > > > amount of time to wait, etc.
>> > > >
>> > > > It would be nice to specify how long to try before giving up
>> > > > (hopefully giving up in a graceful way).
>> > > >
>> > > > Instead, we have a retry count, but it's not clear what this retry
>> > > > count is really specifying, in terms of how long to keep trying, etc.
>> > > >
>> > > > Also, what are the ramifications of different settings for
>> > > > controlled.shutdown.retry.backoff.ms? Is there a reason we want to
>> > > > wait before retrying (again, it would be helpful to understand the
>> > > > reasons for a controlled shutdown failure)?
>> > > >
>> > > > Thanks,
>> > > >
>> > > > Jason