[
https://issues.apache.org/jira/browse/KAFKA-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497616#comment-14497616
]
Jason Rosenberg commented on KAFKA-2125:
----------------------------------------
So, I'd think the exception there, just before the 'NoSuchElementException',
seems relevant, since it complains about the one partition that is then
complained about repeatedly afterward....
> Infinite loop after controlled shutdown succeeds
> ------------------------------------------------
>
> Key: KAFKA-2125
> URL: https://issues.apache.org/jira/browse/KAFKA-2125
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8.2.1
> Reporter: Jason Rosenberg
> Priority: Blocker
> Attachments: grep_shut_edited.log
>
>
> I have a 4 node cluster, running 0.8.2.1, that got into a bad state last
> night during a rolling restart. The first node to be restarted was the
> controller. Controlled Shutdown completed successfully, after about 800ms.
> But after that, the server failed to shutdown, and got into what appears to
> be an infinite cycle, involving the 'Controller-45-to-broker-45-send-thread',
> and the 'kafka-scheduler-0' thread. Ultimately, the shutdown timed out
> (after 3 minutes) and it was terminated by the deployment system, and the
> server was restarted. As expected, when it came back up it took some time
> re-syncing its partitions, but eventually came back and all was well.
> However, I think there was something fundamentally wrong with the shutdown
> process. The controller didn't seem to give up controller status, for one
> thing, as part of the controlled shutdown (which I should think would be the
> first thing it should do?).
> Anyway, below are some log snippets. I do have full logs from each broker in
> the cluster, which I can provide (but would have to significantly anonymize
> the logs before forwarding along).
> Controlled shutdown starts and succeeds:
> {code}
> 2015-04-14 05:56:10,134 INFO [Thread-38] server.KafkaServer - [Kafka Server
> 45], shutting down
> 2015-04-14 05:56:10,136 INFO [Thread-38] server.KafkaServer - [Kafka Server
> 45], Starting controlled shutdown
> 2015-04-14 05:56:10,150 INFO [kafka-request-handler-0]
> controller.KafkaController - [Controller 45]: Shutting down broker 45
> ...
> ...
> 2015-04-14 05:56:10,951 INFO [Thread-38] server.KafkaServer - [Kafka Server
> 45], Controlled shutdown succeeded
> {code}
> Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly
> spamming the logs, like so:
> {code}
> 2015-04-14 05:56:11,281 WARN [Controller-45-to-broker-45-send-thread]
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
> Controller 45 epoch 21 fails to send request
> Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
> ->
> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
> to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
> at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
> at
> kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
> at
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,281 INFO [Controller-45-to-broker-45-send-thread]
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
> Controller 45 connected to id:45,host:broker45.square,port:12345 for sending
> state change requests
> 2015-04-14 05:56:11,582 WARN [Controller-45-to-broker-45-send-thread]
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
> Controller 45 epoch 21 fails to send request
> Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1)
> ->
> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48)
> to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
> at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
> at
> kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
> at
> kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
> at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,582 INFO [Controller-45-to-broker-45-send-thread]
> controller.RequestSendThread - [Controller-45-to-broker-45-send-thread],
> Controller 45 connected to id:45,host:broker45.square,port:12345 for sending
> state change requests
> {code}
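For context, the retry pattern visible in those stack traces can be sketched roughly as follows. This is a minimal illustration, not Kafka's actual code; the `Channel` interface and `sendWithRetry` name are hypothetical. The point is that the send loop has no exit condition other than a successful send, so once broker 45's socket server is gone, every attempt throws `ClosedChannelException` and the WARN/INFO pair repeats forever (every ~300 ms in the logs above):

```java
import java.nio.channels.ClosedChannelException;

// Minimal sketch (NOT Kafka's actual code) of the retry behavior the
// Controller-45-to-broker-45-send-thread appears to exhibit: on a send
// failure it reconnects and retries the same request, with no bound on
// attempts. Against a broker whose socket server has already shut down,
// every attempt fails, so the loop never terminates.
public class SendLoopSketch {
    // Hypothetical stand-in for kafka.network.BlockingChannel.
    interface Channel {
        void send(String request) throws ClosedChannelException;
    }

    // Returns the number of attempts it took to deliver the request.
    static int sendWithRetry(Channel channel, String request) {
        int attempts = 0;
        while (true) {              // no exit condition other than success
            attempts++;
            try {
                channel.send(request);
                return attempts;    // success: request delivered
            } catch (ClosedChannelException e) {
                // A reconnect would happen here; in the logs this is the
                // "Reconnecting to broker" WARN followed by the
                // "connected to id:45" INFO, roughly every 300 ms.
            }
        }
    }
}
```

If the channel ever comes back, the loop exits on the first successful send; the problem here is that the target broker is shutting down, so it never does.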
> This seems to repeat every 300 ms or so, and continues until the server is
> forcibly shutdown (for about 3 minutes in this case). The partition being
> logged about there (anonymized as 'mytopic,1') is always the same
> topic/partition in these repeated log messages. So perhaps that one partition
> got into a bad state.
> Interspersed, there are floods of error logs like these, repeated seemingly
> for every partition on the broker (the broker has about 450 partitions). It's
> trying to shrink the ISR to just the current broker (EVEN THOUGH IT JUST
> COMPLETED CONTROLLED SHUTDOWN for all partitions). The following sequence
> goes on in bursts of about 500ms, once for each partition on the box, and
> then recurs every 10 seconds:
> {code}
> 2015-04-14 05:56:23,716 INFO [kafka-scheduler-0] cluster.Partition -
> Partition [mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0]
> from 46,45 to 45
> 2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition -
> Partition [mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that
> in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,720 INFO [kafka-scheduler-0] cluster.Partition -
> Partition [mytopic3,13] on broker 45: Shrinking ISR for partition
> [mytopic3,13] from 45,48 to 45
> 2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition -
> Partition [mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that
> in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,723 INFO [kafka-scheduler-0] cluster.Partition -
> Partition [mytopic4,12] on broker 45: Shrinking ISR for partition
> [mytopic4,12] from 46,45 to 45
> 2015-04-14 05:56:23,726 INFO [kafka-scheduler-0] cluster.Partition -
> Partition [mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that
> in zookeeper, skip updating ISR
> {code}
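The "Cached zkVersion [...] not equal" skip can be sketched as a version-guarded (compare-and-set style) update, which is roughly what the log wording suggests. This is a hypothetical illustration, not Kafka's actual implementation: the ISR write is only applied if the broker's cached znode version matches what ZooKeeper currently holds, and since nothing here refreshes the stale cache, the scheduled task skips the same update on every pass:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch (NOT Kafka's actual logic) of the conditional update
// behind the "Cached zkVersion [...] not equal to that in zookeeper,
// skip updating ISR" lines. A broker holding a stale cached version
// fails the version check, skips the write, and (with nothing refreshing
// its cache) fails the same check again on the next 10-second pass.
public class IsrShrinkSketch {
    // Simulated current znode versions in ZooKeeper, keyed by partition.
    final Map<String, Integer> zkVersions = new HashMap<>();

    // Returns true if the shrink was applied, false if skipped.
    boolean maybeShrinkIsr(String partition, int cachedZkVersion) {
        int current = zkVersions.getOrDefault(partition, 0);
        if (cachedZkVersion != current) {
            // Stale cache: skip, exactly as the INFO lines above show.
            return false;
        }
        // Version matches: apply the update and bump the version,
        // mimicking ZooKeeper's conditional setData.
        zkVersions.put(partition, current + 1);
        return true;
    }
}
```

Under this reading, the controller moved leadership away during controlled shutdown (bumping the znode version), but broker 45 still believes it leads these partitions with a stale cache, so every shrink attempt is skipped indefinitely.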
> The above sequence is repeated every 10 seconds, it seems.
> So there appear to be 2 different repeating non-terminating loops:
> 1. the Controller-45-to-broker-45-send-thread repeating every 300ms
> 2. the kafka-scheduler thread attempting to shrink the ISR for every
> partition the box, every 10 seconds.
> It seems clear that this sequence would recur indefinitely had the app not
> been killed hard.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)