[ https://issues.apache.org/jira/browse/KAFKA-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496898#comment-14496898 ]

Jason Rosenberg commented on KAFKA-2125:
----------------------------------------

Ooh, it looks like before the 500 repeated exceptions, the first one was a little 
different, which I neglected to include previously:
{code}
2015-04-14 05:56:10,973  INFO [kafka-request-handler-6] log.Log - Truncating log mytopic-1 to offset 13860934.
2015-04-14 05:56:10,980  WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
        at kafka.utils.Utils$.read(Utils.scala:381)
        at kafka.network.BoundedByteBufferReceive.readFrom(BoundedByteBufferReceive.scala:54)
        at kafka.network.Receive$class.readCompletely(Transmission.scala:56)
        at kafka.network.BoundedByteBufferReceive.readCompletely(BoundedByteBufferReceive.scala:29)
        at kafka.network.BlockingChannel.receive(BlockingChannel.scala:111)
        at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:133)
        at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
2015-04-14 05:56:10,981  INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
2015-04-14 05:56:10,993 ERROR [kafka-network-thread-27330-2] utils.Utils$ - Uncaught exception in thread 'kafka-network-thread-12345-2':
java.util.NoSuchElementException: None.get
        at scala.None$.get(Option.scala:347)
        at scala.None$.get(Option.scala:345)
        at kafka.network.ConnectionQuotas.dec(SocketServer.scala:524)
        at kafka.network.AbstractServerThread.close(SocketServer.scala:165)
        at kafka.network.AbstractServerThread.close(SocketServer.scala:157)
        at kafka.network.Processor.close(SocketServer.scala:374)
        at kafka.network.AbstractServerThread.closeAll(SocketServer.scala:180)
        at kafka.network.Processor.run(SocketServer.scala:364)
        at java.lang.Thread.run(Thread.java:745)
{code}
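
For what it's worth, that None.get is a plain Option.get on a map lookup. Here's a rough 
sketch of the per-IP connection-count pattern involved (class and method names are 
approximate, not the actual Kafka source): if close() runs for an address that is no 
longer tracked, e.g. because its count was already dropped while the processors tear 
everything down during shutdown, the lookup returns None and .get throws.
{code}
import java.net.InetAddress
import scala.collection.mutable

// Illustrative sketch only; not the real kafka.network.ConnectionQuotas.
class ConnectionQuotasSketch {
  private val counts = mutable.Map[InetAddress, Int]()

  def inc(address: InetAddress): Unit = counts.synchronized {
    counts(address) = counts.getOrElse(address, 0) + 1
  }

  def dec(address: InetAddress): Unit = counts.synchronized {
    // Assumes the address is always tracked; if it isn't, counts.get(address)
    // is None and .get throws java.util.NoSuchElementException: None.get
    val count = counts.get(address).get
    if (count == 1) counts.remove(address)
    else counts(address) = count - 1
  }
}
{code}
That would line up with the exception only showing up while the network threads are 
closing all connections (the closeAll frame in the trace above).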

> Infinite loop after controlled shutdown succeeds
> ------------------------------------------------
>
>                 Key: KAFKA-2125
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2125
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Jason Rosenberg
>            Priority: Blocker
>         Attachments: grep_shut_edited.log
>
>
> I have a 4 node cluster, running 0.8.2.1, that got into a bad state last 
> night during a rolling restart.  The first node to be restarted was the 
> controller.  Controlled Shutdown completed successfully, after about 800ms.  
> But after that, the server failed to shut down, and got into what appears to 
> be an infinite cycle, involving the 'Controller-45-to-broker-45-send-thread', 
> and the 'kafka-scheduler-0' thread.  Ultimately, the shutdown timed out 
> (after 3 minutes) and it was terminated by the deployment system, and the 
> server was restarted.  As expected, when it came back up it took some time 
> re-syncing its partitions, but eventually came back and all was well.  
> However, I think there was something fundamentally wrong with the shutdown 
> process.  The controller didn't seem to give up controller status, for one 
> thing, as part of the controlled shutdown (which I should think would be the 
> first thing it should do?).
> Anyway, below are some log snippets. I do have full logs from each broker in 
> the cluster, which I can provide (but would have to significantly anonymize 
> the logs before forwarding along).
> Controlled shutdown starts and succeeds:
> {code}
> 2015-04-14 05:56:10,134  INFO [Thread-38] server.KafkaServer - [Kafka Server 45], shutting down
> 2015-04-14 05:56:10,136  INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Starting controlled shutdown
> 2015-04-14 05:56:10,150  INFO [kafka-request-handler-0] controller.KafkaController - [Controller 45]: Shutting down broker 45
> ...
> ...
> 2015-04-14 05:56:10,951  INFO [Thread-38] server.KafkaServer - [Kafka Server 45], Controlled shutdown succeeded
> {code}
> Subsequently, the 'Controller-45-to-broker-45-send-thread' starts repeatedly 
> spamming the logs, like so:
> {code}
> 2015-04-14 05:56:11,281  WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,281  INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
> 2015-04-14 05:56:11,582  WARN [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 epoch 21 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:45;ControllerEpoch:21;CorrelationId:305175;ClientId:id_45-host_null-port_12345;Leaders:id:48,host:broker48.square,port:12345;PartitionState:(mytopic,1) -> (LeaderAndIsrInfo:(Leader:48,ISR:48,LeaderEpoch:5,ControllerEpoch:21),ReplicationFactor:2),AllReplicas:45,48) to broker id:45,host:broker45.square,port:12345. Reconnecting to broker.
> java.nio.channels.ClosedChannelException
>         at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
>         at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
>         at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> 2015-04-14 05:56:11,582  INFO [Controller-45-to-broker-45-send-thread] controller.RequestSendThread - [Controller-45-to-broker-45-send-thread], Controller 45 connected to id:45,host:broker45.square,port:12345 for sending state change requests
> {code}
> This seems to repeat every 300 ms or so, and continues until the server is 
> forcibly shut down (for about 3 minutes in this case).  The partition being 
> logged about there (anonymized as 'mytopic,1') is always the same 
> topic/partition in these repeated log messages, so perhaps that one partition 
> got into a bad state.
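> To make the 300 ms cadence concrete, here is a rough, self-contained sketch of the 
> send-thread pattern implied by the log (names are approximate, not the actual Kafka 
> source): the request is taken off the controller's queue once, and every failure just 
> reconnects, sleeps ~300 ms, and retries the same request, so it can never terminate on 
> its own once the target broker's socket is gone.
> {code}
> import java.nio.channels.ClosedChannelException
> import java.util.concurrent.LinkedBlockingQueue
>
> // Illustrative sketch only; not the real kafka.controller.RequestSendThread.
> object SendThreadSketch {
>   private val queue = new LinkedBlockingQueue[String]()
>
>   // Stands in for sending to a broker whose socket is already closed.
>   private def send(request: String): Unit = throw new ClosedChannelException
>
>   def main(args: Array[String]): Unit = {
>     queue.put("LeaderAndIsrRequest for (mytopic,1)")
>     val request = queue.take()   // the request is dequeued exactly once ...
>     while (true) {               // ... and retried forever once sending fails
>       try {
>         send(request)
>       } catch {
>         case e: Throwable =>
>           println(s"fails to send request $request: $e. Reconnecting to broker.")
>           Thread.sleep(300)      // matches the ~300 ms spacing between log entries
>       }
>     }
>   }
> }
> {code}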
> Interspersed, there are floods of log messages like these, which are repeated 
> seemingly for every partition on the broker (the broker has about 450 
> partitions).  It's trying to shrink the ISR down to just the current broker 
> (EVEN THOUGH IT JUST COMPLETED CONTROLLED SHUTDOWN for all partitions).  The 
> following sequence goes on in bursts of 500ms, once for each partition on the 
> box (and then recurring every 10 seconds):
> {code}
> 2015-04-14 05:56:23,716  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Shrinking ISR for partition [mytopic2,0] from 46,45 to 45
> 2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic2,0] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,720  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Shrinking ISR for partition [mytopic3,13] from 45,48 to 45
> 2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic3,13] on broker 45: Cached zkVersion [8] not equal to that in zookeeper, skip updating ISR
> 2015-04-14 05:56:23,723  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Shrinking ISR for partition [mytopic4,12] from 46,45 to 45
> 2015-04-14 05:56:23,726  INFO [kafka-scheduler-0] cluster.Partition - Partition [mytopic4,12] on broker 45: Cached zkVersion [19] not equal to that in zookeeper, skip updating ISR
> {code}
> The above sequence is repeated every 10 seconds, it seems.
> So there appear to be 2 different repeating non-terminating loops:
> 1. the Controller-45-to-broker-45-send-thread repeating every 300ms
> 2. the kafka-scheduler thread attempting to shrink the ISR for every 
> partition on the box, every 10 seconds.
> It seems clear that this sequence would recur indefinitely had the app not 
> been killed hard.
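> For loop 2, here is a rough sketch of the cached-zkVersion pattern suggested by the 
> "skip updating ISR" messages (again approximate, not the actual Kafka source): a 
> periodic ISR-expiration task attempts a conditional update against zookeeper, the stale 
> cached zkVersion makes the update fail, the cache is never refreshed, and the identical 
> attempt repeats on the next 10-second tick.
> {code}
> import java.util.concurrent.{Executors, TimeUnit}
>
> // Illustrative sketch only; not the real kafka.cluster.Partition / ReplicaManager.
> object IsrShrinkSketch {
>   final case class PartitionState(isr: Set[Int], zkVersion: Int)
>
>   // Stand-ins for the broker's cached state and the state actually in ZooKeeper
>   // (which has since been rewritten, so its version is ahead of the cache).
>   @volatile private var cached    = PartitionState(isr = Set(46, 45), zkVersion = 19)
>   @volatile private var zookeeper = PartitionState(isr = Set(46), zkVersion = 20)
>
>   private def maybeShrinkIsr(): Unit = {
>     println(s"Shrinking ISR for [mytopic2,0] from ${cached.isr.mkString(",")} to 45")
>     if (cached.zkVersion != zookeeper.zkVersion) {
>       // The conditional update is skipped and the cached state is left untouched,
>       // so the exact same shrink attempt repeats on the next tick.
>       println(s"Cached zkVersion [${cached.zkVersion}] not equal to that in zookeeper, skip updating ISR")
>     } else {
>       zookeeper = PartitionState(isr = Set(45), zkVersion = zookeeper.zkVersion + 1)
>       cached = zookeeper
>     }
>   }
>
>   def main(args: Array[String]): Unit = {
>     // The real task covers every partition on the broker; one is enough to show the loop.
>     val scheduler = Executors.newSingleThreadScheduledExecutor()
>     scheduler.scheduleAtFixedRate(() => maybeShrinkIsr(), 0L, 10L, TimeUnit.SECONDS)
>   }
> }
> {code}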


