Hi, Mayuresh. No, this log is from before broker 61 was restarted. But I found some interesting ZK-related logs on the problem broker:
root@kafka3d:~# zgrep 'zookeeper state changed (Expired)' /var/log/kafka/*/*
/var/log/kafka/2015-10-30/kafka-2015-10-30.log.gz:[2015-10-30 23:02:31,001] 284371992 [main-EventThread] INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (Expired)

root@kafka3d:~# zgrep -i shut /var/log/kafka/2015-10-30/kafka-2015-10-30.log.gz
[2015-10-30 23:02:31,004] 284371995 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
[2015-10-30 23:10:22,206] 284843197 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-77], Shutting down
[2015-10-30 23:10:22,213] 284843204 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-77], Shutdown completed
[2015-10-30 23:10:22,213] 284843204 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-1-77], Shutting down
[2015-10-30 23:10:22,215] 284843206 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-1-77], Shutdown completed
[2015-10-30 23:10:22,215] 284843206 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-2-77], Shutting down
[2015-10-30 23:10:22,396] 284843387 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-2-77], Shutdown completed
[2015-10-30 23:10:22,396] 284843387 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-3-77], Shutting down
[2015-10-30 23:10:22,439] 284843430 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-3-77], Shutdown completed

Is it related to my GC settings?

KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
KAFKA_HEAP_OPTS="-Xmx8G"

I've also attached some GC graphs from JMX.
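Would it make sense to enable GC pause logging on broker 61 and compare the pause timestamps with the 23:02:31 "Expired" event? A rough sketch of the extra JVM flags I have in mind (the log path is only an example, not what the brokers run now):

-Xloggc:/var/log/kafka/kafkaServer-gc.log -verbose:gc \
  -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime

And if the logged pauses turn out to approach the ZooKeeper session timeout, I could also raise zookeeper.session.timeout.ms in server.properties (6000 ms by default, as far as I can tell), e.g.:

# server.properties -- example value, only if long GC pauses are confirmed
zookeeper.session.timeout.ms=10000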
On Tue, Nov 3, 2015 at 1:21 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
> The broker 61 somehow falls behind in fetching from the leader brokers and
> hence falls out of the ISR.
>
> [2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state
> change of replica 61 for partition [test-res-met.server_logs.conv,18] from
> OnlineReplica to OfflineReplica...
> means that the current controller underwent a failure and came back up, but
> some other controller was elected in the meantime. The old controller will
> eventually resign.
> Is this log after you rebounce 61?
>
>
> Thanks,
>
> Mayuresh
>
> On Sat, Oct 31, 2015 at 5:09 AM, Gleb Zhukov <gzhu...@iponweb.net> wrote:
>
> > Hi, Everybody!
> >
> > Every week on Friday night I lose ISR for some partitions in my Kafka
> > cluster:
> >
> > Topic: test-res-met.server_logs.conv Partition: 18 Leader: 45
> > Replicas: 45,61 Isr: 45
> > Current controller: 45
> >
> > Partitions with leader #61 are available; I lose broker #61 only as ISR for
> > partitions with another leader.
> >
> > State logs on broker 61:
> >
> > [2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state
> > change of replica 61 for partition [test-res-met.server_logs.conv,18] from
> > OnlineReplica to OfflineReplica failed (state.change.logger)
> > kafka.common.StateChangeFailedException: Leader and isr path written by
> > another controller. This probablymeans the current controller with epoch
> > 2233 went through a soft failure and another controller was elected with
> > epoch 2234. Aborting state change by this controller
> >         at kafka.controller.KafkaController.removeReplicaFromIsr(KafkaController.scala:1002)
> >         at kafka.controller.ReplicaStateMachine.handleStateChange(ReplicaStateMachine.scala:250)
> >         at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
> >         at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
> >         at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
> >         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
> >         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
> >         at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
> >         at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:114)
> >         at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:451)
> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:373)
> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
> >         at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
> >         at kafka.utils.Utils$.inLock(Utils.scala:535)
> >         at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
> >         at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
> >         at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
> >
> > Restart of the bad broker (#61) helps.
> > We have 7-day retention for our logs (log.retention.hours=168). I also
> > checked ZK and cron. Could anyone explain such an issue? Kafka 0.8.2.1.
> >
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125

--
Best regards,
Gleb Zhukov
IPONWEB