I am not sure about this, but it could well be your GC settings: a stop-the-world pause longer than the broker's ZooKeeper session timeout would produce exactly the "zookeeper state changed (Expired)" event you found, and once the session expires the controller treats broker 61 as failed. What I can't explain is why it only happens on Friday night.
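If you want to confirm or rule out GC, it might be worth turning on GC logging on broker 61 and checking whether a long pause lines up with the 23:02:31 session expiration. A minimal sketch, assuming you export the options in the same place where you already set KAFKA_JVM_PERFORMANCE_OPTS (the log path here is just an example):

# Append HotSpot GC-logging flags (Java 7/8 syntax) to the options you already use,
# so every pause gets a wall-clock timestamp in its own log file.
export KAFKA_JVM_PERFORMANCE_OPTS="$KAFKA_JVM_PERFORMANCE_OPTS \
  -verbose:gc \
  -Xloggc:/var/log/kafka/kafka-gc.log \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime"

# After the next Friday-night incident, look for pauses around the time of the
# expiration; any single stop longer than zookeeper.session.timeout.ms
# (6000 ms by default in 0.8.2) is enough to expire the broker's ZK session.
grep 'Total time for which application threads were stopped' /var/log/kafka/kafka-gc.log

If the GC log does show multi-second pauses at that time, the usual mitigations are giving the broker more headroom or raising zookeeper.session.timeout.ms in server.properties, but I would confirm with the GC log first.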
Thanks,
Mayuresh

On Tue, Nov 3, 2015 at 3:01 AM, Gleb Zhukov <gzhu...@iponweb.net> wrote:

> Hi, Mayuresh. No, this log is from before broker 61 was restarted.
> But I found some interesting ZK-related logs on the problem broker:
>
> root@kafka3d:~# zgrep 'zookeeper state changed (Expired)' /var/log/kafka/*/*
> /var/log/kafka/2015-10-30/kafka-2015-10-30.log.gz:[2015-10-30 23:02:31,001] 284371992 [main-EventThread] INFO org.I0Itec.zkclient.ZkClient - zookeeper state changed (Expired)
>
> root@kafka3d:~# zgrep -i shut /var/log/kafka/2015-10-30/kafka-2015-10-30.log.gz
> [2015-10-30 23:02:31,004] 284371995 [main-EventThread] INFO org.apache.zookeeper.ClientCnxn - EventThread shut down
>     at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
> [2015-10-30 23:10:22,206] 284843197 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-77], Shutting down
> [2015-10-30 23:10:22,213] 284843204 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-0-77], Shutdown completed
> [2015-10-30 23:10:22,213] 284843204 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-1-77], Shutting down
> [2015-10-30 23:10:22,215] 284843206 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-1-77], Shutdown completed
> [2015-10-30 23:10:22,215] 284843206 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-2-77], Shutting down
> [2015-10-30 23:10:22,396] 284843387 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-2-77], Shutdown completed
> [2015-10-30 23:10:22,396] 284843387 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-3-77], Shutting down
> [2015-10-30 23:10:22,439] 284843430 [kafka-request-handler-2] INFO kafka.server.ReplicaFetcherThread - [ReplicaFetcherThread-3-77], Shutdown completed
>
> Is it related to my GC settings?
>
> KAFKA_JVM_PERFORMANCE_OPTS="-server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35"
> KAFKA_HEAP_OPTS="-Xmx8G"
>
> I also attached some GC graphs from JMX.
>
> On Tue, Nov 3, 2015 at 1:21 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
>
>> Broker 61 somehow falls behind in fetching from the leader brokers and hence falls out of the ISR.
>>
>> [2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state change of replica 61 for partition [test-res-met.server_logs.conv,18] from OnlineReplica to OfflineReplica...
>> means that the current controller underwent a failure and came back up, but some other controller was elected in the meantime. The old controller will eventually resign.
>> Is this log from after you bounced 61?
>>
>> Thanks,
>>
>> Mayuresh
>>
>> On Sat, Oct 31, 2015 at 5:09 AM, Gleb Zhukov <gzhu...@iponweb.net> wrote:
>>
>> > Hi, everybody!
>> >
>> > Every week on Friday night I lose the ISR for some partitions in my Kafka cluster:
>> >
>> > Topic: test-res-met.server_logs.conv Partition: 18 Leader: 45 Replicas: 45,61 Isr: 45
>> > Current controller: 45
>> >
>> > Partitions with leader #61 stay available; I only lose broker #61 from the ISR of partitions led by other brokers.
>> >
>> > State-change logs on broker 61:
>> >
>> > [2015-10-30 23:02:34,012] ERROR Controller 61 epoch 2233 initiated state change of replica 61 for partition [test-res-met.server_logs.conv,18] from OnlineReplica to OfflineReplica failed (state.change.logger)
>> > kafka.common.StateChangeFailedException: Leader and isr path written by another controller. This probablymeans the current controller with epoch 2233 went through a soft failure and another controller was elected with epoch 2234. Aborting state change by this controller
>> >     at kafka.controller.KafkaController.removeReplicaFromIsr(KafkaController.scala:1002)
>> >     at kafka.controller.ReplicaStateMachine.handleStateChange(ReplicaStateMachine.scala:250)
>> >     at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
>> >     at kafka.controller.ReplicaStateMachine$$anonfun$handleStateChanges$2.apply(ReplicaStateMachine.scala:114)
>> >     at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:153)
>> >     at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>> >     at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>> >     at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:306)
>> >     at kafka.controller.ReplicaStateMachine.handleStateChanges(ReplicaStateMachine.scala:114)
>> >     at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:451)
>> >     at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ReplicaStateMachine.scala:373)
>> >     at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>> >     at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1$$anonfun$apply$mcV$sp$1.apply(ReplicaStateMachine.scala:359)
>> >     at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:33)
>> >     at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply$mcV$sp(ReplicaStateMachine.scala:358)
>> >     at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>> >     at kafka.controller.ReplicaStateMachine$BrokerChangeListener$$anonfun$handleChildChange$1.apply(ReplicaStateMachine.scala:357)
>> >     at kafka.utils.Utils$.inLock(Utils.scala:535)
>> >     at kafka.controller.ReplicaStateMachine$BrokerChangeListener.handleChildChange(ReplicaStateMachine.scala:356)
>> >     at org.I0Itec.zkclient.ZkClient$7.run(ZkClient.java:568)
>> >     at org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:71)
>> >
>> > Restarting the bad broker (#61) helps.
>> > We have 7-day retention for our logs (log.retention.hours=168). I also checked ZK and cron. Could anyone explain such an issue? Kafka 0.8.2.1.
>> >
>>
>> --
>> -Regards,
>> Mayuresh R. Gharat
>> (862) 250-7125
>>
>
> --
> Best regards,
> Gleb Zhukov
> IPONWEB
>

--
-Regards,
Mayuresh R. Gharat
(862) 250-7125