Hello, hope you all are doing well. I am trying to gracefully stop a Broker with SIGTERM (-15). After almost 12 hours the process is still alive, and I do not see any data/replication going in or out of this Broker. These are the logs immediately after sending the SIGTERM signal to the process:
[2019-05-15 20:47:29,637] INFO Terminating process due to signal SIGTERM (org.apache.kafka.common.utils.LoggingSignalHandler)
[2019-05-15 20:47:29,642] INFO [KafkaServer id=1] shutting down (kafka.server.KafkaServer)
[2019-05-15 20:47:29,646] INFO [KafkaServer id=1] Starting controlled shutdown (kafka.server.KafkaServer)
[2019-05-15 20:47:29,746] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(..., ...
[2019-05-15 20:47:30,134] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions Set(__consumer_offsets-46) (kafka.server.ReplicaAlterLogDirsManager)
[2019-05-15 20:47:30,134] INFO [KafkaServer id=1] Controlled shutdown succeeded (kafka.server.KafkaServer)
[2019-05-15 20:47:30,135] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(...) (kafka.server.ReplicaFetcherManager)
[2019-05-15 20:47:30,135] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions Set(...) (kafka.server.ReplicaAlterLogDirsManager)
[2019-05-15 20:47:30,136] INFO [/config/changes-event-process-thread]: Shutting down (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2019-05-15 20:47:30,136] INFO [/config/changes-event-process-thread]: Stopped (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2019-05-15 20:47:30,136] INFO [/config/changes-event-process-thread]: Shutdown completed (kafka.common.ZkNodeChangeNotificationListener$ChangeEventProcessThread)
[2019-05-15 20:47:30,137] INFO [SocketServer brokerId=1] Stopping socket server request processors (kafka.network.SocketServer)
[2019-05-15 20:47:30,137] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions Set(...) (kafka.server.ReplicaFetcherManager)
[2019-05-15 20:47:30,137] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions Set(...) (kafka.server.ReplicaAlterLogDirsManager)
...
[2019-05-15 20:47:30,235] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions Set(...) (kafka.server.ReplicaAlterLogDirsManager)
[2019-05-15 20:47:30,773] INFO [SocketServer brokerId=1] Stopped socket server request processors (kafka.network.SocketServer)
[2019-05-15 20:47:30,773] INFO [data-plane Kafka Request Handler on Broker 1], shutting down (kafka.server.KafkaRequestHandlerPool)
[2019-05-15 20:47:30,813] INFO [data-plane Kafka Request Handler on Broker 1], shut down completely (kafka.server.KafkaRequestHandlerPool)
[2019-05-15 20:49:58,327] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
...
[2019-05-15 22:47:03,400] INFO [ProducerStateManager partition=__consumer_offsets-40] Writing producer snapshot at offset 312043064 (kafka.log.ProducerStateManager)
[2019-05-15 22:47:03,404] ERROR [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=7] Error due to (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.KafkaException: Error processing data for partition __consumer_offsets-40 offset 312043064
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:342)
    at scala.Option.foreach(Option.scala:274)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6(AbstractFetcherThread.scala:300)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$6$adapted(AbstractFetcherThread.scala:299)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$5(AbstractFetcherThread.scala:299)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:299)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3(AbstractFetcherThread.scala:132)
    at kafka.server.AbstractFetcherThread.$anonfun$maybeFetch$3$adapted(AbstractFetcherThread.scala:131)
    at scala.Option.foreach(Option.scala:274)
    at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:131)
    at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:113)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Caused by: java.lang.IllegalStateException: Kafka scheduler is not running.
    at kafka.utils.KafkaScheduler.ensureRunning(KafkaScheduler.scala:140)
    at kafka.utils.KafkaScheduler.schedule(KafkaScheduler.scala:110)
    at kafka.log.Log.$anonfun$roll$2(Log.scala:1675)
    at kafka.log.Log.maybeHandleIOException(Log.scala:2013)
    at kafka.log.Log.roll(Log.scala:1616)
    at kafka.log.Log.maybeRoll(Log.scala:1601)
    at kafka.log.Log.$anonfun$append$2(Log.scala:934)
    at kafka.log.Log.maybeHandleIOException(Log.scala:2013)
    at kafka.log.Log.append(Log.scala:827)
    at kafka.log.Log.appendAsFollower(Log.scala:807)
    at kafka.cluster.Partition.$anonfun$doAppendRecordsToFollowerOrFutureReplica$1(Partition.scala:708)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:251)
    at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:257)
    at kafka.cluster.Partition.doAppendRecordsToFollowerOrFutureReplica(Partition.scala:699)
    at kafka.cluster.Partition.appendRecordsToFollowerOrFutureReplica(Partition.scala:715)
    at kafka.server.ReplicaFetcherThread.processPartitionData(ReplicaFetcherThread.scala:157)
    at kafka.server.AbstractFetcherThread.$anonfun$processFetchRequest$7(AbstractFetcherThread.scala:311)
    ... 16 more
[2019-05-15 22:47:03,404] INFO [ReplicaFetcher replicaId=1, leaderId=4, fetcherId=7] Stopped (kafka.server.ReplicaFetcherThread)
[2019-05-15 22:49:58,327] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)
...

Now, almost 12 hours later and despite the logs indicating that the Broker shut down gracefully, it keeps printing entries like:

[2019-05-16 09:09:58,327] INFO [GroupMetadataManager brokerId=1] Removed 0 expired offsets in 0 milliseconds. (kafka.coordinator.group.GroupMetadataManager)

But the process is still alive.
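In case it helps to reason about the symptoms, here is a minimal, self-contained plain-Java sketch (not Kafka's actual code; ShutdownSketch and GuardedScheduler are made-up names) of the two behaviours I think the logs show: a scheduler that rejects work once it has been stopped, which looks like the kind of guard behind the "Kafka scheduler is not running" IllegalStateException above, and a periodic task on a non-daemon thread pool that nobody shuts down, which keeps a JVM process alive even after main() returns, much like the recurring GroupMetadataManager lines:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class ShutdownSketch {

    // (1) Guarded scheduler: once stopped, any further schedule() call fails fast.
    static class GuardedScheduler {
        private final AtomicBoolean running = new AtomicBoolean(true);

        void shutdown() {
            running.set(false);
        }

        void schedule(Runnable task) {
            if (!running.get()) {
                throw new IllegalStateException("scheduler is not running");
            }
            task.run(); // illustration only; a real scheduler would enqueue the task
        }
    }

    public static void main(String[] args) {
        GuardedScheduler scheduler = new GuardedScheduler();
        scheduler.shutdown();
        try {
            scheduler.schedule(() -> System.out.println("roll log segment"));
        } catch (IllegalStateException e) {
            // Mirrors the fetcher thread failing to roll a segment after the
            // broker's scheduler had already been stopped during shutdown.
            System.out.println("schedule rejected: " + e.getMessage());
        }

        // (2) A periodic job on a non-daemon thread pool that is never shut down.
        // main() returns, but the JVM stays alive and keeps printing, similar to
        // the GroupMetadataManager lines repeating every few minutes.
        ScheduledExecutorService leaked = Executors.newSingleThreadScheduledExecutor();
        leaked.scheduleAtFixedRate(
                () -> System.out.println("Removed 0 expired offsets"),
                0, 5, TimeUnit.SECONDS);
        // Without leaked.shutdown(), this process never exits on its own.
    }
}

My suspicion is that something like the second case is happening inside the Broker JVM, i.e. some non-daemon thread is still running after the controlled shutdown, but I am not sure which one or why.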
Given these symptoms, what could/should I check? I hope you can give me a clue.

Cheers!
--
Santilli Jonathan