[jira] [Comment Edited] (KAFKA-6188) Broker fails with FATAL Shutdown - log dirs have failed

M. Manna (JIRA) Fri, 18 May 2018 07:02:19 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16480686#comment-16480686
 ]


M. Manna edited comment on KAFKA-6188 at 5/18/18 2:01 PM:
----------------------------------------------------------

[~yuzhih...@gmail.com] I keep hitting a blocker at this location. There 
are two FATAL shutdown paths, and this one (in LogManager) is causing failures 
during startup:

 

{{  // dir should be an absolute path}}
{{  def handleLogDirFailure(dir: String) {}}
{{    info(s"Stopping serving logs in dir $dir")}}
{{    logCreationOrDeletionLock synchronized {}}
{{      _liveLogDirs.remove(new File(dir))}}
{{      if (_liveLogDirs.isEmpty) {}}
{{        fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")}}
{{        Exit.halt(1)}}
{{      }}}

Since the queue is empty (i.e. no valid/clean log directory exists), the broker 
shuts down fatally. If we suppress the shutdown here, does that mean the broker 
will keep failing gracefully while log segments/offsets keep growing, in 
violation of the scheduled retention/cleanup policy?
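
To make the question concrete, here is a minimal sketch of the bookkeeping in the quoted snippet. It is not Kafka's LogManager: the class name LiveLogDirs, the onAllFailed callback (standing in for Exit.halt(1)) and the demo object are hypothetical names introduced only for illustration. With a single log directory, as in the original report, the first failure already empties the queue, so replacing the halt with a no-op would leave the broker with no live directory at all.

```scala
import java.io.File
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical stand-in for the quoted snippet; not Kafka's actual LogManager.
// onAllFailed replaces Exit.halt(1) so the behavior can be observed safely.
final class LiveLogDirs(initial: Seq[File], onAllFailed: () => Unit) {
  private val live = new ConcurrentLinkedQueue[File]()
  initial.foreach(d => live.add(d))

  // Drop the failed directory; react only when nothing is left to serve from.
  def handleLogDirFailure(dir: File): Unit = live.synchronized {
    live.remove(dir)
    if (live.isEmpty) onAllFailed()
  }

  def liveCount: Int = live.size
}

object LiveLogDirsDemo {
  // Returns (number of live dirs after the failure, whether the callback fired).
  def run(): (Int, Boolean) = {
    var halted = false
    val only = new File("kafka-logs") // assumed single-directory setup
    val dirs = new LiveLogDirs(Seq(only), () => { halted = true })
    dirs.handleLogDirFailure(only)
    (dirs.liveCount, halted)
  }
}
```

In this sketch the callback is the only thing that reacts to the empty queue; whatever replaces the halt would have to decide what to do with zero live directories.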


Additionally, do you believe that a KIP is needed to introduce a "blocking" log 
cleanup logic such that:

1) Old files are closed, renamed and copied so that new segments can be 
"opened" for writing.

2) Renaming occurs for all old segments - since they are closed, we can safely 
delete them.

3) All of the above should use some locking/unlocking where applicable.
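
The steps above could be sketched roughly as follows. This is only an illustration of the close, rename, delete ordering under a lock, not a proposal for the actual Kafka code: the object name SegmentCleanupSketch, the retire helper, the ".deleted" suffix and the temporary files in demo are all assumptions made for this example, and the "copied" part of step 1 is left out.

```scala
import java.nio.channels.FileChannel
import java.nio.file.{Files, Path, StandardOpenOption}
import java.util.concurrent.locks.ReentrantLock

// Hypothetical sketch of a "blocking" cleanup step; names are illustrative only.
object SegmentCleanupSketch {
  private val lock = new ReentrantLock()

  // Close the handle first, then rename, then delete, all under one lock.
  def retire(segment: Path, channel: FileChannel): Path = {
    lock.lock()
    try {
      channel.close() // step 1: no open handle remains on the old segment
      val renamed =
        segment.resolveSibling(segment.getFileName.toString + ".deleted")
      Files.move(segment, renamed)  // step 2: rename the closed file
      Files.deleteIfExists(renamed) // step 2: safe to delete once closed
      renamed
    } finally lock.unlock()         // step 3: always release the lock
  }

  // Returns (old segment still exists, renamed file still exists).
  def demo(): (Boolean, Boolean) = {
    val dir = Files.createTempDirectory("segment-sketch")
    val segment = Files.write(dir.resolve("00000000000000000000.log"),
                              Array[Byte](1, 2, 3))
    val channel = FileChannel.open(segment, StandardOpenOption.READ)
    val renamed = retire(segment, channel)
    (Files.exists(segment), Files.exists(renamed))
  }
}
```

The single coarse lock keeps the example short; a real change would need to decide which existing lock (if any) should guard these steps.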

 

 

 


was (Author: manme...@gmail.com):
[~yuzhih...@gmail.com] I am constantly hitting blocker at this location. There 
are two FATAL shutdowns and this one is causing issues during startup 
(LogManager):

 

{{  // dir should be an absolute path}}
{{  def handleLogDirFailure(dir: String) {}}
{{    info(s"Stopping serving logs in dir $dir")}}
{{    logCreationOrDeletionLock synchronized {}}
{{      _liveLogDirs.remove(new File(dir))}}
{{      if (_liveLogDirs.isEmpty) {}}
{{        fatal(s"Shutdown broker because all log dirs in ${logDirs.mkString(", ")} have failed")}}
{{        Exit.halt(1)}}
{{      }}}

Since the queue is empty (i.e. no valid/clean log directory exists) it wants to 
fatally shutdown. If we stop the shutdown here, does it mean that it will keep 
failing gracefully and messages will keep growing? 
Additionally, do you believe that a KIP is needed to address a "blocking" log 
cleanup logic such that

1) Old files are closed, renamed and copied so that new segments can be 
"opened" for writing.

2) Renaming occurs for all old segments - since they are closed, we can safely 
delete them

3) All the above should use some locking/unlocking where applicable.

 

 

 

> Broker fails with FATAL Shutdown - log dirs have failed
> -------------------------------------------------------
>
>                 Key: KAFKA-6188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6188
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, log
>    Affects Versions: 1.0.0, 1.0.1
>         Environment: Windows 10
>            Reporter: Valentina Baljak
>            Priority: Blocker
>              Labels: windows
>         Attachments: kafka_2.10-0.10.2.1.zip, output.txt
>
>
> Just started with version 1.0.0 after 4-5 months of using 0.10.2.1. The 
> test environment is very simple, with only one producer and one consumer. 
> Initially, everything started fine, and standalone tests worked as expected. 
> However, when running my code, Kafka clients fail after approximately 10 
> minutes. Kafka won't start after that, and it fails with the same error. 
> Deleting the logs lets it start again, and then the same problem recurs.
> Here is the error traceback:
> [2017-11-08 08:21:57,532] INFO Starting log cleanup with a period of 300000 
> ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,548] INFO Starting log flusher with a default period of 
> 9223372036854775807 ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,798] INFO Awaiting socket connections on 0.0.0.0:9092. 
> (kafka.network.Acceptor)
> [2017-11-08 08:21:57,813] INFO [SocketServer brokerId=0] Started 1 acceptor 
> threads (kafka.network.SocketServer)
> [2017-11-08 08:21:57,829] INFO [ExpirationReaper-0-Produce]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-DeleteRecords]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-Fetch]: Starting 
> (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [LogDirFailureHandler]: Starting 
> (kafka.server.ReplicaManager$LogDirFailureHandler)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Stopping serving 
> replicas in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Partitions  are 
> offline due to failure on log directory C:\Kafka\kafka_2.12-1.0.0\kafka-logs 
> (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaFetcherManager on broker 0] Removed 
> fetcher for partitions  (kafka.server.ReplicaFetcherManager)
> [2017-11-08 08:21:57,892] INFO [ReplicaManager broker=0] Broker 0 stopped 
> fetcher for partitions  because they are in the failed log dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,892] INFO Stopping serving logs in dir 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.log.LogManager)
> [2017-11-08 08:21:57,892] FATAL Shutdown broker because all log dirs in 
> C:\Kafka\kafka_2.12-1.0.0\kafka-logs have failed (kafka.log.LogManager)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
