[ https://issues.apache.org/jira/browse/KAFKA-6188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16482760#comment-16482760 ]
M. Manna commented on KAFKA-6188:
---------------------------------

[~lindong] Hey Dong. Thanks for your valuable comments on the issue. I don't believe we can say it's a disk issue. This has occurred in the past and still exists in 1.1.0, and I have tried it on a clean machine with a fresh hard disk: the problem persists without any disk issues. The bug is not that the broker tries to access a file after the file has been deleted, but that the old segment is kept open while it is being renamed - which can cause problems on any OS platform. Below is the javadoc for the Log#replaceSegments method:

{{ * The sequence of operations is:}}
{{ * <ol>}}
{{ * <li> Cleaner creates new segment with suffix .cleaned and invokes replaceSegments().}}
{{ * If broker crashes at this point, the clean-and-swap operation is aborted and}}
{{ * the .cleaned file is deleted on recovery in loadSegments()}}
{{ * <li> New segment is renamed .swap. If the broker crashes after this point before the whole}}
{{ * operation is completed, the swap operation is resumed on recovery as described in the next step.}}
{{ * <li> Old segment files are renamed to .deleted and asynchronous delete is scheduled.}}
{{ * If the broker crashes, any .deleted files left behind are deleted on recovery in loadSegments().}}
{{ * replaceSegments() is then invoked to complete the swap with newSegment recreated from}}
{{ * the .swap file and oldSegments containing segments which were not renamed before the crash.}}
{{ * <li> Swap segment is renamed to replace the existing segment, completing this operation.}}
{{ * If the broker crashes, any .deleted files which may be left behind are deleted}}
{{ * on recovery in loadSegments().}}
{{ * </ol>}}

Before calling asyncDeleteSegment(seg), the old segments are still open, and this will cause a FileSystemException. Also, before the following lines are called, the new segment is still open, so it will cause problems as well:

{{ // okay we are safe now, remove the swap suffix}}
{{ newSegment.changeFileSuffixes(Log.SwapFileSuffix, "")}}

Essentially, we are attempting to delete/rename a file which is already open - I believe this can corrupt things regardless of Windows/Linux/Mac platforms.
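To make the open-file point concrete, here is a minimal standalone sketch (my own illustration, not Kafka code; the object name and segment file names are made up) that opens and memory-maps a file roughly the way a segment's offset index stays mapped, then attempts the same kind of rename that changeFileSuffixes performs:

{code:scala}
import java.io.RandomAccessFile
import java.nio.channels.FileChannel
import java.nio.charset.StandardCharsets
import java.nio.file.{FileSystemException, Files, Paths, StandardCopyOption}

object OpenSegmentRenameRepro extends App {
  // Hypothetical file names, for illustration only.
  val segment = Paths.get("00000000000000000000.log")
  val renamed = Paths.get("00000000000000000000.log.deleted")
  Files.write(segment, "dummy segment bytes".getBytes(StandardCharsets.UTF_8))

  // Open the file and memory-map it, the way a live segment's index
  // remains mapped while the broker is running.
  val raf = new RandomAccessFile(segment.toFile, "rw")
  val mmap = raf.getChannel.map(FileChannel.MapMode.READ_WRITE, 0, raf.length())
  mmap.put(0, 'x'.toByte) // touch the mapping so the file is really in use

  try {
    // On Windows this rename typically fails with a FileSystemException
    // (AccessDeniedException) because the live mapping keeps the file
    // locked; on Linux/macOS the same rename succeeds.
    Files.move(segment, renamed, StandardCopyOption.ATOMIC_MOVE)
    println(s"rename succeeded: $segment -> $renamed")
  } catch {
    case e: FileSystemException =>
      println(s"rename failed while the file is open/mapped: ${e.getMessage}")
  } finally {
    raf.close()
  }
}
{code}

On Linux/macOS the rename goes through because POSIX allows renaming a file that is still open; on Windows the live mapping blocks it, which is consistent with the behaviour reported here. That is why it seems the segment would need to be closed/unmapped before the rename, not only before the asynchronous delete.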
> Broker fails with FATAL Shutdown - log dirs have failed
> --------------------------------------------------------
>
>                 Key: KAFKA-6188
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6188
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, log
>    Affects Versions: 1.0.0, 1.0.1
>         Environment: Windows 10
>            Reporter: Valentina Baljak
>            Priority: Blocker
>              Labels: windows
>         Attachments: kafka_2.10-0.10.2.1.zip, output.txt
>
> Just started with version 1.0.0 after 4-5 months of using 0.10.2.1. The test environment is very simple, with only one producer and one consumer. Initially, everything started fine and stand-alone tests worked as expected. However, running my code, Kafka clients fail after approximately 10 minutes. Kafka won't start after that and fails with the same error. Deleting the logs helps it start again, and then the same problem occurs. Here is the error traceback:
> [2017-11-08 08:21:57,532] INFO Starting log cleanup with a period of 300000 ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,548] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
> [2017-11-08 08:21:57,798] INFO Awaiting socket connections on 0.0.0.0:9092. (kafka.network.Acceptor)
> [2017-11-08 08:21:57,813] INFO [SocketServer brokerId=0] Started 1 acceptor threads (kafka.network.SocketServer)
> [2017-11-08 08:21:57,829] INFO [ExpirationReaper-0-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [ExpirationReaper-0-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
> [2017-11-08 08:21:57,845] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Stopping serving replicas in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaManager broker=0] Partitions are offline due to failure on log directory C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,860] INFO [ReplicaFetcherManager on broker 0] Removed fetcher for partitions (kafka.server.ReplicaFetcherManager)
> [2017-11-08 08:21:57,892] INFO [ReplicaManager broker=0] Broker 0 stopped fetcher for partitions because they are in the failed log dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.server.ReplicaManager)
> [2017-11-08 08:21:57,892] INFO Stopping serving logs in dir C:\Kafka\kafka_2.12-1.0.0\kafka-logs (kafka.log.LogManager)
> [2017-11-08 08:21:57,892] FATAL Shutdown broker because all log dirs in C:\Kafka\kafka_2.12-1.0.0\kafka-logs have failed (kafka.log.LogManager)


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)