Hello,

Our cluster is going down one broker at a time during log cleanup. This started after we completed a full upgrade from 2.10-0.10.2.1.
This is the log we receive:

[2018-05-11 17:12:21,652] WARN [ReplicaFetcher replicaId=1, leaderId=2, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=1, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={__consumer_offsets-28=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-43=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-10=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-25=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-7=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-40=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-37=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-22=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-4=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-34=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-19=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-16=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-49=(offset=0, logStartOffset=0, maxBytes=1048576), supplier_db-0=(offset=39, logStartOffset=0, maxBytes=1048576), __consumer_offset =(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-31=(offset=500, logStartOffset=0, maxBytes=1048576), __consumer_offsets-13=(offset=0, logStartOffset=0, maxBytes=1048576), __consumer_offsets-46=(offset=0, logStartOffset=0, maxBytes=1048576)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=1746351516, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to eg2-pp-ifs-245:9092 (id: 2 rack: null) failed.
        at org.apache.kafka.clients.NetworkClientUtils.awaitReady(NetworkClientUtils.java:70)
        at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:91)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:220)
        at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
        at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:146)
        at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
log4j:ERROR Failed to rename [/logs/log-cleaner.log] to [/logs/log-cleaner.log.2018-05-11-16].
[2018-05-11 17:12:22,323] ERROR Failed to clean up log for __consumer_offsets-31 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-31\00000000000000000000.log.cleaned -> C:\kafka1\__consumer_offsets-31\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.
        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:387)
        at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
        at java.nio.file.Files.move(Files.java:1395)
        at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:697)
        at org.apache.kafka.common.record.FileRecords.renameTo(FileRecords.java:212)
        at kafka.log.LogSegment.changeFileSuffixes(LogSegment.scala:415)
        at kafka.log.Log.replaceSegments(Log.scala:1644)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:535)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
        Suppressed: java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-31\00000000000000000000.log.cleaned -> C:\kafka1\__consumer_offsets-31\00000000000000000000.log.swap: The process cannot access the file because it is being used by another process.
                at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
                at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
                at sun.nio.fs.WindowsFileCopy.move(WindowsFileCopy.java:301)
                at sun.nio.fs.WindowsFileSystemProvider.move(WindowsFileSystemProvider.java:287)
                at java.nio.file.Files.move(Files.java:1395)
                at org.apache.kafka.common.utils.Utils.atomicMoveWithFallback(Utils.java:694)
                ... 12 more
[2018-05-11 17:12:22,339] INFO [ReplicaManager broker=1] Stopping serving replicas in dir C:\kafka1 (kafka.server.ReplicaManager)
[2018-05-11 17:12:22,370] INFO [ReplicaFetcherManager on broker 1] Removed fetcher for partitions __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,supplier_db-0,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,860-0,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40 (kafka.server.ReplicaFetcherManager)
[2018-05-11 17:12:22,370] INFO [ReplicaAlterLogDirsManager on broker 1] Removed fetcher for partitions __consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,supplier_db-0,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,860-0,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40 (kafka.server.ReplicaAlterLogDirsManager)
[2018-05-11 17:12:22,402] INFO [ReplicaManager broker=1] Broker 1 stopped fetcher for partitions
__consumer_offsets-22,__consumer_offsets-30,__consumer_offsets-8,__consumer_offsets-21,__consumer_offsets-4,__consumer_offsets-27,__consumer_offsets-7,__consumer_offsets-9,__consumer_offsets-46,__consumer_offsets-25,__consumer_offsets-35,__consumer_offsets-41,__consumer_offsets-33,__consumer_offsets-23,__consumer_offsets-49,__consumer_offsets-47,__consumer_offsets-16,__consumer_offsets-28,__consumer_offsets-31,__consumer_offsets-36,__consumer_offsets-42,__consumer_offsets-3,__consumer_offsets-18,__consumer_offsets-37,__consumer_offsets-15,__consumer_offsets-24,__consumer_offsets-38,__consumer_offsets-17,__consumer_offsets-48,__consumer_offsets-19,__consumer_offsets-11,__consumer_offsets-13,__consumer_offsets-2,__consumer_offsets-43,__consumer_offsets-6,__consumer_offsets-14,__consumer_offsets-20,__consumer_offsets-0,__consumer_offsets-44,__consumer_offsets-39,supplier_db-0,__consumer_offsets-12,__consumer_offsets-45,__consumer_offsets-1,__consumer_offsets-5,__consumer_offsets-26,860-0,__consumer_offsets-29,__consumer_offsets-34,__consumer_offsets-10,__consumer_offsets-32,__consumer_offsets-40 and stopped moving logs for partitions because they are in the failed log directory C:\kafka1. (kafka.server.ReplicaManager)
[2018-05-11 17:12:22,402] ERROR Failed to clean up log for __consumer_offsets-31 in dir C:\kafka1 due to IOException (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: C:\kafka1\__consumer_offsets-31\00000000000000000000.log.cleaned: The process cannot access the file because it is being used by another process.
        at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:86)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:97)
        at sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:102)
        at sun.nio.fs.WindowsFileSystemProvider.implDelete(WindowsFileSystemProvider.java:269)
        at sun.nio.fs.AbstractFileSystemProvider.deleteIfExists(AbstractFileSystemProvider.java:108)
        at java.nio.file.Files.deleteIfExists(Files.java:1165)
        at kafka.log.Cleaner.deleteCleanedFileIfExists$1(LogCleaner.scala:488)
        at kafka.log.Cleaner.cleanSegments(LogCleaner.scala:493)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:462)
        at kafka.log.Cleaner$$anonfun$doClean$4.apply(LogCleaner.scala:461)
        at scala.collection.immutable.List.foreach(List.scala:392)
        at kafka.log.Cleaner.doClean(LogCleaner.scala:461)
        at kafka.log.Cleaner.clean(LogCleaner.scala:438)
        at kafka.log.LogCleaner$CleanerThread.cleanOrSleep(LogCleaner.scala:305)
        at kafka.log.LogCleaner$CleanerThread.doWork(LogCleaner.scala:291)
        at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
[2018-05-11 17:12:22,402] INFO Stopping serving logs in dir C:\kafka1 (kafka.log.LogManager)
[2018-05-11 17:12:22,417] ERROR Shutdown broker because all log dirs in C:\kafka1 have failed (kafka.log.LogManager)

The only alternative we see is to keep the logs and offsets valid indefinitely, but that would fill up our server drives, which is not desirable. I can see that there was a KIP that enforced an atomic move of log files to avoid exactly this kind of issue on Windows; here is the pull request: https://github.com/apache/kafka/pull/771. It looks like an improvement, but not necessarily a guaranteed fix for Windows; I have sketched my reading of it in the P.S. below.

Could someone please advise what to do here? Migrating to Linux is not possible for us at the moment, so any workaround is highly appreciated.

Regards,
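P.S. To explain why I suspect the atomic move alone cannot save us on Windows: my rough understanding of the rename pattern behind that pull request, based only on the Utils.atomicMoveWithFallback frames and the "Suppressed:" block in the stack trace above (not on the actual Kafka source), is something like the Java sketch below. The paths are simply the ones from our log; class and method names here are my own for illustration.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class AtomicMoveSketch {

        // Sketch of the "atomic move with fallback" idea: try an atomic
        // rename first and, if that fails, fall back to a plain replace,
        // keeping the first failure as a suppressed exception (which is
        // what the "Suppressed:" block in our log shows).
        static void atomicMoveWithFallback(Path source, Path target) throws IOException {
            try {
                Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
            } catch (IOException outer) {
                try {
                    Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
                } catch (IOException inner) {
                    inner.addSuppressed(outer);
                    throw inner;
                }
            }
        }

        public static void main(String[] args) throws IOException {
            // The same .cleaned -> .swap rename that fails in our log.
            Path cleaned = Paths.get("C:\\kafka1\\__consumer_offsets-31\\00000000000000000000.log.cleaned");
            Path swap = Paths.get("C:\\kafka1\\__consumer_offsets-31\\00000000000000000000.log.swap");
            // On Windows both attempts throw a FileSystemException ("being
            // used by another process") while another handle or memory
            // mapping is still open on the source file, which appears to be
            // what happens during log cleaning.
            atomicMoveWithFallback(cleaned, swap);
        }
    }

If that reading is right, no rename strategy helps while the old segment files are still open, which is why we are looking for an operational workaround rather than waiting for a code fix.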